6 |
7 | {{ end }}
--------------------------------------------------------------------------------
/content/Proposals/Accepted/README.md:
--------------------------------------------------------------------------------
1 | # Accepted
2 |
3 | This is a list of accepted proposals. This means the proposal was accepted but not yet implemented.
4 |
5 | ## Internal Accepted Proposals
6 |
7 | * ...
8 |
--------------------------------------------------------------------------------
/content/Proposals/Rejected/README.md:
--------------------------------------------------------------------------------
1 | # Rejected
2 |
3 | This is a list of rejected proposals.
4 |
5 | > NOTE: This does not mean we can return to them and accept!
6 |
7 | ## Internal Rejected Proposals
8 |
9 | * ...
10 |
--------------------------------------------------------------------------------
/content/Services/RHOBS/consumption-billing.md:
--------------------------------------------------------------------------------
1 | # Consumption Billing
2 |
3 | Consumption billing is done by Red Hat SubWatch. Data is periodically pulled from the RHOBS metrics query APIs. If you are an internal Red Hat Managed Service, contact the SubWatch team to learn how you can use consumption billing.
4 |
--------------------------------------------------------------------------------
/.bingo/hugo.mod:
--------------------------------------------------------------------------------
1 | module _ // Auto generated by https://github.com/bwplotka/bingo. DO NOT EDIT
2 |
3 | go 1.16
4 |
5 | // TODO(bwplotka): Only go 1.14 is on Netlify image,
6 | // so hugo can't be newer (it depends on io/fs from Go 1.16).
7 |
8 | require github.com/gohugoio/hugo v0.80.0 // CGO_ENABLED=1 -tags=extended
9 |
--------------------------------------------------------------------------------
/content/Products/OpenshiftMonitoring/README.md:
--------------------------------------------------------------------------------
1 | # OpenShift Cluster Monitoring
2 |
3 | OpenShift Monitoring is composed of Platform Monitoring and User Workload Monitoring.
4 |
5 | The official [OpenShift documentation](https://docs.openshift.com/container-platform/latest/monitoring/monitoring-overview.html#understanding-the-monitoring-stack_monitoring-overview) contains all user-facing information such as usage and configuration.
6 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Binaries for programs and plugins
2 | *.exe
3 | *.exe~
4 | *.dll
5 | *.so
6 | *.dylib
7 |
8 | # Test binary, built with `go test -c`
9 | *.test
10 |
11 | # Output of the go coverage tool, specifically when used with LiteIDE
12 | *.out
13 | .idea
14 | .bin
15 | .envrc
16 | web/node_modules
17 | web/resources/_gen
18 | web/public
19 |
20 | # Dependency directories (remove the comment below to include it)
21 | # vendor/
22 |
--------------------------------------------------------------------------------
/content/Teams/observability-platform/README.md:
--------------------------------------------------------------------------------
1 | # Observability Platform
2 |
3 | We are the team mainly responsible for:
4 |
5 | * [RHOBS](../../Services/RHOBS) including:
6 | * [Telemeter](../../Services/RHOBS/use-cases/telemetry.md)
7 | * [MST tenants](../../Services/RHOBS/use-cases/observability.md).
8 | * [Thanos](../../Projects/Observability/thanos.md).
9 | * [Observatorium](../../Projects/Observability/observatorium.md)
10 |
--------------------------------------------------------------------------------
/web/assets/favicons/_head.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
--------------------------------------------------------------------------------
/web/assets/scss/_variables_project.scss:
--------------------------------------------------------------------------------
1 | $primary: #2B388F;
2 |
3 | .td-sidebar-nav {
4 | font-size: 90% !important;
5 | }
6 |
7 | .td-max-width-on-larger-screens, .td-content > pre, .td-content > .highlight, .td-content > .lead, .td-content > h1, .td-content > h2, .td-content > ul, .td-content > ol, .td-content > p, .td-content > blockquote, .td-content > dl dd, .td-content .footnotes, .td-content > .alert {
8 | max-width: 95% !important;
9 | }
10 |
--------------------------------------------------------------------------------
/web/assets/favicons/browserconfig.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 | transparent
10 |
11 |
12 |
--------------------------------------------------------------------------------
/web/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "web",
3 | "version": "0.1.0",
4 | "description": "I don't know what I do, glad Prem Saraswat is there to help me out",
5 | "main": "index.js",
6 | "scripts": {
7 | "test": "echo \"Error: no test specified\" && exit 1"
8 | },
9 | "author": "me",
10 | "license": "Apache-2.0",
11 | "dependencies": {
12 | "autoprefixer": "^10.2.5",
13 | "postcss": "^8.2.15",
14 | "postcss-cli": "^8.3.1"
15 | }
16 | }
17 |
--------------------------------------------------------------------------------
/content/Products/OpenshiftMonitoring/dashboards.md:
--------------------------------------------------------------------------------
1 | # Dashboards
2 |
3 | ## Targeted audience
4 |
5 | This document is intended for OpenShift developers who want to add visualization dashboards for their operators and operands in the OCP administrator console.
6 |
7 | ## Getting started
8 |
9 | Please refer to the [document](https://docs.google.com/document/d/1UwHwkL-YtrRJYm-A922IeW3wvKEgCR-epeeeh3CBOGs/edit) written by the Observability UI team.
10 |
11 | The team can also be found in the `#forum-observability-ui` Slack channel.
12 |
--------------------------------------------------------------------------------
/netlify.toml:
--------------------------------------------------------------------------------
1 | [build]
2 | publish = "web/public"
3 | functions = "functions"
4 |
5 | [build.environment]
6 | # HUGO_VERSION = "..." is set by bingo which allows reproducible local environment.
7 | NODE_VERSION = "15.5.1"
8 | NPM_VERSION = "7.3.0"
9 |
10 | [context.production]
11 | command = "make web WEBSITE_BASE_URL=${URL}"
12 |
13 | [context.deploy-preview]
14 | command = "make web WEBSITE_BASE_URL=${DEPLOY_PRIME_URL}"
15 |
16 | [context.branch-deploy]
17 | command = "make web WEBSITE_BASE_URL=${DEPLOY_PRIME_URL}"
18 |
--------------------------------------------------------------------------------
/.bingo/variables.env:
--------------------------------------------------------------------------------
1 | # Auto generated binary variables helper managed by https://github.com/bwplotka/bingo v0.5.1. DO NOT EDIT.
2 | # All tools are designed to be build inside $GOBIN.
3 | # Those variables will work only until 'bingo get' was invoked, or if tools were installed via Makefile's Variables.mk.
4 | GOBIN=${GOBIN:=$(go env GOBIN)}
5 |
6 | if [ -z "$GOBIN" ]; then
7 | GOBIN="$(go env GOPATH)/bin"
8 | fi
9 |
10 |
11 | BINGO="${GOBIN}/bingo-v0.5.1"
12 |
13 | HUGO="${GOBIN}/hugo-v0.80.0"
14 |
15 | MDOX="${GOBIN}/mdox-v0.9.0"
16 |
17 |
--------------------------------------------------------------------------------
/content/Projects/Observability/prometheus.md:
--------------------------------------------------------------------------------
1 | # Prometheus
2 |
3 | Prometheus is a monitoring and alerting system which collects and stores metrics. In the broader sense, it is a collection of tools including, but not limited to, Alertmanager and node_exporter.
4 |
5 | ## Links
6 |
7 | * [GitHub project](https://github.com/prometheus/prometheus)
8 | * [Upstream documentation](https://prometheus.io/docs/)
9 | * [Community](https://prometheus.io/community/)
10 | * [Maintainers](https://github.com/prometheus/prometheus/blob/main/MAINTAINERS.md)
11 |
--------------------------------------------------------------------------------
/content/Projects/Observability/prometheusOp.md:
--------------------------------------------------------------------------------
1 | # Prometheus Operator
2 |
3 | The Prometheus Operator is a Kubernetes Operator which manages Prometheus, Alertmanager and ThanosRuler deployments.
4 |
5 | ## Links
6 |
7 | * [GitHub project](https://github.com/prometheus-operator/prometheus-operator)
8 | * [Upstream documentation](https://prometheus-operator.dev/)
9 | * [Contributing guide](https://prometheus-operator.dev/docs/community/contributing/)
10 | * [Maintainers](https://github.com/prometheus-operator/prometheus-operator/blob/main/MAINTAINERS.md)
11 |
--------------------------------------------------------------------------------
/content/Products/Observability-Operator/README.md:
--------------------------------------------------------------------------------
1 | # Observability Operator
2 |
3 | [Observability Operator](https://github.com/rhobs/observability-operator) is a Kubernetes operator written in Go to set up and manage multiple highly available monitoring stacks using Prometheus and Thanos Querier.
4 |
5 | Eventually, this operator may also cover Logging and Tracing.
6 |
7 | The project relies heavily on the [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) library.
8 |
9 | > More details to follow in this document
10 |
11 | Check the observability-operator's [README](https://github.com/rhobs/observability-operator#readme) page for more details.
12 |
--------------------------------------------------------------------------------
/content/Projects/prometheus-api-client-python.md:
--------------------------------------------------------------------------------
1 | # prometheus-api-client-python
2 |
3 | This is a Python wrapper for the Prometheus API. It also includes some tools for processing metric data using pandas DataFrames.
4 |
5 | ### Installation
6 |
7 | To install the latest release:
8 |
9 | `pip install prometheus-api-client`
10 |
11 | ### Links:
12 | * [Source](https://github.com/aicoe/prometheus-api-client-python)
13 | * [Usage Examples](https://github.com/aicoe/prometheus-api-client-python#usage)
14 | * [API Documentation](https://prometheus-api-client-python.readthedocs.io/en/master/source/prometheus_api_client.html)
15 | * [Slack](https://operatefirst.slack.com/archives/C01S2707XKM)
16 |
--------------------------------------------------------------------------------
/.github/workflows/docs.yaml:
--------------------------------------------------------------------------------
1 | name: docs
2 |
3 | on:
4 | push:
5 | branches:
6 | - main
7 | tags:
8 | pull_request:
9 |
10 | jobs:
11 | lint:
12 | runs-on: ubuntu-latest
13 | name: Docs formatting and link checking.
14 | steps:
15 | - name: Checkout code into the Go module directory.
16 | uses: actions/checkout@v2
17 |
18 | - name: Install Go
19 | uses: actions/setup-go@v2
20 | with:
21 | go-version: 1.15.x
22 |
23 | - uses: actions/cache@v4
24 | with:
25 | path: ~/go/pkg/mod
26 | key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
27 |
28 | - name: Formatting docs.
29 | env:
30 | GOBIN: /tmp/.bin
31 | run: make docs-check
32 |
--------------------------------------------------------------------------------
/.bingo/README.md:
--------------------------------------------------------------------------------
1 | # Project Development Dependencies.
2 |
3 | This is a directory which stores Go modules with pinned buildable packages that are used within this repository, managed by https://github.com/bwplotka/bingo.
4 |
5 | * Run `bingo get` to install all tools, each having its own module file in this directory.
6 | * Run `bingo get <tool>` to install the given tool that has its own module file in this directory.
7 | * For Makefile: Make sure to put `include .bingo/Variables.mk` in your Makefile, then use the `$(TOOL)` variable, where `TOOL` is the upper-case tool name and the tool's module file is `.bingo/<tool>.mod`.
8 | * For shell: Run `source .bingo/variables.env` to source all environment variables for each tool.
9 | * For Go: Import `.bingo/variables.go` for variable names.
10 | * See https://github.com/bwplotka/bingo or run `bingo -h` to learn how to add, remove or change binary dependencies.
11 |
12 | ## Requirements
13 |
14 | * Go 1.14+
15 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Handbook
2 |
3 | 
4 |
5 | [Netlify Status](https://app.netlify.com/sites/rhobs-handbook/deploys)
6 |
7 | *Welcome to the Red Hat Monitoring Group Handbook*
8 |
9 | ## Goals
10 |
11 | For now, check the [Handbook Proposal](content/Proposals/Accepted/202106-handbook.md).
12 |
13 | ## Using
14 |
15 | This handbook was designed to be used in two ways:
16 |
17 | * Through our website `https://rhobs-handbook.netlify.app/`.
18 | * Directly from GitHub UI by navigating to [`content`](https://github.com/rhobs/handbook/tree/main/content) directory.
19 |
20 | ## Contributing
21 |
22 | Anyone is welcome to read and contribute to our handbook! It's licensed under Apache 2.0, and any team member, any other Red Hat employee, or even anyone outside Red Hat is welcome to help us refine our processes, update documentation and FAQs.
23 |
24 | If you were inspired by this handbook, let us know too!
25 |
26 | See details on how to contribute in our [Contributing Guide](content/contributing.md).
27 |
28 | ## Initial Author
29 |
30 | [`@bwplotka`](https://bwplotka.dev)
31 |
--------------------------------------------------------------------------------
/content/Services/RHOBS/use-cases/observability.md:
--------------------------------------------------------------------------------
1 | # MST
2 |
3 | > For RHOBS Overview see [this document](README.md)
4 |
5 | TBD(https://github.com/rhobs/handbook/issues/23)
6 |
7 | ## Support
8 |
9 | TBD(https://github.com/rhobs/handbook/issues/23)
10 |
11 | ### Escalations
12 |
13 | TBD(https://github.com/rhobs/handbook/issues/23)
14 |
15 | ## Service Level Agreement
16 |
17 | 
18 |
19 | If you manage [“Observatorium”](../../../Projects/Observability/observatorium.md), the Service Level Objectives can go ultra-high in all dimensions, such as availability and data loss. The freshness aspect of the read APIs is trickier, as it also depends on, among other things, the availability of the client's collection pipeline, which is out of scope for Observatorium.
20 |
21 | RHOBS has currently established the following default Service Level Objectives. This is based on the infrastructure dependencies we have listed [here (internal)](https://visual-app-interface.devshift.net/services#/services/rhobs/app.yml).
22 |
23 | > NOTE: We envision future improvements to the quality of service by offering additional SLO tiers for different tenants. Currently, all tenants have the same SLOs.
24 |
25 | > Previous docs (internal):
26 | > * [2019-10-30](https://docs.google.com/document/d/1LN-3yDtXmiDmGi5ZwllklJCg3jx-4ysNv6oUZudFj2g/edit#heading=h.20e6cn146nls)
27 | > * [2021-02-10](https://docs.google.com/document/d/1iGRsFMR9YmWG8Mk95UXU_PAUKvk1_zyNUkevbk7ZnFw/edit#heading=h.bupciudrwmna)
28 |
29 | TBD(https://github.com/rhobs/handbook/issues/23)
30 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | include .bingo/Variables.mk
2 |
3 | WEBSITE_DIR ?= web
4 | WEBSITE_BASE_URL ?= https://rhobs-handbook.netlify.app/
5 | MD_FILES_TO_FORMAT=$(shell find content -name "*.md") README.md
6 |
7 | help: ## Displays help.
8 | @awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n make \033[36m\033[0m\n\nTargets:\n"} /^[a-zA-Z_-]+:.*?##/ { printf " \033[36m%-10s\033[0m %s\n", $$1, $$2 }' $(MAKEFILE_LIST)
9 |
10 | all: docs web
11 |
12 | .PHONY: docs
13 | docs: ## Format docs, localise links, ensure GitHub format.
14 | docs: $(MDOX)
15 | $(MDOX) fmt --links.localize.address-regex="https://rhobs-handbook.netlify.app/.*" *.md $(MD_FILES_TO_FORMAT)
16 |
17 | .PHONY: docs-check
18 | docs-check: ## Checks if docs are localised and links are correct.
19 | docs-check: $(MDOX)
20 | $(MDOX) fmt --check \
21 | -l --links.validate.config-file=.mdox.validator.yaml --links.localize.address-regex="https://rhobs-handbook.netlify.app/.*" *.md $(MD_FILES_TO_FORMAT)
22 |
23 | .PHONY: web-pre
24 | web-pre: ## Pre process docs using mdox transform which converts it from GitHub structure to Hugo one.
25 | web-pre: $(MDOX)
26 | @rm -rf $(WEBSITE_DIR)/content # Do it in mdox itself.
27 | $(MDOX) transform --log.level=debug --config-file=.mdox.yaml
28 |
29 | $(WEBSITE_DIR)/node_modules:
30 | @git submodule update --init --recursive
31 | cd $(WEBSITE_DIR)/themes/docsy/ && npm install && rm -rf content
32 |
33 | .PHONY: web
34 | web: ## Build website in publish directory.
35 | web: $(WEBSITE_DIR)/node_modules $(HUGO) web-pre
36 | cd $(WEBSITE_DIR) && $(HUGO) -b $(WEBSITE_BASE_URL)
37 |
38 | .PHONY: web-serve
39 | web-serve: ## Build website and run in foreground process in localhost:1313
40 | web-serve: $(WEBSITE_DIR)/node_modules $(HUGO) web-pre
41 | cd $(WEBSITE_DIR) && $(HUGO) serve
42 |
--------------------------------------------------------------------------------
/content/Services/RHOBS/analytics.md:
--------------------------------------------------------------------------------
1 | # Analytics based on Observability Data
2 |
3 | We use the [`Apache Parquet`](https://parquet.apache.org/) format to persist metrics data for RHOBS analytics, so that it can be consumed by analytics tools such as [Amazon Redshift](https://aws.amazon.com/redshift/).
4 |
5 | Red Hat maintains a community-driven project that can transform Prometheus data into Parquet files; see the [Obslytics project](https://github.com/thanos-community/obslytics) to learn more. It works by consuming the (currently internal) RHOBS Store API.
6 |
7 | If you wish to use this tool against RHOBS, contact us to discuss this feature request or add a ticket to our `MON` JIRA project.
8 |
9 | ## Existing Production Pipelines
10 |
11 | ### Telemeter Analytics
12 |
13 | The DataHub and CCX teams are currently leveraging such an analytics pipeline based on [Telemeter](use-cases/telemetry.md). The pipeline looks as follows:
14 |
15 | 
16 |
17 | *[Source](https://docs.google.com/drawings/d/19Z0vtVjlvU_p6aU3hn6PXdN0AWbd2FA4RAC0dT3hU_w/edit)*
18 |
19 | 1. DataHub runs the [Thanos Replicate](https://thanos.io/tip/components/tools.md/#bucket-replicate) tool in their PSI cluster, which copies all fresh 2h blocks from RHOBS once they appear in the RHOBS S3 object storage (~3h after ingestion).
20 | 2. Replicated blocks are copied to DataHub's Ceph object storage and stored for an unlimited time.
21 | 3. DataHub runs a smaller Thanos system. For this part of the pipeline, only the Thanos Store Gateway is needed to provide the Store API for the subsequent Obslytics step.
22 | 4. An instance of [Obslytics](https://github.com/thanos-community/obslytics) transforms the Thanos metrics format to [`Apache Parquet`](https://parquet.apache.org/) and saves it in Ceph object storage.
23 | 5. The Parquet data is then imported into [Amazon Redshift](https://aws.amazon.com/redshift/) and joined with other data sources.
24 |
--------------------------------------------------------------------------------
/.bingo/Variables.mk:
--------------------------------------------------------------------------------
1 | # Auto generated binary variables helper managed by https://github.com/bwplotka/bingo v0.5.1. DO NOT EDIT.
2 | # All tools are designed to be build inside $GOBIN.
3 | BINGO_DIR := $(dir $(lastword $(MAKEFILE_LIST)))
4 | GOPATH ?= $(shell go env GOPATH)
5 | GOBIN ?= $(firstword $(subst :, ,${GOPATH}))/bin
6 | GO ?= $(shell which go)
7 |
8 | # Below generated variables ensure that every time a tool under each variable is invoked, the correct version
9 | # will be used; reinstalling only if needed.
10 | # For example for bingo variable:
11 | #
12 | # In your main Makefile (for non array binaries):
13 | #
14 | #include .bingo/Variables.mk # Assuming -dir was set to .bingo .
15 | #
16 | #command: $(BINGO)
17 | # @echo "Running bingo"
18 | # @$(BINGO)
19 | #
20 | BINGO := $(GOBIN)/bingo-v0.5.1
21 | $(BINGO): $(BINGO_DIR)/bingo.mod
22 | @# Install binary/ries using Go 1.14+ build command. This is using bwplotka/bingo-controlled, separate go module with pinned dependencies.
23 | @echo "(re)installing $(GOBIN)/bingo-v0.5.1"
24 | @cd $(BINGO_DIR) && $(GO) build -mod=mod -modfile=bingo.mod -o=$(GOBIN)/bingo-v0.5.1 "github.com/bwplotka/bingo"
25 |
26 | HUGO := $(GOBIN)/hugo-v0.80.0
27 | $(HUGO): $(BINGO_DIR)/hugo.mod
28 | @# Install binary/ries using Go 1.14+ build command. This is using bwplotka/bingo-controlled, separate go module with pinned dependencies.
29 | @echo "(re)installing $(GOBIN)/hugo-v0.80.0"
30 | @cd $(BINGO_DIR) && CGO_ENABLED=1 $(GO) build -tags=extended -mod=mod -modfile=hugo.mod -o=$(GOBIN)/hugo-v0.80.0 "github.com/gohugoio/hugo"
31 |
32 | MDOX := $(GOBIN)/mdox-v0.9.0
33 | $(MDOX): $(BINGO_DIR)/mdox.mod
34 | @# Install binary/ries using Go 1.14+ build command. This is using bwplotka/bingo-controlled, separate go module with pinned dependencies.
35 | @echo "(re)installing $(GOBIN)/mdox-v0.9.0"
36 | @cd $(BINGO_DIR) && $(GO) build -mod=mod -modfile=mdox.mod -o=$(GOBIN)/mdox-v0.9.0 "github.com/bwplotka/mdox"
37 |
38 |
--------------------------------------------------------------------------------
/content/Projects/Observability/observatorium.md:
--------------------------------------------------------------------------------
1 | # Observatorium
2 |
3 | Observatorium is an observability system designed to provide ingestion, storage (short and long term) and querying capabilities for three major observability signals: metrics, logging and tracing. It unifies horizontally scalable, multi-tenant systems like Thanos, Loki and, in the future, Jaeger, deploying them as a single stack with consistent APIs. On top of that, it's designed to be managed as a service thanks to consistent tenancy, authorization and rate limiting across all three signals.
4 |
5 | ### Official Documentation
6 |
7 | https://observatorium.io
8 |
9 | ### APIs
10 |
11 | TBD(https://github.com/rhobs/handbook/issues/22)
12 |
13 | #### Read: Metrics
14 |
15 | * GET /api/metrics/v1/api/v1/query
16 | * GET /api/metrics/v1/api/v1/query_range
17 | * GET /api/metrics/v1/api/v1/series
18 | * GET /api/metrics/v1/api/v1/labels
19 | * GET /api/metrics/v1/api/v1/label/{label_name}/values
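
A minimal sketch of calling the metrics read API from Go. The base URL, the tenant path prefix and the bearer token below are placeholders/assumptions; the real values and the exact authentication flow depend on the Observatorium deployment and tenant configuration:

```golang
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// Assumption: a tenant-scoped metrics endpoint, e.g. ".../api/metrics/v1/<tenant>".
	base := "https://observatorium.example.com/api/metrics/v1/example-tenant"
	// Assumption: a valid bearer token obtained for this tenant (e.g. via OIDC).
	token := os.Getenv("OBSERVATORIUM_TOKEN")

	// Instant query, equivalent to GET <base>/api/v1/query?query=up.
	q := url.Values{"query": {"up"}}
	req, err := http.NewRequest(http.MethodGet, base+"/api/v1/query?"+q.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body)) // Prometheus-compatible JSON response.
}
```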
20 |
21 | ### Notable Talks/Blog Posts
22 |
23 | * 04.2021: [Upstream-First High Scale Prometheus Ecosystem](https://www.youtube.com/watch?v=r0fRFH_921E&list=PLj6h78yzYM2PZb0QuIkm6ZY-xTuNA5zRO&index=6)
24 |
25 | ### Bug Trackers
26 |
27 | https://github.com/observatorium/observatorium/issues
28 |
29 | ### Communication Channels
30 |
31 | The CNCF Slack workspace's ([join here](https://cloud-native.slack.com/messages/CHY2THYUU)) channels:
32 |
33 | * `#observatorium` for user related things.
34 | * `#observatorium-dev` for developer related things.
35 |
36 | ### Proposal Process
37 |
38 | TBD
39 |
40 | ### Our Usage
41 |
42 | We use Observatorium as a Service for our [Red Hat Observability Service (RHOBS)](../../Services/RHOBS).
43 |
44 | We also know of several other companies installing Observatorium on their own (as of 2021.07.07):
45 |
46 | * RHACM
47 | * Managed Tenants until they can get access to RHOBS (e.g. Kafka Team)
48 | * IBM
49 | * GitPod
50 |
51 | ### Maintainers
52 |
53 | https://github.com/observatorium/observatorium/blob/main/docs/community/maintainers.md
54 |
--------------------------------------------------------------------------------
/content/Projects/Observability/kubestatemetrics.md:
--------------------------------------------------------------------------------
1 | # Kube State Metrics
2 |
3 | `kube-state-metrics` (KSM) is a service that listens to the Kubernetes API server and generates metrics about the state of the objects. It's an add-on agent to generate and expose cluster-level metrics.
4 |
5 | ### Official Documentation
6 |
7 | - Overview: [`README.md`](https://github.com/kubernetes/kube-state-metrics/tree/main/docs)
8 | - Resource-wise documentation: [`/docs`](https://github.com/kubernetes/kube-state-metrics/tree/main/docs)
9 | - Design documentation: [`/docs/design`](https://github.com/kubernetes/kube-state-metrics/tree/main/docs/design)
10 | - Developer documentation: [`/docs/developer`](https://github.com/kubernetes/kube-state-metrics/tree/main/docs/developer)
11 |
12 | ### Informational Media
13 |
14 | - [PromCon 2017: Lightning Talk - `kube-state-metrics` - Frederic Branczyk](https://www.youtube.com/watch?v=nUkHeY48mIQ)
15 | - [Intro and Deep Dive: Kubernetes SIG Instrumentation - David Ashpole & Han Kang, Frederic Branczyk](https://youtu.be/NzoG--2UqEk?t=888)
16 | - [Episode 38: Custom Resources in `kube-state-metrics`](https://www.youtube.com/watch?v=rkaG4M5mo-8)
17 | - [`kube-state-metrics` on Google Cloud](https://cloud.google.com/stackdriver/docs/managed-prometheus/exporters/kube_state_metrics)
18 |
19 | ### Bug Trackers
20 |
21 | - [`/issues`](https://github.com/kubernetes/kube-state-metrics/issues)
22 | - [`issues.redhat.com`](https://issues.redhat.com/browse/MON-2858?jql=project%20%3D%20MON%20AND%20issuetype%20in%20(Bug%2C%20Epic%2C%20Story%2C%20Task%2C%20Sub-task)%20AND%20resolution%20%3D%20Unresolved%20AND%20text%20~%20%22kube-state-metrics%22%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC)
23 |
24 | ### Get Involved
25 |
26 | - [`#kube-state-metrics`](https://kubernetes.slack.com/archives/CJJ529RUY)
27 | - [`#sig-instrumentation`](https://kubernetes.slack.com/archives/C20HH14P7)
28 | - [SIG Instrumentation Biweekly Minutes](https://docs.google.com/document/d/1FE4AQ8B49fYbKhfg4Tx0cui1V0eI4o3PxoqQPUwNEiU)
29 |
30 | ### Internal Usages
31 |
32 | - [`cluster-monitoring-operator/assets/kube-state-metrics/`](https://github.com/openshift/cluster-monitoring-operator/tree/master/assets/kube-state-metrics)
33 | - [`openshift-state-metrics/`](https://github.com/openshift/openshift-state-metrics)
34 |
35 | ### Maintainers
36 |
37 | - [`/OWNERS.md`](https://github.com/kubernetes/kube-state-metrics/blob/main/OWNERS)
38 |
39 | ### Miscellaneous
40 |
41 | - [`/releases`](https://github.com/kubernetes/kube-state-metrics/releases)
42 |
--------------------------------------------------------------------------------
/content/Projects/Observability/thanos.md:
--------------------------------------------------------------------------------
1 | # Thanos
2 |
3 | Thanos is a horizontally scalable, multi-tenant monitoring system in the form of a distributed time series database that supports the Prometheus data format.
4 |
5 | ### Official Documentation
6 |
7 | https://thanos.io/tip/thanos/getting-started.md
8 |
9 | ### APIs
10 |
11 | * Querying: Prometheus APIs, Remote Read
12 | * Series: Prometheus APIs, gRPC SeriesAPI
13 | * Metric Metadata: Prometheus API, gRPC MetricMetadataAPI
14 | * Rules, Alerts: Prometheus API, gRPC RulesAPI
15 | * Targets: Prometheus API, gRPC TargetsAPI
16 | * Exemplars: Prometheus API, gRPC ExemplarsAPI
17 | * Receiving: Prometheus Remote Write
18 |
19 | ### Tutorials
20 |
21 | https://katacoda.com/thanos
22 |
23 | ### Notable Talks/Blog Posts
24 |
25 | * 12.2020: [Absorbing Thanos Infinite Powers for Multi-Cluster Telemetry](https://www.youtube.com/watch?v=6Nx2BFyr7qQ)
26 | * 12.2020: [Turn It Up to a Million: Ingesting Millions of Metrics with Thanos Receive](https://www.youtube.com/watch?v=5MJqdJq41Ms)
27 | * 02.2019: [FOSDEM + demo](https://fosdem.org/2019/schedule/event/thanos_transforming_prometheus_to_a_global_scale_in_a_seven_simple_steps/)
28 | * 03.2019: [Alibaba Cloud user story](https://www.youtube.com/watch?v=ZS6zMksfipc)
29 | * [CloudNative Deep Dive](https://www.youtube.com/watch?v=qQN0N14HXPM)
30 | * [CloudNative Intro](https://www.youtube.com/watch?v=m0JgWlTc60Q)
31 | * [Prometheus in Practice: HA with Thanos](https://www.slideshare.net/ThomasRiley45/prometheus-in-practice-high-availability-with-thanos-devopsdays-edinburgh-2019)
32 |
33 | * [Banzai Cloud user story](https://banzaicloud.com/blog/multi-cluster-monitoring/)
34 |
35 | ### Bug Trackers
36 |
37 | https://github.com/thanos-io/thanos/issues
38 |
39 | ### Communication Channels
40 |
41 | The CNCF Slack workspace's ([join here](https://cloud-native.slack.com/messages/CHY2THYUU)) channels:
42 |
43 | * `#thanos` for user related things.
44 | * `#thanos-dev` for developer related things.
45 |
46 | ### Proposal Process
47 |
48 | https://thanos.io/tip/contributing/contributing.md/#adding-new-features--components
49 |
50 | ### Our Usage
51 |
52 | We use Thanos in many places within Red Hat, notably:
53 |
54 | * In [Prometheus Operator (sidecar)](prometheusOp.md)
55 | * In OpenShift Platform Monitoring (PM) (see [CMO](../../Products/OpenshiftMonitoring))
56 | * In OpenShift User Workload Monitoring (UWM)
57 | * In [RHOBS](../../Services/RHOBS) (so [Observatorium](observatorium.md))
58 |
59 | ### Maintainers
60 |
61 | https://thanos.io/tip/thanos/maintainers.md/#core-maintainers-of-this-repository
62 |
--------------------------------------------------------------------------------
/.mdox.yaml:
--------------------------------------------------------------------------------
1 | version: 1
2 |
3 | inputDir: "content"
4 | outputDir: "web/content/en"
5 | extraInputGlobs:
6 | - "README.md"
7 | - "web/assets/favicons"
8 |
9 | gitIgnored: true
10 | localLinksStyle:
11 | hugo:
12 | indexFileName: "_index.md"
13 |
14 | transformations:
15 | - glob: "../README.md"
16 | path: /_index.md
17 | frontMatter:
18 | template: |
19 | title: "{{ .Origin.FirstHeader }}"
20 | lastmod: "{{ .Origin.LastMod }}"
21 | menu:
22 | main:
23 | weight: 1
24 | pre:
25 | cascade:
26 | - type: "docs"
27 | _target:
28 | path: "/**"
29 |
30 | - glob: "Proposals/README.md"
31 | path: _index.md
32 | frontMatter:
33 | template: |
34 | title: "{{ .Origin.FirstHeader }}"
35 | lastmod: "{{ .Origin.LastMod }}"
36 | menu:
37 | main:
38 | weight: 2
39 | pre:
40 |
41 | - glob: "Products/OpenshiftMonitoring/instrumentation.md"
42 | frontMatter:
43 | template: |
44 | title: "{{ .Origin.FirstHeader }}"
45 | lastmod: "{{ .Origin.LastMod }}"
46 | weight: 5
47 | - glob: "Products/OpenshiftMonitoring/collecting_metrics.md"
48 | frontMatter:
49 | template: |
50 | title: "{{ .Origin.FirstHeader }}"
51 | lastmod: "{{ .Origin.LastMod }}"
52 | weight: 10
53 | - glob: "Products/OpenshiftMonitoring/alerting.md"
54 | frontMatter:
55 | template: |
56 | title: "{{ .Origin.FirstHeader }}"
57 | lastmod: "{{ .Origin.LastMod }}"
58 | weight: 20
59 | - glob: "Products/OpenshiftMonitoring/dashboards.md"
60 | frontMatter:
61 | template: |
62 | title: "{{ .Origin.FirstHeader }}"
63 | lastmod: "{{ .Origin.LastMod }}"
64 | weight: 25
65 | - glob: "Products/OpenshiftMonitoring/telemetry.md"
66 | frontMatter:
67 | template: |
68 | title: "{{ .Origin.FirstHeader }}"
69 | lastmod: "{{ .Origin.LastMod }}"
70 | weight: 30
71 | - glob: "Products/OpenshiftMonitoring/faq.md"
72 | frontMatter:
73 | template: |
74 | title: "{{ .Origin.FirstHeader }}"
75 | lastmod: "{{ .Origin.LastMod }}"
76 | weight: 40
77 |
78 | - glob: "**/README.md"
79 | path: _index.md
80 | frontMatter: &defaultFrontMatter
81 | template: |
82 | title: "{{ .Origin.FirstHeader }}"
83 | lastmod: "{{ .Origin.LastMod }}"
84 |
85 | - glob: "**.md"
86 | frontMatter:
87 | <<: *defaultFrontMatter
88 |
89 | - glob: "../web/assets/**"
90 | path: "/../../static/**"
91 |
92 | - glob: "**"
93 | path: "/../../static/**"
94 |
--------------------------------------------------------------------------------
/web/config.toml:
--------------------------------------------------------------------------------
1 | baseURL = "/"
2 | title = "handbook"
3 | theme = ["docsy"]
4 |
5 | enableRobotsTXT = true
6 |
7 |
8 | # Language settings
9 | contentDir = "content/en"
10 | defaultContentLanguage = "en"
11 | defaultContentLanguageInSubdir = false
12 | enableMissingTranslationPlaceholders = true
13 |
14 | disableKinds = ["taxonomy", "taxonomyTerm"]
15 |
16 | # Highlighting config
17 | pygmentsCodeFences = true
18 | pygmentsUseClasses = false
19 | pygmentsUseClassic = false
20 | # See https://help.farbox.com/pygments.html
21 | pygmentsStyle = "tango"
22 |
23 | # Image processing configuration.
24 | [imaging]
25 | resampleFilter = "CatmullRom"
26 | quality = 75
27 | anchor = "smart"
28 |
29 | [services.googleAnalytics]
30 | # Comment out the next line to disable GA tracking. Also disables the feature described in [params.ui.feedback].
31 | # id = "UA-00000000-0"
32 |
33 | # Language configuration
34 |
35 | [languages.en]
36 | title = "Red Hat Monitoring Group Handbook"
37 | description = "Red Hat Monitoring Group Handbook"
38 | languageName = "English"
39 |
40 | [markup.goldmark.renderer]
41 | unsafe = true
42 |
43 | [markup.highlight]
44 | # See a complete list of available styles at https://xyproto.github.io/splash/docs/all.html
45 | style = "tango"
46 | guessSyntax = "true"
47 |
48 | # Comment out if you don't want the "print entire section" link enabled.
49 | [outputs]
50 | section = ["HTML", "print"]
51 |
52 | [params]
53 | copyright = "The Red Hat Monitoring Group Authors"
54 |
55 | # Repository configuration (URLs for in-page links to opening issues and suggesting changes)
56 | # TODO(bwplotka): Changes are suggested to wrong, auto-generated directory, disabled for now. Fix it!
57 | #github_repo = "https://github.com/rhobs/handbook"
58 | github_branch= "main"
59 | # Enable Algolia DocSearch
60 | algolia_docsearch = true
61 | offlineSearch = false
62 | prism_syntax_highlighting = false
63 |
64 | # User interface configuration
65 | [params.ui]
66 | sidebar_menu_compact = false
67 | breadcrumb_disable = false
68 | sidebar_search_disable = false
69 | navbar_logo = true
70 | footer_about_disable = false
71 |
72 | # Adds a H2 section titled "Feedback" to the bottom of each doc. The responses are sent to Google Analytics as events.
73 | # This feature depends on [services.googleAnalytics] and will be disabled if "services.googleAnalytics.id" is not set.
74 | # If you want this feature, but occasionally need to remove the "Feedback" section from a single page,
75 | # add "hide_feedback: true" to the page's front matter.
76 | [params.ui.feedback]
77 | enable = true
78 | # The responses that the user sees after clicking "yes" (the page was helpful) or "no" (the page was not helpful).
79 | yes = 'Glad to hear it! Please tell us how we can improve.'
80 | no = 'Sorry to hear that. Please tell us how we can improve.'
81 |
82 | # Adds a reading time to the top of each doc.
83 | # If you want this feature, but occasionally need to remove the Reading time from a single page,
84 | # add "hide_readingtime: true" to the page's front matter
85 | [params.ui.readingtime]
86 | enable = true
87 |
88 | [[params.links.user]]
89 | name = "GitHub"
90 | url = "https://github.com/rhobs/handbook"
91 | icon = "fab fa-github"
92 | desc = "Ask question, discuss on GitHub!"
93 |
94 | [[menu.main]]
95 | name = "Blog #monitoring"
96 | weight = 50
97 | url = "https://www.openshift.com/blog/tag/monitoring"
98 |
--------------------------------------------------------------------------------
/content/Products/OpenshiftMonitoring/alerting.md:
--------------------------------------------------------------------------------
1 | # Alerting
2 |
3 | ## Targeted audience
4 |
5 | This document is intended for OpenShift developers who want to write alerting rules for their operators and operands.
6 |
7 | ## Configuring alerting rules
8 |
9 | You configure alerting rules based on the metrics being collected for your component(s). To do so, you should create `PrometheusRule` objects in your operator/operand namespace which will also be picked up by the Prometheus operator (provided that the namespace has the `openshift.io/cluster-monitoring="true"` label for layered operators).
10 |
11 | Here is an example of a PrometheusRule object with a single alerting rule:
12 |
13 | ```yaml
14 | apiVersion: monitoring.coreos.com/v1
15 | kind: PrometheusRule
16 | metadata:
17 | name: cluster-example-operator-rules
18 | namespace: openshift-example-operator
19 | spec:
20 | groups:
21 | - name: operator
22 | rules:
23 | - alert: ClusterExampleOperatorUnhealthy
24 | annotations:
25 |         description: Cluster Example operator running in pod {{$labels.namespace}}/{{$labels.pod}} is not healthy.
26 | summary: Operator Example not healthy
27 | expr: |
28 | max by(pod, namespace) (last_over_time(example_operator_healthy[5m])) == 0
29 | for: 15m
30 | labels:
31 | severity: warning
32 | ```
33 |
34 | You can choose to configure all your alerting rules into a single `PrometheusRule` object or split them into different objects (one per component). The mechanism to deploy the object(s) depends on the context: it can be deployed by the Cluster Version Operator (CVO), the Operator Lifecycle Manager (OLM) or your own operator.
35 |
36 | ## Guidelines
37 |
38 | Please refer to the [Alerting Consistency](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md) OpenShift enhancement proposal for the recommendations applying to OCP built-in alerting rules.
39 |
40 | If you need a review of alerting rules from the OCP monitoring team, you can reach them on the `#forum-openshift-monitoring` channel.
41 |
42 | ## Identifying alerting rules without a namespace label
43 |
44 | The enhancement proposal mentioned above states the following for OCP built-in alerts:
45 |
46 | > Alerts SHOULD include a namespace label indicating the source of the alert.
47 |
48 | Unfortunately this isn't something that we can verify by static analysis because the namespace label can come from the PromQL result or be added statically. Nevertheless we can still use the Telemetry data to identify OCP alerts that don't respect this statement.
49 |
50 | First, create an OCP cluster from the latest stable release. Once it is installed, run this command to return the list of all OCP built-in alert names:
51 |
52 | ```bash
53 | curl -sk -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" \
54 | https://$(oc get routes -n openshift-monitoring thanos-querier -o jsonpath='{.status.ingress[0].host}')/api/v1/rules \
55 | | jq -cr '.data.groups | map(.rules) | flatten | map(select(.type =="alerting")) | map(.name) | unique |join("|")'
56 | ```
57 |
58 | Then from https://telemeter-lts.datahub.redhat.com, retrieve the list of all alerts matching the names that fired without a namespace label, grouped by minor release:
59 |
60 | ```
61 | count by (alertname,version) (
62 |   alerts{alertname=~"<paste the output of the previous command>",namespace=""} *
63 |   on(_id) group_left(version) max by(_id, version) (
64 |     label_replace(id_version_ebs_account_internal:cluster_subscribed{version=~"4.\\d\\d.*"}, "version", "$1", "version", "^(4.\\d+).*$")
65 | )
66 | )
67 | ```
68 |
69 | You should now track back the non-compliant alerts to their component of origin and file bugs against them ([example](https://issues.redhat.com/browse/OCPBUGS-17191)).
70 |
71 | The exercise should be done at regular intervals, at least once per release cycle.
72 |
--------------------------------------------------------------------------------
/content/Products/OpenshiftMonitoring/instrumentation.md:
--------------------------------------------------------------------------------
1 | # Instrumentation guidelines
2 |
3 | This document details good practices to adopt when you instrument your application for Prometheus. It is not meant to be a replacement of the [upstream documentation](https://prometheus.io/docs/practices/instrumentation/) but an introduction focused on the OpenShift use case.
4 |
5 | ## Targeted audience
6 |
7 | This document is intended for OpenShift developers who want to instrument their operators and operands for Prometheus.
8 |
9 | ## Getting started
10 |
11 | To instrument software written in Golang, see the official [Golang client](https://pkg.go.dev/github.com/prometheus/client_golang). For other languages, refer to the [curated list](https://prometheus.io/docs/instrumenting/clientlibs/#client-libraries) of client libraries.
12 |
13 | Prometheus stores all data as time series, which are streams of timestamped values (samples) identified by a metric name and a set of unique labels (a.k.a. dimensions or key/value pairs). Its data model is described in detail on this [page](https://prometheus.io/docs/concepts/data_model/). Time series are represented like this:
14 |
15 | ```
16 | # HELP http_requests_total Total number of HTTP requests by method and handler.
17 | # TYPE http_requests_total counter
18 | http_requests_total{method="GET", handler="/messages"} 500
19 | http_requests_total{method="POST", handler="/messages"} 10
20 | ```
21 |
22 | Prometheus supports 4 [metric types](https://prometheus.io/docs/concepts/metric_types/):
23 | * Gauge, which represents a single numerical value that can arbitrarily go up and down.
24 | * Counter, a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. When querying a counter metric, you usually apply a `rate()` or `increase()` function.
25 | * Histogram, which represents observations (usually things like request durations or response sizes) and counts them in configurable buckets.
26 | * Summary, which also represents observations but reports configurable quantiles over a (fixed) sliding time window. In practice, summaries are rarely used.
27 |
28 | Adding metrics for any operation should be part of the code review process like any other factor that is kept in mind for production ready code.
29 |
30 | To learn more about when to use which metric type, how to name metrics and how to choose labels, read the following documentation. {{% alert color="info" %}}OpenShift follows the outlined conventions whenever possible. Any exceptions should be reviewed and properly motivated.{{% /alert %}}
31 | * [Prometheus naming recommendations](https://prometheus.io/docs/practices/naming/)
32 | * [Prometheus instrumentation](https://prometheus.io/docs/practices/instrumentation/)
33 | * [Kubernetes metric instrumentation guide](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/metric-instrumentation.md)
34 | * [Instrumenting a Go application for Prometheus](https://prometheus.io/docs/guides/go-application/)
35 |
36 | ## Example
37 |
38 | Here is a fictional Go code example instrumented with a Gauge metric and a multi-dimensional Counter metric:
39 |
40 | ```golang
41 | cpuTemp := prometheus.NewGauge(prometheus.GaugeOpts{
42 | Name: "cpu_temperature_celsius",
43 | Help: "Current temperature of the CPU.",
44 | })
45 |
46 | hdFailures := prometheus.NewCounterVec(
47 | prometheus.CounterOpts{
48 | Name: "hd_errors_total",
49 | Help: "Number of hard-disk errors.",
50 | },
51 | []string{"device"},
52 | )
53 |
54 | reg := prometheus.NewRegistry()
55 | reg.MustRegister(cpuTemp, hdFailures)
56 |
57 | cpuTemp.Set(55.2)
58 |
59 | // Record 1 failure for the /dev/sda device.
60 | hdFailures.With(prometheus.Labels{"device":"/dev/sda"}).Inc()
61 | // Record 3 failures for the /dev/sdb device.
62 | hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()
63 | hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()
64 | hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()
65 | ```
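
The registry above only holds metrics in memory; to let Prometheus scrape them, they still need to be exposed over HTTP. A minimal, hedged sketch using the official `promhttp` handler follows (the listen address and the `/metrics` path are illustrative defaults, not an OpenShift-specific convention):

```golang
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// In real code, reuse the registry from the example above instead of
	// creating an empty one here.
	reg := prometheus.NewRegistry()

	// Serve the registry's content in the Prometheus text exposition format.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```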
66 |
67 | ## Labels
68 |
69 | Defining when to add and when not to add a label to a metric is a [difficult choice](https://prometheus.io/docs/practices/instrumentation/#use-labels). The general rule is: the fewer labels, the better. Every unique combination of label names and values creates a new time series, and Prometheus memory usage is mostly driven by the number of time series loaded into RAM during ingestion and querying. A good rule of thumb is to have fewer than 10 time series per metric name and target. A common mistake is to store dynamic information such as usernames, IP addresses or error messages in a label, which can lead to thousands of time series.
70 |
71 | Labels such as `pod`, `service`, `job` and `instance` shouldn't be set by the application. Instead they are discovered at runtime by Prometheus when it queries the Kubernetes API to discover which targets should be scraped for metrics.
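
To make the cardinality guidance concrete, here is a short, hedged sketch contrasting a bounded label set with the kind of unbounded label to avoid (the metric and label names are purely illustrative):

```golang
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	reg := prometheus.NewRegistry()

	// Good: "method" and "code" each have a small, fixed set of possible values,
	// so the number of series per target stays small and predictable.
	apiRequests := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "example_api_requests_total",
			Help: "Total number of API requests by HTTP method and status code.",
		},
		[]string{"method", "code"},
	)
	reg.MustRegister(apiRequests)
	apiRequests.WithLabelValues("GET", "200").Inc()

	// Bad (avoid): an unbounded label such as a username or client IP address,
	// e.g. []string{"method", "code", "username"}, creates a new time series for
	// every distinct value; keep such details in logs or traces instead.
}
```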
72 |
73 | ## Custom collectors
74 |
75 | It is sometimes not feasible to use one of the 4 metric types, typically when your application already stores the information for other purposes (for instance, it maintains a list of custom objects retrieved from the Kubernetes API). In this case, the [custom collector](https://pkg.go.dev/github.com/prometheus/client_golang@v1.20.4/prometheus#hdr-Custom_Collectors_and_constant_Metrics) pattern can be useful.
76 |
77 | You can find an example of this pattern in the [github.com/prometheus-operator/prometheus-operator](https://github.com/prometheus-operator/prometheus-operator/blob/3df0811bdc7c046cb283006d94092e42219a0e2f/pkg/operator/operator.go#L166-L191) project.
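
Below is a minimal, hedged sketch of the pattern (the metric name, labels and in-memory state are purely illustrative): instead of keeping long-lived metric objects in sync, the collector builds constant metrics on every scrape from state the application already maintains.

```golang
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type objectCollector struct {
	desc *prometheus.Desc
}

func newObjectCollector() *objectCollector {
	return &objectCollector{
		desc: prometheus.NewDesc(
			"example_managed_objects",                              // Illustrative metric name.
			"Number of custom objects currently managed, by kind.", // Help text.
			[]string{"kind"},                                       // Variable labels.
			nil,                                                    // No constant labels.
		),
	}
}

// Describe sends the static descriptions of all metrics this collector can emit.
func (c *objectCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

// Collect is called on every scrape and builds constant metrics on the fly from
// state the application already has (here a hard-coded map; in a real operator
// this would typically come from an informer cache or similar).
func (c *objectCollector) Collect(ch chan<- prometheus.Metric) {
	state := map[string]float64{"prometheuses": 3, "alertmanagers": 1}
	for kind, count := range state {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, count, kind)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(newObjectCollector())
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```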
78 |
79 | ## Next steps
80 |
81 | * [Collect metrics](collecting_metrics.md) with Prometheus.
82 | * [Configure alerting](alerting.md) with Prometheus.
83 | * [Add dashboards](dashboards.md) to the OCP console.
84 |
--------------------------------------------------------------------------------
/content/Services/RHOBS/use-cases/telemetry.md:
--------------------------------------------------------------------------------
1 | # Telemetry (Telemeter)
2 |
3 | > For RHOBS Overview see [this document](README.md)
4 |
5 | Telemeter is the metrics-only hard tenant of the RHOBS service designed as a centralized OpenShift Telemetry pipeline for OpenShift Container Platform. It is an essential part of gathering real-time telemetry for remote health monitoring, automation and billing purposes.
6 |
7 | * [OpenShift Documentation](https://docs.openshift.com/container-platform/latest/support/remote_health_monitoring/about-remote-health-monitoring.html#telemetry-about-telemetry_about-remote-health-monitoring) about the Telemetry service.
8 | * [Internal documentation](https://gitlab.cee.redhat.com/data-hub/dh-docs/-/blob/master/docs/interacting-with-telemetry-data.adoc) for interacting with the Telemetry data.
9 |
10 | ## Product Managers
11 |
12 | * Roger Floren
13 |
14 | ## Big Picture Overview
15 |
16 | 
17 |
18 | *[Source](https://docs.google.com/drawings/d/1eIAxCUS2v8Bt0-Ken2gHnx-Q1u8JNPfi2rMB-Azz5zI/edit)*
19 |
20 | ## Support
21 |
22 | To escalate issues, use one of the following depending on the issue type:
23 |
24 | * For questions related to the service or the kind of data it ingests, use the `telemetry-sme@redhat.com` (internal) mail address. For quick questions you can try the [#forum-telemetry](https://coreos.slack.com/archives/CEG5ZJQ1G) channel on CoreOS Slack.
25 | * For functional bugs or feature requests, use Bugzilla with Product: `OpenShift Container Platform` and the `Telemeter` component ([example bug](https://bugzilla.redhat.com/show_bug.cgi?id=1914956)). You can additionally notify us about a new bug on `#forum-telemetry` on CoreOS Slack.
26 | * For functional bugs or feature requests for historical storage (Data Hub), use the PNT Jira project.
27 |
28 | > For the managing team: See our internal [agreement document](https://docs.google.com/document/d/1iAhzVxm2ovqkWxJCLplwR7Z-1gzXhfRKcHqXnpQh9Hg/edit#).
29 |
30 | ### Escalations
31 |
32 | For urgent escalation use:
33 |
34 | * For Telemeter Service Unavailability: `@app-sre-ic` and `@observatorium-oncall` on CoreOS Slack.
35 | * For Historical Data (DataHub) Service Unavailability: `@data-hub-ic` on CoreOS Slack.
36 |
37 | ## Service Level Agreement
38 |
39 | 
40 |
41 | RHOBS has currently established the following default Service Level Objectives. This is based on the infrastructure dependencies we have listed [here (internal)](https://visual-app-interface.devshift.net/services#/services/rhobs/app.yml).
42 |
43 | > Previous docs (internal):
44 | > * [2019-10-30](https://docs.google.com/document/d/1LN-3yDtXmiDmGi5ZwllklJCg3jx-4ysNv6oUZudFj2g/edit#heading=h.20e6cn146nls)
45 | > * [2021-02-10](https://docs.google.com/document/d/1iGRsFMR9YmWG8Mk95UXU_PAUKvk1_zyNUkevbk7ZnFw/edit#heading=h.bupciudrwmna)
46 |
47 | #### Metrics SLIs
48 |
49 | | API | SLI Type | SLI Spec | Period | SLI Implementation | Dashboard |
50 | |----------|--------------|----------------------------------------|--------|--------------------------------|-----------------------------------------------------------------------------------------------------|
51 | | `/write` | Availability | The % of successful (non 5xx) requests | 28d | Metrics from Observatorium API | [Dashboard](https://grafana.app-sre.devshift.net/d/Tg-mH0rizaSJDKSADJ/telemeter?orgId=1&refresh=1m) |
52 | | `/write` | Latency | The % of requests under X latency | 28d | Metrics from Observatorium API | [Dashboard](https://grafana.app-sre.devshift.net/d/Tg-mH0rizaSJDKSADJ/telemeter?orgId=1&refresh=1m) |
53 |
54 | Read Metrics TBD.
55 |
56 | *Agreements*:
57 |
58 | > NOTE: No entry for your case (e.g. dev/staging) means zero formal guarantees.
59 |
60 | | SLI | Date of Agreement | Tier | SLO | Notes |
61 | |-----------------------|-------------------|--------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
62 | | `/write` Availability | 2020/2019 | Internal (default) | 99% success rate for incoming [requests](#write-limits) | This depends on the sso.redhat.com SSO SLO (98.5%). In the worst case (everyone needs to refresh their token) we are below 98.5%; in the average case, with caching being 5m (we cannot change it), ~99% (assuming we can preserve 5m) |
63 | | `/write` Latency | 2020/2019 | Internal (default) | 95% of [requests](#write-limits) < 250ms, 99% of [requests](#write-limits) < 1s | |
64 |
65 | ##### Write Limits
66 |
67 | Within our SLO, a write request must match the following criteria to be considered valid:
68 |
69 | * Valid remote write requests using the official remote write protocol (see the conformance test)
70 | * Valid credentials: (explanation TBD(https://github.com/rhobs/handbook/issues/24))
71 | * Max samples: TBD(https://github.com/rhobs/handbook/issues/24)
72 | * Max series: TBD(https://github.com/rhobs/handbook/issues/24)
73 | * Rate limit: TBD(https://github.com/rhobs/handbook/issues/24)
74 |
75 | TODO: Provide example tuned client Prometheus configurations for remote write.
76 |
--------------------------------------------------------------------------------
/content/Proposals/Accepted/202205-oo.md:
--------------------------------------------------------------------------------
1 | ## Evolution of the Observability Operator (fka MSO)
2 |
3 | * **Owners:**
4 | * [`@dmohr`](https://github.com/danielm0hr)
5 | * [`@jan--f`](https://github.com/jan--f)
6 |
7 | * **Related Tickets:**
8 | * https://issues.redhat.com/browse/MON-2279
9 | * https://issues.redhat.com/browse/MON-2482
10 | * https://issues.redhat.com/browse/MON-2508 (short-term focus)
11 |
12 | * **Other docs:**
13 | * [Original proposal for MSO](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/monitoring-stack-operator.md)
14 |
15 | > TL;DR: As a mid-to-long-term vision for the Monitoring Stack Operator, we propose to rename MSO to OO (Observability Operator). With the name, we propose to establish OO as *the* open source, (single) cluster-side component for OpenShift observability needs, and, in conjunction with Observatorium / RHOBS, as *the* multi-cluster, multi-tenant, scalable observability backend component. OO is meant to manage different kinds of cluster-side monitoring, alerting, logging and tracing (and potentially profiling) stack setups, covering the needs of OpenShift variants ranging from the client side of fully managed multi-cluster use cases, to HyperShift, to single-node air-gapped setups.
16 |
17 | ## Why
18 |
19 | With the rise of new OpenShift variants with very different needs regarding observability, the desire for a consistent way of providing differently configured monitoring stacks grows. Examples:
20 | - Traditional single-cluster on-prem OpenShift deployments need a self-contained monitoring stack where all components (scraping, storage, visualization, alerting, log collection) run on the same cluster. This is the kind of setup [Cluster Monitoring Operator](https://github.com/openshift/cluster-monitoring-operator) was designed for.
21 | - Multi-cluster deployments need a central (aggregated, federated) view on metrics, logs and alerts. Certain components of the stack don't run on the workload clusters but in some central infrastructure.
22 | - Resource-constrained deployments need a stripped-down version of the monitoring stack, e.g. only forwarding signals to a central infrastructure or disabling certain parts of the stack completely.
23 | - Mixed deployments (e.g. OpenShift + OpenStack) are conceptually very similar to the multi-cluster use case, also needing a central pane of glass for observability signals.
24 | - Special purpose deployments (e.g. OpenShift Data Foundation) have special requirements when it comes to monitoring, that are tricky to align with the existing CMO setup.
25 | - Looking at eventually correlating different observability signals, the cluster-side stack would also potentially benefit from a holistic approach for deploying monitoring, logging and tracing components and configuring them in the right way to work together.
26 | - Managed service deployments need a multi-tenancy-capable way of deploying many similarly built monitoring stacks to a single cluster. This is the short-term focus for OO.
27 |
28 | The proposal is to combine all these (and more) use cases into one single (meta) operator (as recommended by [the operator best practices](https://sdk.operatorframework.io/docs/best-practices/best-practices/#development)) which can be configured, e.g. with presets, to instruct lower-level operators (like the Prometheus Operator, and potentially the Loki or Jaeger operators) to deploy purpose-built monitoring stacks for different use cases. This is similar to the original CMO concept but with much higher flexibility and feature velocity in mind, thanks to not being tied to OpenShift Core versioning.
29 |
30 | Additionally, supporting multiple different ways of deploying monitoring stacks (CMO as the standard OpenShift way, OO for managed services, something else for e.g. HyperShift or edge scenarios, ...) is a burden for the team. Instead, eventually supporting only one way to deploy monitoring stacks - with OO - covering all these use cases makes it a lot simpler and far more consistent.
31 |
32 | ### Pitfalls of the current solution
33 |
34 | CMO is built for traditional, self-operated, single-cluster focused deployments of OpenShift. It intentionally lacks the flexibility for many other use cases (see above) in order to provide monitoring that is resilient against configuration drift. E.g. the reason for creating OO (MSO) in the first place - supporting managed service use cases - can't currently be covered by CMO. See the [original MSO proposal](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/monitoring-stack-operator.md) for more details.
35 |
36 | The results of this lack of flexibility can be readily observed: Red Hat teams have built their own solutions for their monitoring use cases, e.g. leveraging community operators or self-written deployments, with varying success, reliability and supportability.
37 |
38 | ## Goals
39 |
40 | Goals and use cases for the solution as proposed in [How](#how):
41 |
42 | * Widen the scope of OO to cover additional use cases besides managed services.
43 | * Replace existing ways of deploying monitoring stacks across Red Hat products with OO.
44 | * Focus on OpenShift use cases primarily but don't exclude vanilla Kubernetes as a possible target.
45 | * Create an API that easily allows common configuration across observability signals.
46 |
47 | ## Non-Goals
48 |
49 | * Create a multi-cluster capable observability operator.
50 |
51 | ## How
52 |
53 | * Define use cases to be covered in detail.
54 | * Prioritize use cases and add needed features one by one.
55 |
56 | ## Alternatives
57 |
58 | 1. Tackle each monitoring use case across Red Hat products one by one and build a custom solution for them. This would lead to many different (but potentially simpler) implementations which need to be supported.
59 | 2. Develop signal specific operators that can handle the required use cases. This would likely require an API between those operators to apply common configuration.
60 |
61 | ## Action Plan
62 |
63 | Collection of requirements and prioritization of use cases is currently in progress (Q3 2022).
64 |
--------------------------------------------------------------------------------
/content/Proposals/process.md:
--------------------------------------------------------------------------------
1 | # Proposals Process
2 |
3 | ## Where to Propose Changes/Where to Submit Proposals?
4 |
5 | As defined in the [handbook proposal](Accepted/202106-handbook.md#pitfalls-of-the-current-solution), our Handbook is meant to be an index for our team resources and a linking point to other distributed projects we maintain or contribute to.
6 |
7 | First, we need to identify whether the idea we have is something we can contribute to an upstream project, or whether it does not fit anywhere else, in which case we can leverage the [Handbook Proposal directory](..) and the [process](#proposal-process). See the algorithm below to find out:
8 |
9 | 
10 |
11 | [Internal Team Drive for Public and Confidential Proposals](https://drive.google.com/drive/folders/1WGqC3gMCxIQlrnjDUYfNUTPYYRI5Cxto)
12 |
13 | ## Proposal Process
14 |
15 | If there is no problem, there is no need for changing anything, no need for a proposal. This might feel trivial, but we should first ask ourselves this question before even thinking about writing a proposal.
16 |
17 | It takes time to propose an idea, find consensus and implement more significant concepts, so let's not waste time before it's worth it. But, unfortunately, even good ideas sometimes have to wait for a good moment to discuss them.
18 |
19 | Let's assume the idea sounds interesting to you; what to do next, where to propose it? How to review it? Follow the algorithm below:
20 |
21 | 
22 |
23 | > Note: It's totally ok to reject a proposal if a team member feels the idea is wrong. It's better to explicitly oppose it than to ignore it and leave it in limbo.
24 |
25 | > NOTE: We would love to host the Logging and Tracing Teams if they choose to follow our process, but we don't want to enforce it. We are happy to extend this process from the Monitoring Group handbook to the Observability Group. Still, it has to grow organically (if the Logging and Tracing teams see the value of joining us here).
26 |
27 | ### On Review Process
28 |
29 | As you can see in the above algorithm, if the content relates to any upstream project, it should be proposed, reviewed and potentially implemented together with the community. This does not mean that you cannot involve other team members in this effort. Share the proposal with team members, even if they are not part of the maintainer's team on a given project; any feedback and voice are useful and can help move the idea further.
30 |
31 | Similar to proposals that touch only our team, despite the mandatory approval from leads, anyone can give feedback! Our process is in fact very similar to [Hashicorp's RFC process](https://works.hashicorp.com/articles/writing-practices-and-culture):
32 |
33 | > Once you’ve written the first draft of an RFC, share it with your team. They’re likely to have the most context on your proposal and its potential impacts, so most of your feedback will probably come at this stage. Any team member can comment on and approve an RFC, but you need explicit approval only from the appropriate team leads in order to move forward. Once the RFC is approved and shared with stakeholders, you can start implementing the solution. For major projects, also share the RFC to the company-wide email list. While most members of the mailing list will just read the email rather than the full RFC, sending it to the list gives visibility into major decisions being made across the company.
34 |
35 | ## Templates
36 |
37 | ### Google Docs Template
38 |
39 | [Open Source Design Doc Template](https://docs.google.com/document/d/1zeElxolajNyGUB8J6aDXwxngHynh4iOuEzy3ylLc72U/edit#).
40 |
41 | ### Markdown Template:
42 |
43 | ## Your Proposal Title
44 |
45 | * **Owners:**
46 | * `<@author: single champion for the moment of writing>`
47 |
48 | * **Related Tickets:**
49 | * ``
50 |
51 | * **Other docs:**
52 | * ``
53 |
54 | > TL;DR: Give here a short summary of what this document is proposing and what components it is touching. Outline a rough idea of the proposer's view on the proposed changes.
55 | >
56 | > *For example: This design doc is proposing a consistent design template for “example.com” organization.*
57 |
58 | ## Why
59 |
60 | Put here a motivation behind the change proposed by this design document, give context.
61 |
62 | *For example: It’s important to clearly explain the reasons behind certain design decisions in order to have a consensus between team members, as well as external stakeholders. Such a design document can also be used as a reference and knowledge-sharing purposes. That’s why we are proposing a consistent style of the design document that will be used for future designs.*
63 |
64 | ### Pitfalls of the current solution
65 |
66 | What specific problems are we hitting with the current solution? Why is it not enough?
67 |
68 | *For example, We were missing a consistent design doc template, so each team/person was creating their own. Because of inconsistencies, those documents were harder to understand, and it was easy to miss important sections. This was causing certain engineering time to be wasted.*
69 |
70 | ## Goals
71 |
72 | Goals and use cases for the solution as proposed in [How](#how):
73 |
74 | * Allow easy collaboration and decision making on design ideas.
75 | * Have a consistent design style that is readable and understandable.
76 | * Have a design style that is concise and covers all the essential information.
77 |
78 | ### Audience
79 |
80 | If not clear, the target audience that this change relates to.
81 |
82 | ## Non-Goals
83 |
84 | * Move old designs to the new format.
85 | * Not doing X,Y,Z.
86 |
87 | ## How
88 |
89 | Explain the full overview of the proposed solution. Some guidelines:
90 |
91 | * Make it concise and **simple**; put diagrams; be concrete, avoid using “really”, “amazing” and “great” (:
92 | * How will you test and verify?
93 | * How will you migrate users without downtime? How do we solve incompatibilities?
94 | * What open questions are left? (“Known unknowns”)
95 |
96 | ## Alternatives
97 |
98 | The section stating potential alternatives. Highlight the objections the reader should have towards your proposal as they read it. Tell them why you still think you should take this path [[ref](https://twitter.com/whereistanya/status/1353853753439490049)]
99 |
100 | 1. This is why not solution Z...
101 |
102 | ## Action Plan
103 |
104 | The tasks to do in order to migrate to the new idea.
105 |
106 | * [ ] Task one
107 | * [ ] Task two ...
108 |
--------------------------------------------------------------------------------
/content/Services/RHOBS/README.md:
--------------------------------------------------------------------------------
1 | # Red Hat Observability Service
2 |
3 | *Previous Documents:*
4 |
5 | * [Initial Design for RHOBS (Internal Content)](https://docs.google.com/document/d/1cSz_ZbS35mk8Op92xhB9ijW1ivOtJuD1uAzPiBdSUqs/edit)
6 | * [Initial FAQ (Internal Content)](https://docs.google.com/document/d/1_xnJBS3v7n4m229L3tqCqBXzZy55yu6dxCJY-vh_Egs/edit)
7 |
8 | ## What
9 |
10 | Red Hat Observability Service (RHOBS) is a managed, centralized, multi-tenant, scalable backend for observability data. Functionally it is an internal deployment of the [Observatorium](../../Projects/Observability/observatorium.md) project. RHOBS is designed to allow ingesting, storing and consuming (visualisations, import, alerting, correlation) observability signals like metrics, logging and tracing.
11 |
12 | This document provides a basic overview of the RHOBS service. If you want to learn about the RHOBS architecture, look at the [Observatorium](https://observatorium.io) default deployment and its design.
13 |
14 | ### Background
15 |
16 | With the growing number of managed Openshift clusters, for Red Hat’s own use as well as for customers, there is a strong need to improve the observability of those clusters and of their workloads at the multi-cluster level. Moreover, with the “clusters as cattle” concept and more automation and complexity, there is a strong need for a uniform service gathering observability data, including metrics, logs, traces, etc., into a remote, centralized location for multi-cluster aggregation and long-term storage.
17 |
18 | This need is due to many factors:
19 |
20 | * Managing applications running on more than one cluster, which is the default nowadays (clusters as cattle).
21 | * The need to offload data and observability services from edge clusters to reduce their complexity and cost (e.g. remote health capabilities).
22 | * Allowing teams to offload maintaining their own observability service.
23 |
24 | It’s worth noting that there is also a significant benefit to collecting and using multiple signals within one system:
25 | * Correlating signals and creating a smooth and richer debugging UX.
26 | * Sharing common functionality, like rate limiting, retries, auth, etc, which allows a consistent integration and management experience for users.
27 |
28 | The Openshift Monitoring Team began preparing for this shift in 2018 with the [Telemeter](use-cases/telemetry.md) Service. In particular, while creating the second version of the Telemeter Service, we put effort into developing and contributing to open source systems and integrations to design [“Observatorium”](../../Projects/Observability/observatorium.md): a multi-signal, multi-tenant system that can be operated easily and cheaply as a service either by Red Hat or on-premise. After extending the scope of RHOBS, Telemeter became the first "tenant" of RHOBS.
29 |
30 | In Summer 2020, the Monitoring Team together with the OpenShift Logging Team added the logging signal to [“Observatorium”](../../Projects/Observability/observatorium.md) and started to manage it for internal teams as part of RHOBS.
31 |
32 | ### Current Usage
33 |
34 | RHOBS is running in production and has already been offered to various internal teams, with more extensions and expansions coming in the near future.
35 |
36 | * There is currently no plan to offer RHOBS to external customers.
37 | * However, anyone is welcome to deploy and manage an RHOBS-like service on their own using [Observatorium](../../Projects/Observability/observatorium.md).
38 |
39 | Usage (state as of 2021.07.01):
40 |
41 | The metric usage is visualised in the following diagram:
42 |
43 | 
44 |
45 | RHOBS is functionally separated into two main usage categories:
46 |
47 | * Since 2018 we have run the Telemeter tenant for the metric signal (hard tenancy, `telemeter-prod-01` cluster). See [telemeter](use-cases/telemetry.md) for details.
48 | * Since 2021 we have ingested metrics for selected Managed Services as soft tenants in an independent deployment (separate soft tenants, `telemeter-prod-01` cluster). See [MST](use-cases/observability.md) for details.
49 |
50 | Other signals:
51 |
52 | * Since 2020 we ingest logs for the DPTP team (hard tenancy, `telemeter-prod-01` cluster).
53 |
54 | > Hard vs Soft tenancy: In principle, hard tenancy means that in order to run a system for multiple tenants, you run a separate stack for each tenant (it can still be in a single cluster, yet isolated through e.g. namespaces). Soft tenancy means that we reuse the same endpoint and services to handle tenant APIs. For tenants, both tenancy models should be invisible. We chose to deploy Telemeter in a different stack because of Telemeter's critical nature and different functional purpose, which makes its performance characteristics a bit different from a normal monitoring setup (more analytics-driven cardinality).
55 |
56 | ### Owners
57 |
58 | RHOBS was initially designed by the `Monitoring Group` and its metric signal is managed and supported by the `Monitoring Group` team (OpenShift organization) together with the AppSRE team (Service Delivery organization).
59 |
60 | Other signals are managed by other teams (each together with AppSRE's help):
61 |
62 | * Logging signal: OpenShift Logging Team
63 | * Tracing signal: OpenShift Tracing Team
64 |
65 | The Telemeter client side is maintained by the In-cluster Monitoring team, but managed by the corresponding client cluster owner.
66 |
67 | ### Metric Label Restrictions
68 |
69 | There is a small set of label names that are reserved on the RHOBS side. If a remote write request contains series with those labels, they will be overridden (not honoured).
70 |
71 | Restricted labels:
72 |
73 | * `tenant_id`: Internal tenant ID in the form of a UUID that is injected into all series in various places and constructed from the tenant authorization (HTTP header).
74 | * `receive`: Special label used internally.
75 | * `replica` and `rule_replica`: Special labels used internally to replicate receive and rule data for reliability.
76 |
77 | Special labels:
78 |
79 | * `prometheus_replica`: Use this label as the unique Agent/Prometheus scraper ID if you scrape and push from multiple replicas scraping the same data. RHOBS is configured to deduplicate metrics based on this label.
80 |
81 | Recommended labels:
82 |
83 | * `_id` for cluster ID (in the form of a lowercase UUID) is the common way of identifying clusters.
84 | * `prometheus` is the name of the Prometheus (or any other scraper) that was used to collect metrics.
85 | * `job` is the scrape job name, usually dynamically set to the abstracted service name representing the microservice (usually the deployment/statefulset name).
86 | * `namespace` is the Kubernetes namespace. Useful to identify the same microservice across namespaces.
87 | * For more "version"-based granularity (warning - every pod rollout creates new time series):
88 |   * `pod` is the pod name.
89 |   * `instance` is an IP:port (useful when a pod has sidecars).
90 |   * `image` is the image name and tag.
91 |
92 | ### Support
93 |
94 | For support see:
95 |
96 | * [MST escalation flow](use-cases/observability.md#support) and [MST SLA](use-cases/observability.md#service-level-agreement).
97 | * [Telemeter escalation flow](use-cases/telemetry.md#support) and [Telemeter SLA](use-cases/telemetry.md#service-level-agreement).
98 |
--------------------------------------------------------------------------------
/content/contributing.md:
--------------------------------------------------------------------------------
1 | # Contributing
2 |
3 | This document explains the process of modifying this handbook.
4 |
5 | > Were you instead looking for how to contribute to the projects we maintain and contribute to as part of the Monitoring Group? Check out the [Projects list!](Projects/README.md)
6 |
7 | ## Who can Contribute?
8 |
9 | Anyone is welcome to read and contribute to our handbook! It's licensed with Apache 2. Any team member, another Red Hat team member, or even anyone outside Red Hat is welcome to help us refine processes, update documentation and FAQs.
10 |
11 | If the handbook inspired you, let us know too!
12 |
13 | ## What and How You Can Contribute?
14 |
15 | Follow [Modifying Documents Guide](#modifying-documents) and add GitHub Pull Request to [https://github.com/rhobs/handbook](https://github.com/rhobs/handbook):
16 |
17 | * If you find a broken link, typo, unclear statements, gap in the documentation.
18 | * If you find a broken element on `https://rhobs-handbook.netlify.app/` website or via [GitHub UI](https://github.com/rhobs/handbook/tree/main/content).
19 | * If you want to add a new proposal (please also follow the proposal process (TBD)).
20 | * If there is no active and existing PR for a similar work item. [TODO General Contribution Guidelines](https://github.com/rhobs/handbook/issues/7)
21 |
22 | Add GitHub Issue to https://github.com/rhobs/handbook:
23 |
24 | * If you don't want to spend time doing the actual work right away, but want to report it and potentially do it later in your free time.
25 | * If there is no current issue for a similar work item.
26 |
27 | ## Modifying Documents
28 |
29 | This section will guide you through changing, adding or removing anything from handbook content.
30 |
31 | > NOTE: It's ok to start with some large changes in Google Docs first for easier collaboration. Then, you can use [Google Chrome Plugin](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) to convert it to markdown later on.
32 |
33 | ### Prerequisites
34 |
35 | * Familiarise with [GitHub Flavored Markdown Spec](https://github.github.com/gfm/). Our source documentation is written in that markdown dialect.
36 | * Check the goals for this handbook defined in the [Handbook proposal](Proposals/Accepted/202106-handbook.md) to ensure you understand the scope of the handbook materials.
37 | * Make sure you have installed Go 1.15+ on your machine.
38 |
39 | ### How to decide where to put the content?
40 |
41 | 
42 |
43 | NOTE: Make sure to not duplicate too much information with the project documentation. Docs are aging quickly.
44 |
45 | ### Editing the content
46 |
47 | As you can read in the [Handbook proposal](Proposals/Accepted/202106-handbook.md), our website framework ([`mdox`](https://github.com/bwplotka/mdox) and [Hugo](https://gohugo.io/)) is designed to allow maintaining the source documentation files, written in markdown, in a website-agnostic way. Just write markdown as you would if the [GitHub UI](https://github.com/rhobs/handbook/tree/main/content) itself were our main rendering point!
48 |
49 | #### Changing Documentation Files
50 |
51 | 1. Go to https://github.com/rhobs/handbook/ and fork it.
52 |
53 | 2. Git clone it on your machine, add a branch to your fork (alternatively, you can use GitHub UI to edit file).
54 |
55 | 3. Add, edit or remove any markdown file in [`./content`](https://github.com/rhobs/handbook/tree/main/content) directory. The structure of directories will map to the menu on the website.
56 |
57 | 4. Format markdown.
58 |
59 | We mandate opinionated markdown formatting in our docs to ensure maintainability. As humans, we can make mistakes, and the formatter will quickly fix inconsistent newlines, whitespace etc. It will also validate that your relative links point to something that exists (!). Run `make docs` from the repo root.
60 |
61 | > NOTE: Use relative links directly to markdown files when pointing to another file in the `content` directory. Don't link to the same resources on the website. The framework will transform links to md files into links to the relative URL paths on the website.
62 |
63 | > NOTE: The CI job will check the formatting and local links, so before committing, make sure to run `make docs` from the main repo root!
64 |
65 | 5. Create a PR.
66 |
67 | 6. Once the PR is merged, the content will appear on our website `https://rhobs-handbook.netlify.app/` and in the [GitHub UI](https://github.com/rhobs/handbook/tree/main/content).
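68 |
69 | Putting these steps together, a minimal local workflow might look like the following sketch (the fork URL, branch name and commit message below are placeholders to adjust):
70 |
71 | ```bash
72 | # Clone your fork and create a working branch (replace <your-username>).
73 | git clone https://github.com/<your-username>/handbook.git
74 | cd handbook
75 | git checkout -b my-docs-change
76 |
77 | # Edit or add markdown files under ./content, then format and validate links.
78 | make docs
79 |
80 | # Optionally preview the rendered website on localhost:1313.
81 | make web-serve
82 |
83 | # Commit and push, then open a PR against rhobs/handbook.
84 | git add content/
85 | git commit -m "docs: describe X"
86 | git push origin my-docs-change
87 | ```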
68 |
69 | #### Adding Static Content
70 |
71 | Add all static content that you want to link to or embed (e.g. images) to the `content` directory (ideally to [`content/assets`](https://github.com/rhobs/handbook/tree/main/content/assets) for clean grouping) and link it relatively in your main document.
72 |
73 | It's recommended to check if your page renders well before merging (see [section below](#early-previews)).
74 |
75 | > NOTE: If you used some tool for creating diagrams (e.g. check out the amazing https://excalidraw.com/), put the source file into assets too. If an online tool was used, e.g. Google Draw, add a link to it in the markdown so we can edit those easily.
76 |
77 | #### Early Previews
78 |
79 | While editing documentation, you can run the website locally to check how it all looks!
80 |
81 | *Locally:*
82 |
83 | Run `make web-serve` from repo root. This will serve the website on `localhost:1313`.
84 |
85 | > NOTE: Currently, some assets reload at runtime (e.g. CSS), but content changes require a `make web-serve` restart. See https://github.com/bwplotka/mdox/issues/35 which tracks a fix for this.
86 |
87 | *From the PR:*
88 |
89 | Click `details` on website preview CI check from netlify:
90 |
91 | 
92 |
93 | > Something does not work? A link works on GitHub but not on the website? Don't try to hack around it. It might be a bug in our framework. Please raise an issue on https://github.com/rhobs/handbook/ for us to investigate.
94 |
95 | ## Advanced: Under the hood
96 |
97 | We do advanced conversions from source files in `content` dir to the website. Let's do a quick overview of our framework.
98 |
99 | * [`./web`](https://github.com/rhobs/handbook/tree/main/web) contains resources related to the website.
100 | * [`./web/themes`](https://github.com/rhobs/handbook/tree/main/web/themes) hosts our Hugo theme [`Docsy`](https://themes.gohugo.io/docsy/) as git submodule.
101 | * [`./web/config.toml`](https://github.com/rhobs/handbook/blob/main/web/config.toml) holds Hugo and theme configuration.
102 | * [`./netlify.toml`](https://github.com/rhobs/handbook/blob/main/netlify.toml) holds our Netlify deployment configuration.
103 | * [`.mdox.yaml`](https://github.com/rhobs/handbook/blob/main/.mdox.yaml) holds https://github.com/bwplotka/mdox configuration that tells mdox how to *pre-process* our docs.
104 |
105 | When you run `make web` or `make web-serve`, a particular process happens to "render" the website:
106 |
107 | 1. `mdox`, using `.mdox.yaml`, transforms the resources from the `content` directory, moving them into a website-friendly form (adding frontmatter, adjusting links and many other things), and outputs the results in `./web/content` and `./web/static`, which are never committed (temporary directories).
108 | 2. We run Hugo, which renders the markdown files from `./web/content` and using the rest of the resources in `./web` to render the website in `./web/publish`.
109 | 3. Netlify then knows that website has to be hosted from `./web/publish`.
110 |
111 | ## How to contact us?
112 |
113 | Feel free to open an [issue or discussion topic on our handbook repo](https://github.com/rhobs/handbook), ask the @bwplotka or @onprem users on Slack or catch us on Twitter.
114 |
--------------------------------------------------------------------------------
/content/Products/OpenshiftMonitoring/faq.md:
--------------------------------------------------------------------------------
1 | # Frequently asked questions
2 |
3 | This serves as a collection of resources that relate to FAQ around configuring/debugging the in-cluster monitoring stack. Particularly it applies to two OpenShift Projects:
4 |
5 | * [Platform Cluster Monitoring - PM](https://docs.openshift.com/container-platform/latest/monitoring/understanding-the-monitoring-stack.html#understanding-the-monitoring-stack_understanding-the-monitoring-stack)
6 | * [User Workload Monitoring - UWM](https://docs.openshift.com/container-platform/latest/monitoring/enabling-monitoring-for-user-defined-projects.html)
7 |
8 | ## How can I (as a monitoring developer) troubleshoot support cases?
9 |
10 | See this [presentation](https://docs.google.com/presentation/d/1SY0xHNO-QMvhMi1kRSJlZuYnkuZ3nFLJvWjG0CM7wgw/edit) to understand which tools are at your disposal.
11 |
12 | ## How do I understand why targets aren't discovered and metrics are missing?
13 |
14 | Both `PM` and `UWM` monitoring stacks rely on the `ServiceMonitor` and `PodMonitor` custom resources in order to tell Prometheus which endpoints to scrape.
15 |
16 | The examples below show the namespace `openshift-monitoring`, which can be replaced with `openshift-user-workload-monitoring` when dealing with `UWM`.
17 |
18 | A detailed description of how the resources are linked exists [here](https://prometheus-operator.dev/docs/operator/troubleshooting/#troubleshooting-servicemonitor-changes), but we will walk through some common issues to debug the case of missing metrics.
19 |
20 | 1. Ensure the `serviceMonitorSelector` in the `Prometheus` CR matches the key in the `ServiceMonitor` labels.
21 | 2. The `Service` you want to scrape *must* have an explicitly named port.
22 | 3. The `ServiceMonitor` *must* reference the `port` by this name.
23 | 4. The label selector in the `ServiceMonitor` must match an existing `Service`.
24 |
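25 | For illustration, a minimal `Service`/`ServiceMonitor` pair satisfying points 2-4 might look like the sketch below (all names, namespaces and labels are hypothetical and need to match your workload and the `serviceMonitorSelector` of the Prometheus CR):
26 |
27 | ```bash
28 | cat <<'EOF' | oc apply -f -
29 | apiVersion: v1
30 | kind: Service
31 | metadata:
32 |   name: example-app
33 |   namespace: ns1
34 |   labels:
35 |     app: example-app
36 | spec:
37 |   selector:
38 |     app: example-app
39 |   ports:
40 |   - name: metrics      # point 2: the port is explicitly named
41 |     port: 8080
42 |     targetPort: 8080
43 | ---
44 | apiVersion: monitoring.coreos.com/v1
45 | kind: ServiceMonitor
46 | metadata:
47 |   name: example-app
48 |   namespace: ns1
49 | spec:
50 |   selector:
51 |     matchLabels:
52 |       app: example-app   # point 4: matches the Service labels
53 |   endpoints:
54 |   - port: metrics        # point 3: references the Service port by name
55 |     interval: 30s
56 | EOF
57 | ```
58 |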
25 | Assuming these criteria are met but the metrics still don't exist, we can try to debug the cause.
26 |
27 | There is a possibility Prometheus has not loaded the configuration yet. The following metrics will help to determine if that is in fact the case or if there are errors in the configuration:
28 |
29 | ```bash
30 | prometheus_config_last_reload_success_timestamp_seconds
31 | prometheus_config_last_reload_successful
32 | ```
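33 |
34 | For a quick check, these metrics can be queried directly from the Prometheus container, for example:
35 |
36 | ```bash
37 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
38 |   curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=prometheus_config_last_reload_successful'
39 | ```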
33 |
34 | If there are errors with reloading the configuration, it is likely the configuration itself is invalid and examining the logs will highlight this.
35 |
36 | ```bash
37 | oc logs -n openshift-monitoring prometheus-k8s-0 -c prometheus
38 | ```
39 |
40 | Assuming that the reload was a success, then Prometheus should see the configuration.
41 |
42 | ```bash
43 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/status/config | grep "<servicemonitor-name>"
44 | ```
45 |
46 | If the `ServiceMonitor` does not exist in the output, the next step would be to investigate the logs of both `prometheus` and the `prometheus-operator` for errors.
47 |
48 | Assuming it does exist then we know `prometheus-operator` is doing its job. Double check the `ServiceMonitor` definition.
49 |
50 | Check the service discovery endpoint to ensure Prometheus can discover the target. It will need the appropriate RBAC to do so. An example can be found [here](https://github.com/openshift/cluster-monitoring-operator/blob/23201e012586d4864ca23593621f843179c47412/assets/prometheus-k8s/role-specific-namespaces.yaml#L35-L50).
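51 |
52 | One way to inspect this from the command line is to query the targets API directly; targets that were discovered but dropped after relabelling usually point at relabelling or selector issues (a sketch, adjust the pod name as needed):
53 |
54 | ```bash
55 | # Count targets that were discovered but dropped after relabelling.
56 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
57 |   curl -s 'http://localhost:9090/api/v1/targets?state=dropped' | jq '.data.droppedTargets | length'
58 | ```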
51 |
52 | ## How do I troubleshoot the TargetDown alert?
53 |
54 | First of all, check the [TargetDown runbook](https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/TargetDown.md).
55 |
56 | We have, in the past, seen cases where the `TargetDown` alert was firing even though all endpoints appeared to be up. The following commands fetch some useful metrics to help identify the cause.
57 |
58 | As the alert fires, get the list of active targets in Prometheus
59 |
60 | ```bash
61 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/targets?state=active > targets.prometheus-k8s-0.json
62 |
63 | oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/targets?state=active > targets.prometheus-k8s-1.json
64 | ```
65 |
66 | ---
67 |
68 | The following reports all targets that Prometheus couldn’t connect to, along with the reason (timeout, connection refused, …).
69 |
70 | A `dialer_name` can be passed as a label to limit the query to interesting components. For example `{dialer_name=~".+openshift-.*"}`.
71 |
72 | ```bash
73 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=rate(net_conntrack_dialer_conn_failed_total{}[1h]) > 0' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-0.json
74 |
75 | oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=net_conntrack_dialer_conn_failed_total{} > 1' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-1.json
76 | ```
77 |
78 | ---
79 |
80 | Identify targets that are slow to serve metrics and may be considered as down.
81 |
82 | ```bash
83 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-0.json
84 |
85 | oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-1.json
86 | ```
87 |
88 | ## How do I troubleshoot high CPU usage of Prometheus?
89 |
90 | Often, when "high" CPU usage or spikes are identified it can be a symptom of expensive rules.
91 |
92 | A good place to start the investigation is the `/rules` endpoint of Prometheus; analyse any queries which might contribute to the problem by identifying excessive rule evaluation times.
93 |
94 | A sorted list of rule evaluation times can be gathered with the following:
95 |
96 | ```bash
97 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/rules' | jq -r '.data.groups[] | .rules[] | [.evaluationTime, .health, .name] | @tsv' | sort
98 | ```
99 |
100 | An overview of the timeseries database can be retrieved with:
101 |
102 | ```bash
103 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/status/tsdb' | jq
104 | ```
105 |
106 | Within Prometheus, the `prometheus_rule_evaluation_duration_seconds` metric can be used to view evaluation time by quantile for each instance. Additionally, the `prometheus_rule_group_last_duration_seconds` metric can be used to determine the rule groups with the longest evaluation times.
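107 |
108 | For example, the rule groups with the longest last evaluation can be listed with a query along these lines (a sketch; adjust as needed):
109 |
110 | ```bash
111 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
112 |   curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(prometheus_rule_group_last_duration_seconds)' | jq
113 | ```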
107 |
108 | ## How do I retrieve CPU profiles?
109 |
110 | In cases where excessive CPU usage is being reported, it might be useful to obtain [Pprof profiles](https://github.com/google/pprof/blob/02619b876842e0d0afb5e5580d3a374dad740edb/doc/README.md) from the Prometheus containers over a short time span.
111 |
112 | To gather CPU profiles over a period of 30 minutes, run the following:
113 |
114 | ```bash
115 | SLEEP_MINUTES=5
116 | duration=${DURATION:-30}
117 | while [ $duration -ne 0 ]; do
118 |   for i in 0 1; do
119 |     echo "Retrieving CPU profile for prometheus-k8s-$i..."
120 |     oc exec -n openshift-monitoring prometheus-k8s-$i -c prometheus -- curl -s http://localhost:9090/debug/pprof/profile?seconds="$duration" > cpu.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof;
121 |   done
122 |   echo "Sleeping for $SLEEP_MINUTES minutes..."
123 |   sleep $(( 60 * $SLEEP_MINUTES ))
124 |   (( --duration ))
125 | done
126 | ```
127 |
128 | ## How do I debug high memory usage?
129 |
130 | The following queries might prove useful for debugging.
131 |
132 | Calculate the ingestion rate over the last two minutes:
133 |
134 | ```bash
135 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
136 | -- curl -s http://localhost:9090/api/v1/query --data-urlencode \
137 | 'query=sum by(pod,job,namespace) (max without(instance) (rate(prometheus_tsdb_head_samples_appended_total{namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[2m])))' > samples_appended.json
138 | ```
139 |
140 | Calculate "non-evictable" memory:
141 |
142 | ```bash
143 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
144 | -- curl -s http://localhost:9090/api/v1/query --data-urlencode \
145 | 'query=sort_desc(sum by (pod,namespace) (max without(instance) (container_memory_working_set_bytes{namespace=~"openshift-monitoring|openshift-user-workload-monitoring", container=""})))' > memory.json
146 | ```
147 |
148 | ## How do I get memory profiles?
149 |
150 | In cases where excessive memory is being reported, it might be useful to obtain [Pprof profiles](https://github.com/google/pprof/blob/02619b876842e0d0afb5e5580d3a374dad740edb/doc/README.md) from the Prometheus containers over a short time span.
151 |
152 | To gather memory profiles over a period of 30 minutes, run the following:
153 |
154 | ```bash
155 | SLEEP_MINUTES=5
156 | duration=${DURATION:-30}
157 | while [ $duration -ne 0 ]; do
158 |   for i in 0 1; do
159 |     echo "Retrieving memory profile for prometheus-k8s-$i..."
160 |     oc exec -n openshift-monitoring prometheus-k8s-$i -c prometheus -- curl -s http://localhost:9090/debug/pprof/heap > heap.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof;
161 |   done
162 |   echo "Sleeping for $SLEEP_MINUTES minutes..."
163 |   sleep $(( 60 * $SLEEP_MINUTES ))
164 |   (( --duration ))
165 | done
166 | ```
167 |
--------------------------------------------------------------------------------
/content/Services/RHOBS/configuring-clients-for-observatorium.md:
--------------------------------------------------------------------------------
1 | # Configuring Clients for Red Hat’s Observatorium Instance
2 |
3 | - **Authors:**
4 | - [`@squat`](https://github.com/squat)
5 | - [`@spaparaju`](https://github.com/spaparaju)
6 | - [`@onprem`](https://github.com/onprem)
7 |
8 | ## Overview
9 |
10 | Teams that have identified a need for collecting their logs and/or metrics into a centralized service that offers querying and dashboarding may choose to send their data to Red Hat’s hosted Observatorium instance. This document details how to configure clients, such as Prometheus, to remote write data for tenants who have been onboarded to Observatorium following the team’s onboarding doc: [Onboarding a Tenant into Observatorium (internal)](https://docs.google.com/document/d/1pjM9RRvij-IgwqQMt5q798B_4k4A9Y16uT2oV9sxN3g/edit).
11 |
12 | ## 0. Register Service Accounts and Configure RBAC
13 |
14 | Before configuring any clients, follow the steps in the [Observatorium Tenant Onboarding doc (internal)](https://docs.google.com/document/d/1pjM9RRvij-IgwqQMt5q798B_4k4A9Y16uT2oV9sxN3g/edit) to register the necessary service accounts and give them the required permissions on the Observatorium platform. The result of this process should be an OAuth client ID and client secret pair for each new service account. Save these credentials somewhere secure.
15 |
16 | ## 1. Remote Writing Metrics to Observatorium
17 |
18 | ### Using the Cluster Monitoring Stack
19 |
20 | This section describes the process of sending metrics collected by the Cluster Monitoring stack on an OpenShift cluster to Observatorium.
21 |
22 | #### Background
23 |
24 | In order to remote write metrics from a cluster to Observatorium using the OpenShift Cluster Monitoring stack, the cluster’s Prometheus servers must be configured to authenticate and make requests to the correct URL. The OpenShift Cluster Monitoring ConfigMap exposes a user-editable field for configuring the Prometheus servers to remote write. However, because Prometheus does not support OAuth, it cannot authenticate directly with Observatorium and because the Cluster Monitoring stack is, for all intents and purposes, immutable, the Prometheus Pods cannot be configured with sidecars to do the authentication. For this reason, the Prometheus servers must be configured to remote write through an authentication proxy running on the cluster that in turn is pointed at Observatorium and is able to perform an OAuth flow and set the received access token on proxied requests.
25 |
26 | #### 1. Configure the Cluster Monitoring Stack
27 |
28 | The OpenShift Cluster Monitoring stack provides a ConfigMap that can be used to modify the configuration and behavior of the components. The first step is to modify the “cluster-monitoring-config” ConfigMap in the cluster to include a remote write configuration for Prometheus as shown below:
29 |
30 | ```yaml
31 | apiVersion: v1
32 | kind: ConfigMap
33 | metadata:
34 |   name: cluster-monitoring-config
35 |   namespace: openshift-monitoring
36 | data:
37 |   config.yaml: |
38 |     prometheusK8s:
39 |       retention: 2h
40 |       remoteWrite:
41 |       - url: http://token-refresher.openshift-monitoring.svc.cluster.local
42 |         queueConfig:
43 |           maxSamplesPerSend: 500
44 |           batchSendDeadline: 60s
45 |         writeRelabelConfigs:
46 |         - sourceLabels: [__name__]
47 |           regex: metric_name.*
48 |           action: keep  # only forward metrics whose name matches the regex
48 | ```
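49 |
50 | The ConfigMap can then be applied like any other resource; for example, assuming it is saved locally as `cluster-monitoring-config.yaml`:
51 |
52 | ```bash
53 | oc apply -f cluster-monitoring-config.yaml
54 | # Verify that the ConfigMap is in place.
55 | oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml
56 | ```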
49 |
50 | #### 2. Deploy the Observatorium Token-Refresher Proxy
51 |
52 | Because Prometheus does not have built-in support for acquiring OAuth2 access tokens for authorization, which are required by Observatorium, the in-cluster Prometheus must remote write its data through a proxy that is able to fetch access tokens and set them as headers on outbound requests. The Observatorium stack provides such a proxy, which may be deployed to the “openshift-monitoring” namespace and guarded by a NetworkPolicy so that only Prometheus can use the access tokens. The following snippet shows an example of how to deploy the proxy for the stage environment:
53 |
54 | ```bash
55 | export TENANT=
56 | export CLIENT_ID=
57 | export CLIENT_SECRET=
58 | # For staging:
59 | export STAGING=true
60 | cat <
61 | ```
181 | > **NOTE:** OAuth2 support was added to Prometheus in version 2.27.0. If using an older version of Prometheus without native OAuth2 support, the remote write traffic needs to go through a proxy such as [token-refresher](https://github.com/observatorium/token-refresher/), similar to what is described above for the Cluster Monitoring Stack.
182 |
183 | #### 1. Modify the Prometheus configuration
184 |
185 | The configuration file for Prometheus must be patched to include a remote write configuration as shown below. `<TENANT>`, `<CLIENT_ID>`, and `<CLIENT_SECRET>` need to be replaced with appropriate values.
186 |
187 | ```yaml
188 | remote_write:
189 | - url: https://observatorium-mst.api.stage.openshift.com/api/metrics/v1/<TENANT>/api/v1/receive
190 |   oauth2:
191 |     client_id: <CLIENT_ID>
192 |     client_secret: <CLIENT_SECRET>
193 |     token_url: https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token
194 | ```
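195 |
196 | Once Prometheus is reloaded with this configuration, remote write health can be checked from its own metrics; a quick sketch (metric names can vary slightly between Prometheus versions):
197 |
198 | ```bash
199 | # Succeeding and failing remote write samples over the last 5 minutes.
200 | curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=rate(prometheus_remote_storage_samples_total[5m])'
201 | curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=rate(prometheus_remote_storage_samples_failed_total[5m])'
202 | ```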
195 |
--------------------------------------------------------------------------------
/content/Proposals/Accepted/202106-handbook.md:
--------------------------------------------------------------------------------
1 | # 2021-06: Handbook
2 |
3 | * **Owners:**
4 | * [`@bwplotka`](https://github.com/bwplotka)
5 |
6 | * **Other docs:**
7 | * https://works.hashicorp.com/articles/writing-practices-and-culture
8 | * https://about.gitlab.com/handbook/handbook-usage/#wiki-handbooks-dont-scale
9 | * https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/5143
10 | * [Observability Team Process (Internal)](https://docs.google.com/document/d/1eojXStPdq1hYwv36pjE-vKR1q3dlBbpIx5w_L_v2gNo/edit)
11 | * [Monitoring Team Processes (Internal)](https://drive.google.com/drive/folders/1yU5DYhpBLzmp3q9SIhudRMZ1rsmQ_sR2)
12 |
13 | > TL;DR: I would like to propose to put all public documentation pieces related to the Monitoring Group (and not tied to a specific project) in the public GitHub repository called [`handbook`](https://github.com/rhobs/handbook). I propose to review all documents with a similar flow as code and put documents in the form of markdown files that can be read from both GitHub UI and automatically served on https://rhobs-handbook.netlify.app/ website.
14 | >
15 | > The [diagram](#flow-of-addingconsuming-documentation-to-handbook) below shows what fits into this handbook and what should be distributed to the relevant upstream project (e.g developer documentation).
16 |
17 | ## Why
18 |
19 | ##### Documentation is essential
20 |
21 | * Without good team process documentation, collaboration within the team can be challenging. Members have to figure out what to do on their own, or tribal knowledge has to be propagated. Surprises and conflicts can arise. On-boarding new team members is hard. It's vital given that our Red Hat teams are distributed over the world and working remotely.
22 | * Additionally, it's hard for any internal or external team to discover how to reach us or escalate without noise.
23 | * Without a good team subject matter overview, it's hard to wrap your head around the number of projects we participate in. In addition, each team member is proficient in a different area, and we miss some "index" overview of where to navigate for various project aspects (documentation, contributing, proposals, chats).
24 | * Even if documentation is created, it risks being placed in the wrong place.
25 | * Without a place for written design proposals (those in progress, those accepted and rejected), the team risks iterating over the same ideas repeatedly or challenging old ideas that were already researched.
26 | * Without good operational or configuration knowledge, we keep asking the same questions about, e.g., how to roll out service X or contribute to X, etc.
27 |
28 | ##### Despite strong incentives, writing documentation has proven to be one of the most unwanted tasks among engineers
29 |
30 | Demotivation arises because our (Google Docs based) process tends to create the following obstacles:
31 |
32 | * There are too many side decisions to make, e.g. where to put this documentation, what format to use, how long it should be, how to stay on-topic, are we sure this information is not recorded somewhere else? Every decision, even a small one, takes energy and carries the risk of procrastination.
33 | * There is no review process, so it's hard to maintain a high quality of those documents.
34 | * Created documentation is tough to consume and discover.
35 | * Because docs are hard to discover, the documentation tends to be often duplicated, has gaps, or is obsolete.
36 | * Documents used to be private, which brings extra demotivation. Some of the information is useful for the public audience. Some of this could be useful for external contributors. It's hard to reuse such private docs without recreating them.
37 |
38 | All of those make people decide NOT to write documentation but rather schedule another meeting and repeat the same information over and over.
39 |
40 | On a similar note, anyone looking for information about our teams' work, proposals or projects is demotivated to look for, find and read our documentation because it's not consistent, not in a single place, hard to discover or not complete.
41 |
42 | ### Pitfalls of the current solution
43 |
44 | * It mainly exists in Google Docs, which has the following issues:
45 |   * Not everything is in our Team drive; there are docs not owned by us, created ad hoc.
46 |   * It's painful to organize them well, e.g. in directories, since it's so easy to copy or create a new one.
47 |   * Even if it's organized well, it's not easily discoverable.
48 | * Existing Google-Doc-based documents are hard to consume. The formatting is widely different. Naming is inconsistent.
49 | * *Document creation is rarely actionable*. There is no review process, so the effort of creating a relevant document might be wasted, as the document gets lost. This also leads to docs being left in a half-completed state, demotivating readers from looking at them.
50 | * It's hard to track previous discussions around docs, who approved them (e.g. proposals).
51 | * It's not public, and it's hard to share best practices with other external and internal teams.
52 |
53 | ## Goals
54 |
55 | Goals and use cases for the solution as proposed in [How](#how):
56 |
57 | * Single source of truth for Monitoring Group Team docs like processes, overviews, runbooks, links for internal content.
58 | * Have a consistent documentation format that is readable and understandable.
59 | * Searchable and easily discoverable.
60 | * Process of adding documents should be easy and consistent.
61 | * Automation and normal review process should be in place to ensure high quality (e.g. link checking).
62 | * Allow public collaboration on processes and other docs.
63 |
64 | > NOTE: We would love to host the Logging and Tracing Teams if they choose to follow our process, but we don't want to enforce it. We are happy to extend this handbook from a Monitoring Group handbook to an Observability Group one, but it has to grow organically (if the Logging and Tracing teams see the value in joining us here).
65 |
66 | ### Audience
67 |
68 | The currently planned audience for proposed documentation content is following (in importance order):
69 |
70 | 1. Monitoring Group Team Members.
71 | 2. External Teams at Red Hat.
72 | 3. Teams outside Red Hat, contributors to our projects, potential future hires, people interested in best practices, team processes etc.
73 |
74 | ## Non-Goals
75 |
76 | * Support other formats than `Markdown` e.g. Asciidoc.
77 | * Replace official project or product documentation.
78 | * Precise design proposal process (it will come in a separate proposal).
79 | * Sharing Team Statuses, we use JIRA and GH issues for that.
80 |
81 | ## How
82 |
83 | The idea is simple:
84 |
85 | Let's make sure we maintain the process of adding/editing documentation **as easy and rewarding** as possible. This will increase the chances that team members will document things more often and adopt it as a habit. The produced content will more likely be complete and up-to-date, increasing the chances it will be helpful to our audience, which will reduce the meeting burden. This will make writing docs much more rewarding, which creates a positive loop.
86 |
87 | I propose to use git repository [`handbook`](https://github.com/rhobs/handbook) to put all related team documentation pieces there. Furthermore, I suggest reviewing all documents with a similar flow as code and placing information in the form of markdown files that can be read from both GitHub UI and automatically served on https://rhobs-handbook.netlify.app/ website.
88 |
89 | Pros:
90 | * Matches our [goals](#goals).
91 | * Sharing by default.
92 | * Low barriers to write documents in a consistent format, low barrier to consume it.
93 | * Ensures high quality with local CI and review process.
94 |
95 | Cons:
96 | * Some website maintenance is needed, but we use the same and heavily automated flow in Prometheus-operator, Thanos, Observatorium websites etc.
97 |
98 | The idea of a handbook is not new. Many organizations do this e.g [GitLab](https://about.gitlab.com/handbook/handbook-usage/#wiki-handbooks-dont-scale).
99 |
100 | > NOTE: The website style might not be perfect (https://rhobs-handbook.netlify.app/). Feel free to propose issues and fixes to the overall look and readability!
101 |
102 | ### Flow of Adding/Consuming Documentation to Handbook
103 |
104 | 
105 |
106 | If you want to add or edit markdown documentation, refer to [our technical guide](../../contributing.md).
107 |
108 | ## Alternatives
109 |
110 | 1. Organize Team Google Drive with all Google docs we have.
111 |
112 | Pros:
113 | * Great for initial collaboration
114 |
115 | Cons:
116 | * Inconsistent format
117 | * Hard to track approvers
118 | * Never know when the doc is "completed."
119 | * Hard to maintain over time
120 | * Hard to share and reuse outside
121 |
122 | 1. Create Red Hat scoped only, a private handbook.
123 |
124 | Pros:
125 | * No worry if we share something internal?
126 |
127 | Cons:
128 | * We don't have many internal things we don't want to share at the current moment. All our projects and products are public.
129 | * Sharing means we have to duplicate the information, copy it in multiple places.
130 | * Harder to share with external teams
131 | * We can't use open source tools, CIs etc.
132 |
133 | ## Action Plan
134 |
135 | * [X] Create handbook repo and website
136 | * [X] Create website automation (done with [mdox](https://github.com/bwplotka/mdox))
137 | * [ ] Move existing up-to-date public documents (team processes, project overviews, faqs, design docs) over to the handbook (deadline: End of July).
138 | * Current Team Drive: [internal](https://drive.google.com/drive/folders/1PJHtAtxBUHxbmMx1xftrNSOJEIoYVhyO)
139 | * TIP: You can use [Google Chrome Plugin](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) to convert Google Doc into markdown easily.
140 | * [ ] Clean up Team Drive from not used or non-relevant project (or move it to some trash dir).
141 |
--------------------------------------------------------------------------------
/content/Teams/observability-platform/sre.md:
--------------------------------------------------------------------------------
1 | # Observability Platform Team SRE Processes
2 |
3 | This document explains a few processes related to operating our production software together with AppSRE team. All operations can be summarised as the Site Reliability Engineering part of our work.
4 |
5 | ## Goals
6 |
7 | The main goal of our SRE work within this team is to deploy and operate software that meets the functional ([API](../../Projects/Observability/observatorium.md)) and performance ([Telemeter SLO](../../Services/RHOBS/use-cases/telemetry.md#service-level-agreement), [MST SLO](../../Services/RHOBS/use-cases/observability.md#service-level-agreement)) requirements of our customers.
8 |
9 | Currently, we offer an internal service called [RHOBS](../../Services/RHOBS) that is used across the company.
10 |
11 | ## Releasing Changes / Patches
12 |
13 | In order to maintain our software in production reliably, we need to be able to test, release and deploy our changes rapidly.
14 |
15 | We can divide the things we change into a few categories. Let's elaborate on the process for each category:
16 |
17 | ### Software Libraries
18 |
19 | A library is a project that is not deployed directly; rather, it is a dependency of a [micro-service application](#software-for-micro-services) we deploy. For example https://github.com/prometheus/client_golang or https://github.com/grpc-ecosystem/go-grpc-middleware.
20 |
21 | *Testing*:
22 | * Linters, unit and integration tests on every PR.
23 |
24 | *Releasing*
25 | * GitHub release using git tag (RC first, then major/minor or patch release).
26 |
27 | *Deploying*
28 | * Bumping the dependency in the service.
29 | * Following the service [release and deploy steps](#software-for-micro-services).
30 |
31 | ### Software for (Micro) Services
32 |
33 | Software for (micro) services usually lives in many separate open source repositories on GitHub, e.g. https://github.com/thanos-io/thanos, https://github.com/observatorium/api.
34 |
35 | *Testing*:
36 | * Linters, unit and integration tests on every PR.
37 |
38 | *Releasing*
39 | * GitHub release using git tag (RC first, then major/minor or patch release).
40 | * Building docker images in quay.io and dockerhub (backup) using CI for each tag or main branch PR.
41 |
42 | *Deploying*
43 | * Bumping docker tag version in configuration.
44 | * Following [configuration release and deployment steps](#configuration).
45 |
46 | ### Configuration
47 |
48 | All of our configuration is rooted in https://github.com/rhobs/configuration - configuration templates written in jsonnet. It can then be overridden by parameters defined in app-interface SaaS files.
49 |
50 | *Testing*:
51 | * [Building jsonnet resources](https://github.com/rhobs/configuration/blob/main/.github/workflows/ci.yml), linting jsonnet, validating openshift templates.
52 | * Validation on the app-interface side against Kubernetes YAML compatibility.
53 |
54 | *Releasing*
55 | * Merge to main and get the commit hash you want to deploy.
56 |
57 | *Deploying*
58 | * Make sure you are on RH VPN.
59 | * Propose a PR to https://gitlab.cee.redhat.com/service/app-interface that changes the ref for the desired environment to the desired commit SHA in the app-interface SaaS file for the desired tenant ([telemeter](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/rhobs/telemeter/cicd/saas.yaml#L70) or MST), or changes an environment parameter.
60 | * Ask the team for review. If the change impacts production heavily, notify AppSRE.
61 | * If only the `saas.yaml` file was changed, a `/lgtm` from the Observability Platform team is enough for the PR to get merged automatically.
62 | * If any other file was changed, an AppSRE engineer has to `/lgtm` it.
63 | * When merged, CI will deploy the changes to the namespace specified in `saas.yaml`, e.g. to production.
64 |
65 | NOTE: Don't change both production and staging in the same MR. NOTE: Deploy to production only changes that were previously in staging (automation for this is TBD).
66 |
67 | You can see the version change:
68 | * On the monitoring dashboard, e.g. https://prometheus.telemeter-prod-01.devshift.net/graph?g0.range_input=1h&g0.expr=thanos_build_info&g0.tab=0
69 | * In the deploy report on the `#team-monitoring-info` Slack channel on CoreOS Slack.
70 |
71 | ### Monitoring Resources
72 |
73 | Grafana Dashboards are defined here: https://github.com/rhobs/configuration/tree/main/observability/dashboards. Alerts and Rules are defined here: https://github.com/rhobs/configuration/blob/main/observability/prometheusrules.jsonnet
74 |
75 | *Testing*:
76 | * None yet ):
77 |
78 | *Releasing*
79 | * Merge to main and get the commit hash you want to deploy.
80 |
81 | *Deploying*
82 | * Use `synchronize.sh` to create an MR/PR against app-interface. This will copy all generated YAML resources to proper places.
83 | * Ask team for review. If change is impacting production heavily, notify AppSRE.
84 | * When merged, CI will deploy the change to production. You can see the version change on Monitoring dashboard too e.g https://prometheus.telemeter-prod-01.devshift.net/graph?g0.range_input=1h&g0.expr=thanos_build_info&g0.tab=0.
85 |
86 | ## On-Call Rotation
87 |
88 | Currently, RHOBS services are supported 24h/7d by two teams:
89 |
90 | * `AppSRE` team for infrastructure and "generalist" support. They are our first incident response point. They try to support our stack as far as runbooks and general knowledge allows.
91 | * `Observability Platform` Team (dev on-call) for incidents impacting the SLA that are outside of AppSRE's expertise (bug fixes, more complex troubleshooting). We are notified when AppSRE needs us.
92 |
93 | ## Incident Handling
94 |
95 | This is the process we as the Observability Team try to follow during incident response.
96 |
97 | An incident occurs when any of our services violates the SLO we set with our stakeholders. Refer to [Telemeter SLO](../../Services/RHOBS/use-cases/telemetry.md#service-level-agreement) and [MST SLO](../../Services/RHOBS/use-cases/observability.md#service-level-agreement) for details on the SLA.
98 |
99 | NOTE: The following procedure applies to both Production and Staging. Many teams, e.g. SubWatch, depend on a working staging environment, so follow a similar process as for production. The only difference is that *we do not need to mitigate or fix staging issues outside of office hours*.
100 |
101 | *Potential Trigger Points*:
102 |
103 | * You got notified on Slack by the AppSRE team or paged through PagerDuty.
104 | * You got notified about a potential SLA violation by a customer: unexpected responses, things that worked before do not work now, etc.
105 | * You touched production for unrelated reasons and noticed something worrying (error logs, an un-monitored resource, etc.).
106 |
107 | 1. If you are not on-call, [notify the Observability Platform on-call engineer](../../Services/RHOBS/use-cases/telemetry.md#escalations). If you are on-call, if the on-call engineer is not present, or if you agreed that you will handle this incident, go to step 2.
108 | 2. Straight away, create a JIRA ticket for the potential incident. Don't think twice; it's easy to create and essential for tracking the incident later on. Fill in the following fields:
109 |
110 | * Title: Symptom you see.
111 | * Type: Bug
112 | * Priority: Try to assess how important it is. If it impacts production, it's a "Blocker".
113 | * Component: RHOBS
114 | * (Important) Labels: `incident`, `no-qe`
115 | * Description: Mention how you were notified (ideally with a link to the alert/Slack thread). Mention what you know so far.
116 |
117 |  See example incident tickets [here](https://issues.redhat.com/issues/?jql=project%20%3D%20MON%20AND%20labels%20%3D%20incident)
118 |
119 | 3. If AppSRE is not yet aware, drop a link to the created incident ticket in the #sd-app-sre channel and notify `@app-sre-primary` and `@observatorium-oncall`. Don't repeat yourself; ask everyone to follow the ticket comments.
120 |
121 | * AppSRE may or may not create a dedicated channel, handle communication efforts and start an on-call Zoom meeting. We as the dev team don't need to worry about those elements; go to step 4.
122 | 4. Investigate possible mitigation. Ensure the problem is mitigated before focusing on root cause analysis.
123 |
124 | * Important: Note all performed actions and observations on the created JIRA ticket via comments. This allows anyone to follow what was checked and how. It is also essential for the detailed Post Mortem / RCA process later on.
125 | * Note on the JIRA ticket all automation or monitoring gaps you wish you had. This will be useful for follow-up actions after the incident.
126 | 5. Confirm with AppSRE that the incident is mitigated. Investigate the root cause. If mitigation is applied and the root cause is known, declare the incident over.
127 | 6. Note potential long-term fixes and ideas. Close the incident JIRA ticket.
128 |
129 | ### After Incident
130 |
131 | 1. After some time (within a week), start a Post-mortem (RCA) document in Google Docs. Use the following [Google Docs RCA template](https://docs.google.com/document/d/12ZVT35yApp7D-uT4p29cEhS9mpzin4Z-Ufh9eOiiaKU/edit). Put it in our RCA Team Google directory [here](https://drive.google.com/drive/folders/1z1j2ZwljT9jv-aYu7bkXzi03XMdBdFT9). Link it in the JIRA ticket too.
132 | 2. Collaborate on the Post Mortem. Make sure it is *blameless* but accurate. Share it as soon as possible with the team and AppSRE.
133 | 3. Once done, schedule a meeting with the Observability Platform team and optionally AppSRE to discuss the RCA/Post Mortem action items and their effect.
134 |
135 | > Idea: If you have time, before sharing the Post Mortem / RCA, perform a "Wheel of Misfortune". Select an on-call engineer who was not participating in the incident and simulate the error by triggering the root cause in a safe environment. Then meet together with the team and allow the engineer to coordinate the simulated incident. Help along the way to share knowledge and insights. This is the best way to on-board people to production topics.
136 |
137 | > NOTE: The freely available SRE book is a good source of general patterns for efficient incident management. Recommended read!
138 |
--------------------------------------------------------------------------------
/content/Proposals/Done/202106-proposals-process.md:
--------------------------------------------------------------------------------
1 | # 2021-06: Proposal Process
2 |
3 | * **Owners:**
4 | * [`@bwplotka`](https://github.com/bwplotka)
5 |
6 | * **Other docs:**
7 | * [KEP Process](https://github.com/kubernetes/enhancements/blob/master/keps/README.md)
8 | * [Observability Team Process (Internal)](https://docs.google.com/document/d/1eojXStPdq1hYwv36pjE-vKR1q3dlBbpIx5w_L_v2gNo/edit)
9 |
10 | > TL;DR: We would like to propose an improved, official proposal process for Monitoring Group that clearly states when, where and how to create proposal/enhancement/design documents.
11 |
12 | ## Why
13 |
14 | More extensive architectural, process, or feature decisions are hard to explain, understand and discuss. It takes a lot of time to describe the idea, to motivate interested parties to review it, give feedback and approve. That's why it is essential to streamline the proposal process.
15 |
16 | Given that we work in highly distributed teams and with multiple communities, we need to allow asynchronous discussions. This means it's essential to structure the discussions into shared documents. Persisting those decisions, once approved or rejected, is equally important, allowing us to understand previous motivations.
17 |
18 | There is a common saying [`"I've just been around long enough to know where the bodies are buried"`](https://twitter.com/AlexJonesax/status/1400103567822835714). We want to ensure that team-related knowledge is accessible to everyone, every day, no matter whether the team member is new or has been part of the team for ten years.
19 |
20 | ### Pitfalls of the current solution
21 |
22 | Currently, the Observability Platform team has its process defined [here (internal)](https://docs.google.com/document/d/1eojXStPdq1hYwv36pjE-vKR1q3dlBbpIx5w_L_v2gNo/edit#heading=h.kpdg1wrd3pcc), whereas the In-Cluster team did not define any official process ([as per here (internal)](https://docs.google.com/document/d/1vbDGcjMjJMTIWcua5Keajla9FzexjLKmVk7zoUc0_MI/edit#heading=h.n0ac5lllvh13)).
23 |
24 | In practice, both teams had a somewhat similar flow:
25 |
26 | * For upstream: Follow the upstream project's contributing guide, e.g. Thanos.
27 | * For downstream:
28 | * Depending on the size:
29 | * Small features can be proposed during the bi-weekly team-sync or directly in Slack.
30 | * If the team can reach consensus in this time, then document the decision somewhere written, e.g. an email, Slack message to which everyone can add an emoji reaction, etc.
31 | * Add a JIRA ticket to plan this work.
32 | * Large features might need a design doc:
33 | 1. Add a JIRA ticket for creating the design doc
34 | 2. Create a new Google Doc in the team folder based on [this template](https://docs.google.com/document/d/1ddl_dLxjoIvWQuRgLdzL2Gd1EX1mkJQUZ-rgUh-T4d8/edit)
35 | 3. Fill sections
36 | 4. Announce it on the team mailing list and Slack channel
37 |       5. Address comments / concerns. 6. Define what "done" means for this proposal, i.e. what is the purpose of this design document:
38 | * Knowledge sharing / Brain dump: This kind of document may not need a thorough review or any official approval
39 | * Long term vision and Execution & Implementation: If approved (with LGTM comments, or in an approved section) by a majority of the team and no major concerns consider it approved. NOTE: The same applies to rejected proposals.
40 | 1. If the document has no more offline comments and no consensus was reached, schedule a meeting with interested parties.
41 | 2. When the document changes status, move it to the appropriate status folder in the design docs directory of the team folder. If an approved proposal concerns a component with its own directory, e.g. Telemeter, then create a shortcut to the proposal document in the component-specific directory. This helps us find design documents by topic and by status.
42 |
43 | It served us well, but it had the following issues (really similar to ones stated in [handbook proposal](../Accepted/202106-handbook.md#pitfalls-of-the-current-solution)):
44 |
45 | * Even if our Google design docs are organized in our team drive, they are not easily discoverable.
46 | * Existing Google doc-based documents are hard to consume. The formatting varies widely and the naming is inconsistent.
47 | * Document creation is rarely actionable. There is no review process, so the effort of creating a relevant document might be wasted, as the document gets lost. This also leads to docs staying in a half-completed state, demotivating readers from looking at them.
48 | * It's hard to track previous discussions around proposals and who approved them.
49 | * It's not public, and it's hard to share good proposals with other external and internal teams.
50 |
51 | ## Goals
52 |
53 | Goals and use cases for the solution as proposed in [How](#how):
54 |
55 | * Allow easy collaboration and decision making on design ideas.
56 | * Have a consistent design style that is readable and understandable.
57 | * Ensure design docs are discoverable for better awareness and knowledge sharing about past decisions.
58 | * Define a clear review and approval process.
59 |
60 | ## Non-Goals
61 |
62 | * Define process for other documents (see [handbook proposal](../Accepted/202106-handbook.md#pitfalls-of-the-current-solution))
63 |
64 | ## How
65 |
66 | We want to propose an improved, official proposal process for Monitoring Group that clearly states *when, where and how* to create proposal/enhancement/design documents.
67 |
68 | Everything starts with a problem statement. It might be missing functionality, confusing or broken existing functionality, or an annoying process, performance, or security issue (or a potential one).
69 |
70 | ### Where to Propose Changes/Where to Submit Proposals?
71 |
72 | As defined in the [handbook proposal](../Accepted/202106-handbook.md#pitfalls-of-the-current-solution), our Handbook is meant to be an index for our team resources and a linking point to other distributed projects we maintain or contribute to.
73 |
74 | First, we need to identify whether the idea is something we can contribute to an upstream project, or whether it does not fit anywhere else, in which case we can leverage the [Handbook Proposal directory](..) and the [process](#handbook-proposal-process). See the algorithm below to find out:
75 |
76 | 
77 |
78 | [Internal Team Drive for Public and Confidential Proposals](https://drive.google.com/drive/folders/1WGqC3gMCxIQlrnjDUYfNUTPYYRI5Cxto)
79 |
80 | [Templates](../process.md#templates)
81 |
82 | ### Handbook Proposal Process
83 |
84 | If there is no problem, there is no need to change anything and no need for a proposal. This might feel trivial, but we should ask ourselves this question first, before even thinking about writing a proposal.
85 |
86 | It takes time to propose an idea, find consensus and implement more significant concepts, so let's not spend that time before it's worth it. Unfortunately, even good ideas sometimes have to wait for a good moment to be discussed.
87 |
88 | Let's assume the idea sounds interesting to you; what to do next, where to propose it? How to review it? Follow the algorithm below:
89 |
90 | 
91 |
92 | > Note: It's totally ok to reject a proposal if a team member feels the idea is wrong. It's better to explicitly oppose it than to ignore it and leave it in limbo.
93 |
94 | > NOTE: We would love to host the Logging and Tracing Teams if they choose to follow our process, but we don't want to enforce it. We are happy to extend this process from the Monitoring Group handbook to the Observability Group. Still, it has to grow organically (if the Logging and Tracing teams see the value of joining us here).
95 |
96 | ### On Review Process
97 |
98 | As you can see in the above algorithm, if the content relates to any upstream project, it should be proposed, reviewed and potentially implemented together with the community. This does not mean that you cannot involve other team members in this effort. Share the proposal with team members even if they are not part of the maintainers' team on a given project; any feedback and voice is useful and can help move the idea further.
99 |
100 | As with proposals that touch our team only, despite the mandatory approval process from leads, anyone can give feedback! Our process is in fact very similar to [HashiCorp's RFC process](https://works.hashicorp.com/articles/writing-practices-and-culture):
101 |
102 | > Once you’ve written the first draft of an RFC, share it with your team. They’re likely to have the most context on your proposal and its potential impacts, so most of your feedback will probably come at this stage. Any team member can comment on and approve an RFC, but you need explicit approval only from the appropriate team leads in order to move forward. Once the RFC is approved and shared with stakeholders, you can start implementing the solution. For major projects, also share the RFC to the company-wide email list. While most members of the mailing list will just read the email rather than the full RFC, sending it to the list gives visibility into major decisions being made across the company.
103 |
104 | ### Summary
105 |
106 | Overall, we want to build a culture where design docs are reviewed within a certain amount of time and authors (team members) are given feedback. This, coupled with recognizing the work and being able to add it to your list of achievements (even if the proposal was rejected), should bring more motivation for people and teams to assess ideas in a structured, sustainable way.
107 |
108 | ## Alternatives
109 |
110 | 1. Organize Team Google Drive with all Google docs we have.
111 |
112 | Pros:
113 | * Great for initial collaboration
114 |
115 | Cons:
116 | * Inconsistent format
117 | * Hard to track approvers
118 | * Never know when the doc is "completed."
119 | * Hard to maintain over time
120 | * Hard to share and reuse outside
121 |
122 | ## Action Plan
123 |
124 | * [X] Explain process in [Proposal Process Guide](../process.md)
125 | * [ ] Move existing up-to-date public design docs over to the Handbook (deadline: End of July).
126 | * TIP: You can use [Google Chrome Plugin](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) to convert Google Doc into markdown easily.
127 | * [ ] Propose a similar process to upstream projects that do not have it.
128 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/content/Projects/Observability/kube-rbac-proxy.md:
--------------------------------------------------------------------------------
1 | # kube-rbac-proxy
2 |
3 | `kube-rbac-proxy`, as the name suggests, is an HTTP proxy that sits in front of a workload and performs authentication and authorization of incoming requests using the `TokenReview` and `SubjectAccessReview` resources of the Kubernetes API.
4 |
5 | ## Workflow
6 |
7 | The purpose of `kube-rbac-proxy` is to distinguish between calls made by the same or different user(s) (or service account(s)) to endpoint(s) and to protect those endpoints from unauthorized resource access, based on the callers' trusted identity (e.g. tokens, TLS certificates, etc.) or the RBAC permissions they hold, respectively. Once the request is authenticated and/or authorized, the proxy forwards it to the upstream server and returns the response to the client unmodified.
8 |
9 | ### [**Authentication**](https://github.com/brancz/kube-rbac-proxy/blob/74181c75b8b6fbcde7eff1ca5fae353faac5cfae/pkg/authn/config.go#L33-L39)
10 |
11 | kube-rbac-proxy can be configured with one of two mechanisms for authentication:
12 |
13 | * [OpenID Connect](https://github.com/brancz/kube-rbac-proxy/blob/52e49fbdb75e009db4d02e3986e51fdba0526378/pkg/authn/oidc.go#L45-L63) where kube-rbac-proxy validates the client-provided token against the configured OIDC provider. This mechanism isn't used by the monitoring components.
14 |
15 | * Kubernetes API using bearer tokens or mutual TLS:
16 | * [Delegated authentication](https://github.com/kubernetes/apiserver/blob/8ad2e288d62d02276033ea11ee1efd94bb627836/pkg/authentication/authenticatorfactory/delegating.go#L102-L112) relies on Bearer tokens. The token represents the identity of the user or service account that is making the request and kube-rbac-proxy uses a [`TokenReview` request](https://github.com/kubernetes/apiserver/blob/21bbcb57c672531fe8c431e1035405f9a4b061de/plugin/pkg/authenticator/token/webhook/webhook.go#L51-L53) to verify the identity of the client.
17 |   * If kube-rbac-proxy is configured with a client certificate authority, it can also verify the identity of a client presenting a TLS certificate. Some monitoring components use this [mechanism](#downstream-usage), which avoids a round trip to the Kubernetes API server.
18 |
19 | In the case of a failed authentication, an [HTTP `401 Unauthorized` status code](https://github.com/brancz/kube-rbac-proxy/blob/9efde2a776fd516cfa082cc5f2c35c7f9e0e2689/pkg/filters/auth_test.go#L290) is returned (despite the "Unauthorized" name, this status code indicates an *authentication* failure, not an authorization one). Note that anonymous access is always disabled, and the proxy doesn't rely on HTTP headers to authenticate the request, but it can add them if started with `--auth-header-fields-enabled`.
20 |
21 | Refer to [this page](https://kubernetes.io/docs/reference/access-authn-authz/authentication/) for more information on authentication in Kubernetes.
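
To make the delegated authentication flow above more concrete, here is a sketch of the `TokenReview` object that the proxy effectively submits to the Kubernetes API server (the token value and the returned identity are illustrative placeholders, not taken from the repository):

```yaml
apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  # Bearer token extracted from the incoming "Authorization" header (placeholder).
  token: "<client-bearer-token>"
status:
  # Filled in by the API server when the token is valid.
  authenticated: true
  user:
    username: system:serviceaccount:openshift-monitoring:prometheus-k8s
    groups:
    - system:serviceaccounts
    - system:serviceaccounts:openshift-monitoring
    - system:authenticated
```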
22 |
23 | ### [**Authorization**](https://github.com/brancz/kube-rbac-proxy/blob/1c7f88b5e951d25a493a175e93515068f5c77f3b/pkg/authz/auth.go#L31C1-L37)
24 |
25 | Once authentication is done, `kube-rbac-proxy` must then decide whether to allow the user's request to go through or not. A [`SubjectAccessReview` request is created](https://github.com/kubernetes/apiserver/blob/21bbcb57c672531fe8c431e1035405f9a4b061de/plugin/pkg/authorizer/webhook/webhook.go#L57-L59) for the API server, which allows for the review of the subject's access to a particular resource. Essentially, it checks whether the authenticated user or service account has sufficient permissions to perform the desired action on the requested resource, based on the RBAC permissions granted to it. If so, the request is forwarded to the endpoint, otherwise it is rejected. It is worth mentioning that the HTTP verbs are internally mapped to their [corresponding RBAC verbs](https://github.com/brancz/kube-rbac-proxy/blob/ccd5bc7fec36f9db0747033c2d698cc75a0e314c/pkg/proxy/proxy.go#L49-L60). Note that static authorization (as described in the [downstream usage](#downstream-usage) section) without SubjectAccessReview is also possible.
26 |
27 | Once the request is authenticated and authorized, it is forwarded to the endpoint. The response from the endpoint is then forwarded back to the client. If the request fails at any point, the proxy returns an error response to the client. If the authorization step fails, i.e., the client doesn't have the required permissions to access the requested resource, `kube-rbac-proxy` returns an [HTTP `403 Forbidden` status code](https://github.com/brancz/kube-rbac-proxy/blob/9efde2a776fd516cfa082cc5f2c35c7f9e0e2689/pkg/filters/auth_test.go#L300) to the client and does not forward the request to the endpoint.
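
To illustrate, below is a sketch of the `SubjectAccessReview` the proxy would create for a `GET /metrics` request made by the Prometheus service account (an assumed example, not taken from the repository); the `nonResourceAttributes` verb follows the HTTP-to-RBAC verb mapping mentioned above:

```yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  # Identity established during the authentication step.
  user: system:serviceaccount:openshift-monitoring:prometheus-k8s
  nonResourceAttributes:
    # The HTTP "GET" method is mapped to the "get" RBAC verb.
    path: /metrics
    verb: get
status:
  # Filled in by the API server; "allowed: false" results in an HTTP 403.
  allowed: true
```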
28 |
29 | ## Downstream usage
30 |
31 | ### Inter-component communication
32 |
33 | In the context of monitoring, we're talking here about metric scrapes. These communications are usually secured using Mutual TLS (mTLS), which is a two-way authentication mechanism (see [configuring Prometheus to scrape metrics](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/)).
34 |
35 | Initially, the server (the component's kube-rbac-proxy sidecar) provides its digital certificate to the client (Prometheus), which validates the server's identity. The process is then reciprocated, as the client shares its digital certificate for authentication by the server. Following the successful completion of these authentication steps, a secure channel for encrypted communication is established, ensuring that data transfer between the entities is duly safeguarded.
36 |
37 | ```yaml
38 | apiVersion: apps/v1
39 | kind: Deployment
40 | ...
41 | spec:
42 | template:
43 | spec:
44 | containers:
45 | - name: kube-rbac-proxy
46 | image: quay.io/brancz/kube-rbac-proxy:v0.8.0
47 | args:
48 | - "--tls-cert-file=/etc/tls/private/tls.crt"
49 | - "--tls-private-key-file=/etc/tls/private/tls.key"
50 | - "--client-ca-file=/etc/tls/client/client-ca.crt"
51 | ...
52 | ```
53 |
54 | CMO specifies the aforementioned CA certificate in the [metrics-client-ca](https://github.com/openshift/cluster-monitoring-operator/blob/6795f509e77cce6d24e5a3e371a432ca22e1a8e7/assets/cluster-monitoring-operator/metrics-client-ca.yaml) ConfigMap, which is used to verify the client certificates presented to every `kube-rbac-proxy` container that's safeguarding a component. The component's `Service` endpoints are secured with the TLS `Secret` generated by annotating the `Service` with `service.beta.openshift.io/serving-cert-secret-name`. Internally, this requests the [`service-ca`](https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/authentication/configuring-certificates#understanding-service-serving_service-serving-certificate) controller to generate a `Secret` containing a certificate and key pair for `${service.name}.${service.namespace}.svc`. These TLS manifests are then used in various component `ServiceMonitors` to define their TLS configurations, and within CMO to ensure a "mutual" acknowledgement between the two.
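
As a minimal sketch (the component name and namespace are illustrative, not taken from CMO manifests), requesting a serving certificate for a component's `Service` looks like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-component
  namespace: openshift-monitoring
  annotations:
    # Asks the service-ca controller to create the "example-component-tls" Secret with a
    # certificate/key pair valid for example-component.openshift-monitoring.svc.
    service.beta.openshift.io/serving-cert-secret-name: example-component-tls
spec:
  ports:
  - name: metrics
    port: 8443
    targetPort: metrics
  selector:
    app.kubernetes.io/name: example-component
```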
55 |
56 | Static authorization involves configuring `kube-rbac-proxy` to allow access to certain resource or non-resource paths for known identities without issuing a `SubjectAccessReview` to the API server. The example below demonstrates how this can be employed to give a known `ServiceAccount` access to the `/metrics` endpoint; the `/metrics` endpoints exposed by various monitoring components are protected this way. Note that, after the initial user or service account authentication, the request path can additionally be restricted to a comma-separated list of allowed paths, as defined by the [`--allow-paths`](https://github.com/brancz/kube-rbac-proxy/blob/067e14a2d1ecdfe8c18da6b0a0507cd4684e2c1c/cmd/kube-rbac-proxy/app/options/options.go#L83) flag, [like so](https://github.com/openshift/cluster-monitoring-operator/blob/4e8efd9864bff3ff46499e86fef8bba1e0178f54/assets/alertmanager/alertmanager.yaml#L96).
57 |
58 | ```yaml
59 | apiVersion: v1
60 | kind: Secret
61 | ...
62 | stringData:
63 | # "path" is the path to match against the request path.
64 | # "resourceRequest" is a boolean indicating whether the request is for a resource or not.
65 | # "user" is the user to match against the request user.
66 | # "verb" is the verb to match against the corresponding request RBAC verb.
67 | config.yaml: |-
68 | "authorization":
69 | "static":
70 | - "path": "/metrics"
71 | "resourceRequest": false
72 | "user":
73 | "name": "system:serviceaccount:openshift-monitoring:prometheus-k8s"
74 | "verb": "get"
75 | ```
76 |
77 | For more details, refer to the `kube-rbac-proxy`'s [static authorization](https://github.com/brancz/kube-rbac-proxy/blob/4a44b610cd12c4cfe076a2b306283d0598c1bb7a/examples/static-auth/README.md#L169) example.
78 |
79 | For more information on collecting metrics in such cases, refer to [this section](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#exposing-metrics-for-prometheus) of the handbook.
80 |
81 | ### Securing API endpoints
82 |
83 | `kube-rbac-proxy` is also used to secure API endpoints such as Prometheus, Alertmanager and Thanos. In this case, the proxy is configured to authenticate requests based on bearer tokens and to perform authorization with `SubjectAccessReview`.
84 |
85 | The following components use the same method in their `kube-rbac-proxy` configuration `Secrets` to authorize the `/metrics` endpoint and restrict it to `GET` requests only:
86 |
87 | * `alertmanager-kube-rbac-proxy-metric` (`alertmanager`)
88 | * `openshift-user-workload-monitoring` (`alertmanager-user-workload`)
89 | * `kube-state-metrics-kube-rbac-proxy-config` (`kube-state-metrics`)
90 | * `node-exporter-kube-rbac-proxy-config` (`node-exporter`)
91 | * `openshift-state-metrics-kube-rbac-proxy-config` (`openshift-state-metrics`)
92 | * `kube-rbac-proxy` (`prometheus-k8s`) (additionally the `/federate` endpoint, for the telemeter as well as its own client)
93 | * `prometheus-operator-kube-rbac-proxy-config` (`prometheus-operator`)
94 | * `prometheus-operator-uwm-kube-rbac-proxy-config` (`prometheus-operator`)
95 | * `kube-rbac-proxy-metrics` (`prometheus-user-workload`)
96 | * `telemeter-client-kube-rbac-proxy-config` (`telemeter-client`)
97 | * `thanos-querier-kube-rbac-proxy-metrics` (`thanos-querier`)
98 | * `thanos-ruler-kube-rbac-proxy-metrics` (`thanos-ruler`)
99 |
100 | On the other hand, the example below depicts restricted access to a resource, i.e., `monitoring.coreos.com/prometheusrules` in the `openshift-monitoring` namespace.
101 |
102 | ```yaml
103 | apiVersion: v1
104 | kind: Secret
105 | ...
106 | stringData:
107 | # "resourceAttributes" describes attributes available for resource request authorization.
108 | # "rewrites" describes how SubjectAccessReview may be rewritten on a given request.
109 | # "rewrites.byQueryParameter" describes which HTTP URL query parameter is to be used to rewrite a SubjectAccessReview
110 | # on a given request.
111 | config.yaml: |-
112 | "authorization":
113 | "resourceAttributes":
114 | "apiGroup": "monitoring.coreos.com"
115 | "namespace": "{{ .Value }}"
116 | "resource": "prometheusrules"
117 | "rewrites":
118 | "byQueryParameter":
119 | "name": "namespace"
120 | ```
121 |
122 | The following components use the same method in their `kube-rbac-proxy` configuration `Secrets` to authorize the respective resources:
123 | * `alertmanager-kube-rbac-proxy` (`alertmanager`): `prometheusrules`
124 | * `alertmanager-kube-rbac-proxy-tenancy` (`alertmanager-user-workload`): `prometheusrules`
125 | * `kube-rbac-proxy-federate` (`prometheus-user-workload`): `namespaces`
126 | * `thanos-querier-kube-rbac-proxy-rules` (`thanos-querier`): `prometheusrules`
127 | * `thanos-querier-kube-rbac-proxy` (`thanos-querier`): `pods`
128 |
129 | Note that all applicable omitted configuration settings are interpreted as wildcards.
130 |
131 | ## Configuration
132 |
133 | Details on configuring `kube-rbac-proxy` under different scenarios can be found in the repository's [/examples](https://github.com/brancz/kube-rbac-proxy/tree/9f436d46699dfd425f2682e4338069642b682892/examples) section.
134 |
135 | ## Debugging
136 |
137 | In addition to enabling debug logs or compiling a custom binary with debugging capabilities (`-gcflags="all=-N -l"`), users can:
138 | * [put the CMO into unmanaged state](https://github.com/openshift/cluster-monitoring-operator/blob/e1803adfa64f9ef424cd6e10790791adbed25eb4/hack/local-cmo.sh#L68-L103) to enable a higher verbose level using `-v=12` (or higher), or,
139 | * grep [the audit logs](https://docs.openshift.com/container-platform/4.13/security/audit-log-view.html) for more information about requests or responses concerning `kube-rbac-proxy`.
140 |
--------------------------------------------------------------------------------
/content/Products/OpenshiftMonitoring/collecting_metrics.md:
--------------------------------------------------------------------------------
1 | # Collecting metrics with Prometheus
2 |
3 | This document explains how to ingest metrics into the OpenShift Platform monitoring stack. **It only applies for the OCP core components and Red Hat certified operators.**
4 |
5 | For user application monitoring, please refer to the [official OCP documentation](https://docs.openshift.com/container-platform/latest/monitoring/enabling-monitoring-for-user-defined-projects.html).
6 |
7 | ## Targeted audience
8 |
9 | This document is intended for OpenShift developers that want to expose Prometheus metrics from their operators and operands. Readers should be familiar with the architecture of the [OpenShift cluster monitoring stack](https://docs.openshift.com/container-platform/latest/monitoring/monitoring-overview.html#understanding-the-monitoring-stack_monitoring-overview).
10 |
11 | ## Exposing metrics for Prometheus
12 |
13 | Prometheus is a monitoring system that pulls metrics over HTTP, meaning that monitored targets need to expose an HTTP endpoint (usually `/metrics`) which will be queried by Prometheus at regular intervals (typically every 30 seconds).
14 |
15 | To avoid leaking sensitive information to potential attackers, all OpenShift components scraped by the in-cluster monitoring Prometheus should follow these requirements:
16 | * Use HTTPS instead of plain HTTP.
17 | * Implement proper authentication (e.g. verify the identity of the requester).
18 | * Implement proper authorization (e.g. authorize requests issued by the Prometheus service account or users with GET permission on the metrics endpoint).
19 |
20 | All the requirements are well defined in the [OpenShift developer documentation](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#metrics). As described in the [Client certificate scraping](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/client-cert-scraping.md) enhancement proposal, we recommend that the components rely on client TLS certificates for authentication/authorization. This is more efficient and robust than using bearer tokens because token-based authn/authz adds a dependency on (and additional load to) the Kubernetes API.
21 |
22 | To this end, the Cluster Monitoring Operator provisions a TLS client certificate for the in-cluster Prometheus. The client certificate is issued for the `system:serviceaccount:openshift-monitoring:prometheus-k8s` Common Name (CN) and signed by the `kubernetes.io/kube-apiserver-client` [signer](https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers). The certificate can be verified using the certificate authority (CA) bundle located at the `client-ca-file` key of the `kube-system/extension-apiserver-authentication` ConfigMap.
23 |
24 | {{% alert color="info" %}} In practice the Cluster Monitoring Operator creates a CertificateSigningRequest object for the `prometheus-k8s` service account which is automatically approved by the cluster-policy-controller. Once the certificate is issued by the controller, CMO provisions a secret named `metrics-client-certs` which contains the TLS certificate and key (respectively under the `tls.crt` and `tls.key` keys in the secret). CMO also rotates the certificate before it expires.{{% /alert %}}
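
For reference, here is a sketch of what the provisioned Secret roughly looks like (the `type` field and namespace are assumptions; the name and the `tls.crt`/`tls.key` keys are the ones described above):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: metrics-client-certs
  namespace: openshift-monitoring
type: kubernetes.io/tls  # assumed; the important part is the two data keys below
data:
  tls.crt: <base64-encoded client certificate issued for the prometheus-k8s service account>
  tls.key: <base64-encoded private key>
```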
25 |
26 | There are several options available depending on which framework your component is built.
27 |
28 | ### library-go
29 |
30 | If your component already relies on `*ControllerCommandConfig` from `github.com/openshift/library-go/pkg/controller/controllercmd`, it should automatically expose a TLS-secured `/metrics` endpoint which has a hardcoded authorizer for the `system:serviceaccount:openshift-monitoring:prometheus-k8s` service account (refer to the [github.com/openshift/library-go/pkg/authorization/hardcodedauthorizer](https://pkg.go.dev/github.com/openshift/library-go/pkg/authorization/hardcodedauthorizer) package and its [usage](https://github.com/openshift/library-go/blob/24668b1349e6276ebfa9f9e49c780559284defed/pkg/controller/controllercmd/builder.go#L277-L279)). Requests authenticating with bearer tokens are allowed if the authenticated user has the `get` verb permission on the `/metrics` (non-resource) URL.
31 |
32 | Example: the [Cluster Kubernetes API Server Operator](https://github.com/openshift/cluster-kube-apiserver-operator/).
33 |
34 | ### kube-rbac-proxy sidecar
35 |
36 | The "simplest" option when the component doesn't rely on `github.com/openshift/library-go` (and switching to library-go isn't an option) is to run a [`kube-rbac-proxy`](https://github.com/openshift/kube-rbac-proxy) sidecar in the same pod as the application being monitored.
37 |
38 | Here is an example of a sidecar container's definition to be added to the Pod's template of the Deployment (or Daemonset):
39 |
40 | ```yaml
41 | - args:
42 | - --secure-listen-address=0.0.0.0:8443
43 | - --upstream=http://127.0.0.1:8081 # make sure that the upstream container listens on 127.0.0.1.
44 | - --config-file=/etc/kube-rbac-proxy/config.yaml
45 | - --tls-cert-file=/etc/tls/private/tls.crt
46 | - --tls-private-key-file=/etc/tls/private/tls.key
47 | - --client-ca-file=/etc/tls/client/client-ca-file
48 | - --logtostderr=true
49 | - --allow-paths=/metrics
50 | image: quay.io/brancz/kube-rbac-proxy:v0.11.0 # usually replaced by CVO by the OCP kube-rbac-proxy image reference.
51 | name: kube-rbac-proxy
52 | ports:
53 | - containerPort: 8443
54 | name: metrics
55 | resources:
56 | requests:
57 | cpu: 1m
58 | memory: 15Mi
59 | securityContext:
60 | allowPrivilegeEscalation: false
61 | capabilities:
62 | drop:
63 | - ALL
64 | terminationMessagePolicy: FallbackToLogsOnError
65 | volumeMounts:
66 | - mountPath: /etc/kube-rbac-proxy
67 | name: secret-kube-rbac-proxy-metric
68 | readOnly: true
69 | - mountPath: /etc/tls/private
70 | name: secret-kube-rbac-proxy-tls
71 | readOnly: true
72 | - mountPath: /etc/tls/client
73 | name: metrics-client-ca
74 | readOnly: true
75 | [...]
76 | - volumes:
77 | # Secret created by the service CA operator.
78 | # We assume that the Kubernetes service exposing the application's pods has the
79 | # "service.beta.openshift.io/serving-cert-secret-name: kube-rbac-proxy-tls"
80 | # annotation.
81 | - name: secret-kube-rbac-proxy-tls
82 | secret:
83 | secretName: kube-rbac-proxy-tls
84 | # Secret containing the kube-rbac-proxy configuration (see below).
85 | - name: secret-kube-rbac-proxy-metric
86 | secret:
87 | secretName: secret-kube-rbac-proxy-metric
88 | # ConfigMap containing the CA used to verify the client certificate.
89 | - name: metrics-client-ca
90 | configMap:
91 | name: metrics-client-ca
92 | ```
93 |
94 | {{% alert color="info"%}}The `metrics-client-ca` ConfigMap needs to be created by your component and synced from the `kube-system/extension-apiserver-authentication` ConfigMap.{{% /alert %}}
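
A minimal sketch of that ConfigMap (the CA payload is a placeholder; the key name must match the path passed to `--client-ca-file` above):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-client-ca
  namespace: openshift-example
data:
  # Synced from the "client-ca-file" key of the
  # kube-system/extension-apiserver-authentication ConfigMap.
  client-ca-file: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
```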
95 |
96 | Here is a Secret containing the kube-rbac-proxy's configuration (it allows only HTTPS requests to the `/metrics` endpoint for the Prometheus service account):
97 |
98 | ```yaml
99 | apiVersion: v1
100 | kind: Secret
101 | metadata:
102 | name: secret-kube-rbac-proxy-metric
103 | namespace: openshift-example
104 | stringData:
105 | config.yaml: |-
106 | "authorization":
107 | "static":
108 | - "path": "/metrics"
109 | "resourceRequest": false
110 | "user":
111 | "name": "system:serviceaccount:openshift-monitoring:prometheus-k8s"
112 | "verb": "get"
113 | type: Opaque
114 | ```
115 |
116 | Example: [node-exporter](https://github.com/openshift/cluster-monitoring-operator/blob/e51a06ffdb974003d4024ade3545f5e5e6efe157/assets/node-exporter/daemonset.yaml#L65-L98) from the Cluster Monitoring operator.
117 |
118 | ### controller-runtime (>= v0.16.0)
119 |
120 | Starting with v0.16.0, the `controller-runtime` framework provides a way to expose and secure a `/metrics` endpoint using TLS with minimal effort.
121 |
122 | Refer to https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/metrics/server for details about TLS configuration and check the next section to understand how it needs to be configured.
123 |
124 | As an example, you can refer to the [Observability Operator](https://github.com/rhobs/observability-operator/blob/2d192ea23b05bf4e088cf331666fd616efbb3072/pkg/operator/operator.go#L143-L212) implementation.
125 |
126 | ### Roll your own HTTPS server
127 |
128 | {{% alert color="info" %}}You don't use `library-go`, `controller-runtime` >= v0.16.0 or don't want to run a `kube-rbac-proxy` sidecar.{{% /alert %}}
129 |
130 | In such situations, you need to implement your own HTTPS server for `/metrics`. As explained before, it needs to require and verify the TLS client certificate using the root CA stored under the `client-ca-file` key of the `kube-system/extension-apiserver-authentication` ConfigMap.
131 |
132 | In practice, the server should:
133 | * Set TLSConfig's `ClientAuth` field to `RequireAndVerifyClientCert`.
134 | * Reload the root CA when the source ConfigMap is updated.
135 | * Reload the server's certificate and key when they are updated.
136 |
137 | Example: https://github.com/openshift/cluster-monitoring-operator/pull/1870
138 |
139 | ## Configuring Prometheus to scrape metrics
140 |
141 | To tell the Prometheus pods running in the `openshift-monitoring` namespace (e.g. `prometheus-k8s-{0,1}`) to scrape the metrics from your operator/operand pods, you should use `ServiceMonitor` and/or `PodMonitor` custom resources.
142 |
143 | The workflow is:
144 | * Add the `openshift.io/cluster-monitoring: "true"` label to the namespace where the scraped targets live.
145 | * **Important: only OCP core components and Red Hat certified operators can set this label on namespaces.**
146 |   * OCP core components can set the label on their namespaces in the CVO manifests directly.
147 | * For OLM operators:
148 | * There's no automatic way to enforce the label (yet).
149 | * The OCP console will display a checkbox at installation time to enable cluster monitoring for the operator if you add the `operatorframework.io/cluster-monitoring=true` annotation to the operator's CSV and if the installation namespace starts with `openshift-`.
150 | * For CLI installations, the requirement should be detailed in the installation procedure ([example](https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/logging/cluster-logging-deploying#logging-loki-cli-install_cluster-logging-deploying) for the Logging operator).
151 | * Add Role and RoleBinding to give the prometheus-k8s service account access to pods, endpoints and services in your namespace.
152 | * In case of ServiceMonitor:
153 | * Create a Service object selecting the scraped pods.
154 | * Create a ServiceMonitor object targeting the Service.
155 | * In case of PodMonitor:
156 | * Create a PodMonitor object targeting the pods.
157 |
158 | Below is a fictitious example using a ServiceMonitor object to scrape metrics from pods deployed in the `openshift-example` namespace.
159 |
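**Namespace manifest**

A minimal sketch of the first step in the workflow above; for an OCP core component this label would typically be set directly in its CVO manifests:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-example
  labels:
    # Opts the namespace into scraping by the platform monitoring stack.
    openshift.io/cluster-monitoring: "true"
```
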
160 | **Role and RoleBinding manifests**
161 |
162 | ```yaml
163 | apiVersion: rbac.authorization.k8s.io/v1
164 | kind: Role
165 | metadata:
166 | name: prometheus-k8s
167 | namespace: openshift-example
168 | rules:
169 | - apiGroups:
170 | - ""
171 | resources:
172 | - services
173 | - endpoints
174 | - pods
175 | verbs:
176 | - get
177 | - list
178 | - watch
179 | ---
180 | apiVersion: rbac.authorization.k8s.io/v1
181 | kind: RoleBinding
182 | metadata:
183 | name: prometheus-k8s
184 | namespace: openshift-example
185 | roleRef:
186 | apiGroup: rbac.authorization.k8s.io
187 | kind: Role
188 | name: prometheus-k8s
189 | subjects:
190 | - kind: ServiceAccount
191 | name: prometheus-k8s
192 | namespace: openshift-monitoring
193 | ```
194 |
195 | **Service manifest**
196 |
197 | ```yaml
198 | apiVersion: v1
199 | kind: Service
200 | metadata:
201 | annotations:
202 | # This annotation tells the service CA operator to provision a Secret
203 | # holding the certificate + key to be mounted in the pods.
204 |     # The Secret name is the annotation's value (e.g. "tls-my-app-tls" below).
205 | service.beta.openshift.io/serving-cert-secret-name: tls-my-app-tls
206 | labels:
207 | app.kubernetes.io/name: my-app
208 | name: metrics
209 | namespace: openshift-example
210 | spec:
211 | ports:
212 | - name: metrics
213 | port: 8443
214 | targetPort: metrics
215 | # Select all Pods in the same namespace that have the `app.kubernetes.io/name: my-app` label.
216 | selector:
217 | app.kubernetes.io/name: my-app
218 | type: ClusterIP
219 | ```
220 |
221 | **ServiceMonitor manifest**
222 |
223 | ```yaml
224 | apiVersion: monitoring.coreos.com/v1
225 | kind: ServiceMonitor
226 | metadata:
227 | name: my-app
228 | namespace: openshift-example
229 | spec:
230 | endpoints:
231 | - interval: 30s
232 | # Matches the name of the service's port.
233 | port: metrics
234 | scheme: https
235 | tlsConfig:
236 | # The name of the server (CN) in the server's certificate.
237 | serverName: my-app.openshift-example.svc
238 | # The openshift-monitoring/k8s resource declares a `tls-client-certificate-auth`
239 |     # scrape class which defines the client certificate and key used by Prometheus
240 |     # to scrape the /metrics endpoint, as well as the certificate authority used
241 |     # to verify the server's certificate (*). Referencing the scrape class here
242 | # avoids specifying the exact filepaths.
243 | #
244 | # (*) the certificate authority is the cluster's CA bundle from the service CA operator.
245 | scrapeClass: tls-client-certificate-auth
246 | selector:
247 | # Select all Services in the same namespace that have the `app.kubernetes.io/name: my-app` label.
248 | matchLabels:
249 | app.kubernetes.io/name: my-app
250 | ```
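
**PodMonitor manifest (alternative)**

If you scrape the pods directly without a Service (the PodMonitor case from the workflow above), a minimal sketch could look like the following; the TLS settings are omitted for brevity and would mirror the ServiceMonitor example, and the `metrics` port name is assumed to match the container port exposing `/metrics`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
  namespace: openshift-example
spec:
  podMetricsEndpoints:
  - interval: 30s
    # Matches the name of the container port exposing /metrics.
    port: metrics
    scheme: https
  selector:
    # Select all Pods in the same namespace that have the `app.kubernetes.io/name: my-app` label.
    matchLabels:
      app.kubernetes.io/name: my-app
```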
251 |
252 | ## Next steps
253 |
254 | * [Configure alerting](alerting.md) with Prometheus.
255 | * [Send Telemetry metrics](telemetry.md).
256 |
--------------------------------------------------------------------------------
/content/Teams/onboarding/README.md:
--------------------------------------------------------------------------------
1 | # Team member onboarding
2 |
3 |
4 |
5 | - [Team member onboarding](#team-member-onboarding)
6 | - [Team Background](#team-background)
7 | - [People relevant to Prometheus who you should know about:](#people-relevant-to-prometheus-who-you-should-know-about)
8 | - [Thanos](#thanos)
9 | - [Talks](#talks)
10 | - [First days (accounts & access)](#first-days-accounts--access)
11 | - [First weeks](#first-weeks)
12 | - [General](#general)
13 | - [Watch these talks](#watch-these-talks)
14 | - [[optional] Additional information & Exploration](#optional-additional-information--exploration)
15 | - [Who’s Who?](#whos-who)
16 | - [First project](#first-project)
17 | - [First Months](#first-months)
18 | - [Second project](#second-project)
19 | - [Glossary](#glossary)
20 |
21 | Welcome to the Monitoring Team! We created this document to help guide you through your onboarding.
22 |
23 | Please fork this repo and propose changes to the content as a pull request if something is inaccurate, outdated or unclear. Use your fork to track the progress of your onboarding tasks, e.g. by keeping a copy in your fork with comments about your completion status. Please read the doc thoroughly and check off each item as you complete it. We constantly want to improve our onboarding experience, and your feedback helps future team members.
24 |
25 | This document contains some links to Red Hat internal documents and you won't be able to access them without a Red Hat associate login. We are still trying to keep as much information as possible in the open.
26 |
27 | ## Team Background
28 |
29 | Team Monitoring mainly focuses on Prometheus, Thanos, Observatorium, and their integration into Kubernetes and OpenShift. We are also responsible for the hosted multi-tenant Observability service which powers such services as OpenShift Telemetry and OSD metrics.
30 |
31 | Prometheus is a monitoring project initiated at SoundCloud in 2012. It became public and widely usable in early 2015. Since then, it has found adoption across many industries. In early 2016, development diversified away from SoundCloud as CoreOS hired one of the core developers. Prometheus is not a single application but an entire ecosystem of well-separated components, which work well together but can be used individually.
32 |
33 | CoreOS was acquired by Red Hat in early 2018. The CoreOS monitoring team became the Red Hat Monitoring Team, which has since evolved into the “Observability group”. It was divided into two teams:
34 |
35 | 1. The In-Cluster Observability Team
36 | 2. The Observability Platform team (aka RHOBS or Observatorium team)
37 |
38 | You may encounter references to these two teams. In early 2021 we decided to combine the efforts of these teams more closely in order to avoid working in silos and ensure we have a well functioning end-to-end product experience across both projects. We are, however, still split into separate scrum teams for efficiency. We are now collectively known as the “Monitoring Team” and each team works together across the various domains.
39 |
40 | We work on all important aspects of the upstream ecosystem and a seamless monitoring experience for Kubernetes using Prometheus. Part of that is integrating our open source efforts into the OpenShift Container Platform (OCP), the commercial Red Hat Kubernetes distribution.
41 |
42 | ## People relevant to Prometheus who you should know about:
43 |
44 | - Julius Volz ([@juliusv](https://github.com/juliusv)): Previously worked at SoundCloud where he developed Prometheus. Now working as an independent contractor and organizing [PromCon](http://promcon.io/) (the Prometheus community conference). He also worked with weave.works on a prototype for remote/long-term time series storage and with InfluxDB on PromQL support in Flux. He contributed the new Prometheus UI in React and created a new company, [PromLens](https://promlens.com/), offering a rich PromQL UI.
45 | - Bjoern Rabenstein ([@beorn7](https://github.com/beorn7)): Worked at SoundCloud but now works at Grafana. He's active again upstream and is the maintainer of pushgateway and client_golang (the Prometheus Go client library).
46 | - Frederic Branczyk ([@brancz](https://github.com/brancz)): Joined CoreOS in 2016 to work on Prometheus. Core team member since then and one of the minds behind our team's vision. Left Red Hat in 2020 to start his own company around continuous profiling (Polar Signals). Still very active upstream.
47 | - Julien Pivotto ([@roidelapluie](https://github.com/roidelapluie)): prometheus/prometheus maintainer. Very active in other upstream projects (Alertmanager, …).
48 |
49 | ## Thanos
50 |
51 | Thanos is a monitoring system which was created based on Prometheus principles. It is a distributed version of Prometheus where every piece of Prometheus, like scraping, querying, storage, recording, alerting, and compaction, can be deployed as a separate horizontally scalable component. This allows more flexible deployments and capabilities beyond single clusters. Thanos also supports object storage as the main storage option, allowing cheap long-term retention for metrics. In the end, it exposes the same (yet extended) Prometheus APIs and uses gRPC to communicate between components.
52 |
53 | Thanos was created because of the scalability limits of Prometheus in 2017. At that point a similar project, Cortex, was emerging too, but it was overly complex at that time. In November 2017, Fabian Reinartz ([@fabxc](https://github.com/fabxc), consulting for Improbable at that time) and Bartek Plotka ([@bwplotka](https://github.com/bwplotka)) teamed up to create Thanos based on the Prometheus storage format. Around February 2018 the project was shown at the Prometheus Meetup in London, and in Summer 2018 it was announced at PromCon 2018. In 2019, our team at Red Hat, led at that point by Frederic Branczyk [@brancz](https://github.com/brancz), contributed essential pieces allowing Thanos to receive remote-write (push model) Prometheus metrics. Since then, we have been able to leverage Thanos for Telemetry gathering and then for in-cluster monitoring, too.
54 |
55 | When working with it you will most likely interact with Bartek Plotka ([@bwplotka](https://github.com/bwplotka)) and other team members.
56 |
57 | These are the people you’ll be in most contact with when working upstream. If you run into any communication issues, please let us know as soon as possible.
58 |
59 | ## Talks
60 |
61 | Advocating for sane monitoring and alerting practices (especially focused on Kubernetes environments) and how Prometheus implements them is part of our team’s work. That can happen internally or on public channels. If you are comfortable giving talks on the topic or some specific work we have done, let us know so we can plan ahead to find you a speaking opportunity at meetups or conferences. If you are not comfortable but want to break this barrier, let us know as well; we can help you get more comfortable with public speaking step by step. If you want to submit a CFP for a talk, please add it to this [spreadsheet](https://docs.google.com/spreadsheets/d/1eo_JVND3k4ZnL25kgnhITSE2DBkyw8fwg3MyCXMjdYU/edit#gid=1880565406) and inform your manager.
62 |
63 | ## First days (accounts & access)
64 |
65 | 1. Follow up on [administrative tasks](https://docs.google.com/document/d/1bJSrlyc-e7bcOxV4sjx3FesMNVgdwNxUzMvIYywbt-0).
66 | 2. Understand the meetings the team attends:
67 |
68 | Ask your manager to be added to the [Observability Program](https://calendar.google.com/calendar/u/0?cid=cmVkaGF0LmNvbV91N3YwbGt2cnRuM2wwbWJmMnF2M2VkMm12MEBncm91cC5jYWxlbmRhci5nb29nbGUuY29t) calendar. Ensure you attend the following recurring meetings:
69 |
70 | * Team syncs
71 | * Sprint retro/planning
72 | * Sprint reviews
73 | * Weekly architecture call
74 | * 1on1 with manager
75 | * Weekly 1on1 with your mentor ([mentors are tracked here](https://docs.google.com/spreadsheets/d/1SpdBbZChBNuPHVtbCjOch1mfZGUuCjkrp7yyCClL9kk/edit#gid=0))
76 |
77 | ## First weeks
78 |
79 | Set up your computer and development environment and do some research. Feel free to come back to these tasks on an ongoing basis as needed; there is no need to complete them all at once.
80 |
81 | ### General
82 |
83 | 1. Review our product documentation (this is very important): [Understanding the monitoring stack | Monitoring | OpenShift Container Platform 4.10](https://docs.openshift.com/container-platform/4.10/monitoring/monitoring-overview.html#understanding-the-monitoring-stack_monitoring-overview)
84 | 2. Review our team’s process doc: [Monitoring Team Process](https://docs.google.com/document/d/1vbDGcjMjJMTIWcua5Keajla9FzexjLKmVk7zoUc0_MI/edit#heading=h.n0ac5lllvh13)
85 | 3. Review how others should formally submit requests to our team: [Requests: Monitoring Team](https://docs.google.com/document/d/10orRGt5zlmZ-XsXQNY-sg6lOzWDCrPmHP68Oi-ETU9I)
86 | 4. If you haven’t already, buy this book and make a plan to finish it over time (you can expense it): *“Site Reliability Engineering: How Google Runs Production Systems”*. An online version of the book can be found here: [https://sre.google/books/](https://sre.google/books/).
87 | 5. Ensure you attend a meeting with your team lead or architect for a general overview of our in-cluster OpenShift technology stack.
88 | 6. Ensure you attend a meeting with your team lead or architect for a general overview of our hosted Observatorium/Telemetry stack.
89 | 7. Bookmark this spreadsheet for reference of all [OpenShift release dates](https://docs.google.com/spreadsheets/d/19bRYespPb-AvclkwkoizmJ6NZ54p9iFRn6DGD8Ugv2c/edit#gid=0). Alternatively, you can add the [OpenShift Release Dates](https://calendar.google.com/calendar/embed?src=c_188dvhrfem5majheld63i20a7rslg%40resource.calendar.google.com&ctz=America%2FNew_York) calendar.
90 |
91 | ### Watch these talks
92 |
93 | * [Prometheus introduction](https://www.youtube.com/watch?v=PzFUwBflXYc) by Julius Volz (project’s cofounder) @ KubeCon EU 2020
94 | * [The Zen of Prometheus](https://www.youtube.com/watch?v=Nqp4fjw_omU), by Kemal (ex-Observability Platform team) @ PromCon 2020
95 | * [The RED Method: How To Instrument Your Services](https://www.youtube.com/watch?v=zk77VS98Em8) by Tom Wilkie @ GrafanaCon EU 2018
96 | * [Thanos: Prometheus at Scale](https://www.youtube.com/watch?v=q9j8vpgFkoY) by Lucas and Bartek (Observability Platform team) @ DevConf 2020
97 | * [Instrumenting Applications and Alerting with Prometheus](https://www.youtube.com/watch?v=sHKWD8XnmmY) by Simon (Cluster Observability team) @ OSSEU 2019
98 | * [PromQL for mere mortals](https://www.youtube.com/watch?v=hTjHuoWxsks) by Ian Billett (Observability Platform team) @ PromCon 2019
99 | * [Life of an alert (Alertmanager)](https://www.youtube.com/watch?v=PUdjca23Qa4) @ PromCon 2018
100 | * [Best practices and pitfalls](https://www.youtube.com/watch?v=_MNYuTNfTb4) @ PromCon 2017
101 | * [Deep Dive: Kubernetes Metric APIs using Prometheus](https://www.youtube.com/watch?v=cIoOAbzhR7k)
102 | * [Monitoring Kubernetes with prometheus-operator](https://www.youtube.com/watch?v=MuHPMXCGiLc) by Lili @ Cloud Native Computing Berlin meetup 2021
103 | * [Using Jsonnet to Package Together Dashboards, Alerts and Exporters](https://www.youtube.com/watch?v=b7-DtFfsL6E) by Tom Wilkie (Grafana Labs) @ KubeCon Europe 2018
104 | * [(Internal) Observatorium Deep Dive](https://drive.google.com/drive/u/0/folders/1NHhgoYi5y58wJpi_qp49tx1V9TQcRQTj), January 2021 by Kemal
105 | * [(Internal) How to Get Reviewers to Block your Changes](https://drive.google.com/file/d/1KOWv5A2qAoO1CfbfhCIJW1wcOnGpfTST/view), March 2022 by Assaf
106 |
107 | ### [optional] Additional information & Exploration
108 |
109 | * [https://www.linkedin.com/learning/](https://www.linkedin.com/learning/)
110 | * Red Hat has a corporate subscription to LinkedIn Learning that has great introductory courses to many topics relevant to our team
111 | * [Prometheus Monitoring](https://www.youtube.com/channel/UC4pLFely0-Odea4B2NL1nWA/videos) channel on Youtube
112 | * PromLabs [PromQL cheat sheet](https://promlabs.com/promql-cheat-sheet/)
113 | * [Prometheus-example-app](https://github.com/brancz/prometheus-example-app)
114 | * [Kubernetes-sample-controller](https://github.com/kubernetes/sample-controller)
115 |
116 | The team uses various tools; you should get familiar with them by reading through their documentation and trying them out:
117 |
118 | * [Go](https://golang.org/)
119 | * [Jsonnet](https://jsonnet.org/)
120 | * [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/)
121 | * [Kubernetes](https://kubernetes.io/)
122 | * [Thanos](https://thanos.io/)
123 | * [Observatorium](https://github.com/observatorium)
124 |
125 | ### Who’s Who?
126 |
127 | * For all the teams and people in OpenShift Engineering, see this [Team Member Tracking spreadsheet](https://docs.google.com/spreadsheets/d/1M4C41fX2J1nBXhqPdtwd8UP4RAx98NA4ByIUv-0Z0Ds/edit?usp=drive_web&ouid=116712625969749019622). Bookmark this and refer to it as needed.
128 | * Schedule a meeting with your manager to go over the team organizational structure
129 |
130 | ### First project
131 |
132 | Your first project should ideally:
133 |
134 | * Provide an interesting set of related tasks that make you familiar with various aspects of internal and external parts of Prometheus and OpenShift.
135 | * Encourage discussion with other upstream maintainers and/or people at Red Hat.
136 | * Be aligned with the area of Prometheus and more generally the monitoring stack you want to work on.
137 | * Have a visible impact for you and others.
138 |
139 | Here’s a list of potential starter projects. Talk to us to discuss them in more detail and figure out which one suits you.
140 |
141 | (If you are not a new hire, please add/remove projects as appropriate)
142 |
143 | * Set up Prometheus, Alertmanager, and node-exporter (see the example configuration sketch after this list)
144 | * As binaries on your machine (Bonus: Compile them yourself)
145 | * As containers
146 | * Set up Prometheus as a StatefulSet on vanilla Kubernetes (minikube or your tool of choice)
147 | * Try the Prometheus Operator on vanilla Kubernetes (minikube or your tool of choice)
148 | * Try kube-prometheus on vanilla Kubernetes
149 | * Try the cluster-monitoring-operator on OpenShift (the easiest way is through the cluster-bot on Slack)
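
For the first bullet above, the following `prometheus.yml` is a minimal sketch to get started. It assumes Prometheus, Alertmanager, and node-exporter run locally on their default ports (9090, 9093, 9100); it is only an illustration, not a recommended production setup.

```yaml
# Minimal sketch, assuming Prometheus, Alertmanager, and node-exporter
# run locally on their default ports. Adjust targets as needed.
global:
  scrape_interval: 30s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  # Prometheus scraping its own metrics.
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  # node-exporter for host-level metrics.
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]
```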
150 |
151 | During the project, keep the feedback cycles with other people as long or as short as your confidence allows. If you are not sure, ask! Try to briefly check in with the team regularly.
152 |
153 | Try to submit any coding work in small batches. This makes it easier for us to review and realign quickly.
154 |
155 | Everyone gets stuck sometimes. There are various smaller issues around the Prometheus and Alertmanager upstream repositories and the different monitoring operators. If you need a bit of distance, tackle one of them for a while and then get back to your original problem. This will also help you to get a better overview. If you are still stuck, just ask someone and we’ll discuss things together.
156 |
157 | ## First Months
158 |
159 | * If you will be starting out working more closely with the in-cluster stack, be sure to review this document as well: [In-Cluster Monitoring Onboarding](https://docs.google.com/document/d/16Uzd8OLkBdN0H4KxqQIr7HTPYZWHKz3WdGtOlB6Rcdk/edit#). Otherwise, if you are starting out more focused on the Observatorium service, review this doc: [Observatorium Platform Onboarding](https://docs.google.com/document/d/1RXSJYpx2x3bje6fwy2PEUSOgDrBlxq24A5vh2mHcxnk/edit#)
160 | * Try to get *something* (anything) merged into one of our repositories
161 | * Begin your 2nd project
162 | * Create a PR for the master onboarding doc (this one) with improvements you think would help others
163 |
164 | ### Second project
165 |
166 | After your starter project is done, we’ll discuss how it went and what your future projects will be. By then you'll hopefully have a good overview of which areas you are interested in and what their priority is. Discuss with your team lead or manager what your next project will be.
167 |
168 | ## Glossary
169 |
170 | Our team's glossary can be found [here](https://docs.google.com/document/d/1bJSrlyc-e7bcOxV4sjx3FesMNVgdwNxUzMvIYywbt-0/edit#heading=h.9lupa64ck0pj).
171 |
--------------------------------------------------------------------------------
/content/Products/OpenshiftMonitoring/telemetry.md:
--------------------------------------------------------------------------------
1 | # Sending metrics via Telemetry
2 |
3 | ## Targeted audience
4 |
5 | This document is intended for OpenShift developers that want to ship new metrics to the Red Hat Telemetry service.
6 |
7 | ## Background
8 |
9 | Before going into the details, a few words about [Telemetry](https://rhobs-handbook.netlify.app/services/rhobs/use-cases/telemetry.md/) and the process for adding a new metric.
10 |
11 | **What is Telemetry?**
12 |
13 | Telemetry is a system operated and hosted by Red Hat that collects data from connected clusters to enable subscription management automation, monitor the health of clusters, assist with support, and improve the customer experience.
14 |
15 | **What does sending metrics via Telemetry mean?**
16 |
17 | You should send metrics via Telemetry when you need to see them for **all** OpenShift clusters. This is primarily for gaining insight into how OpenShift is used, and for troubleshooting and monitoring the fleet of clusters. Users can already see these metrics in their own clusters via Prometheus, even when they are not available via Telemetry.
18 |
19 | **How are metrics shipped via Telemetry?**
20 |
21 | Only metrics which are already collected by the [in-cluster monitoring stack](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#in-cluster-monitoring-stack) can be shipped via Telemetry. The `telemeter-client` pod running in the `openshift-monitoring` namespace collects metrics from the `prometheus-k8s` service every 4m30s using the `/federate` endpoint and ships the samples to the Telemetry endpoint using a custom protocol.
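
The transport used by `telemeter-client` is its own custom protocol, but the collection side relies on standard Prometheus federation. As an illustration only (this is not the actual telemeter-client configuration, and the series selectors below are placeholders), a plain Prometheus federation job against `prometheus-k8s` looks roughly like this:

```yaml
# Illustration only: a generic Prometheus federation job selecting a small
# allowlist of series, similar in spirit to what telemeter-client does.
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__="cluster_version"}'      # placeholder allowlist entry
        - '{__name__=~"cluster:usage:.*"}'    # placeholder allowlist pattern
    static_configs:
      - targets:
          - prometheus-k8s.openshift-monitoring.svc:9091   # example target
```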
22 |
23 | **How long will it take for my new telemetry metrics to show up?**
24 |
25 | Please start this process and involve the monitoring team as early as possible. The process described in this document includes a thorough review of the underlying metrics and labels. The monitoring team will try to understand your use case and may propose improvements and optimizations. Metric, label, and rule names will be reviewed against the [naming best practices](https://prometheus.io/docs/practices/naming/). This can take several review rounds over multiple weeks.
26 |
27 | ## Requirements
28 |
29 | Shipping metrics via Telemetry is only possible for components running in namespaces with the `openshift.io/cluster-monitoring=true` label. In practice, this means that your component falls into one of these two categories:
30 | * Your operator/operand is included in the OCP payload (e.g. it is a core/platform component).
31 | * Your operator/operand is deployed via OLM and has been certified by Red Hat.
32 |
33 | Your component should already be instrumented and scraped by the [in-cluster monitoring stack](#in-cluster-monitoring-stack) using `ServiceMonitor` and/or `PodMonitor` objects.
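
As a sketch of what meeting these requirements can look like (all names below are placeholders, not a definitive manifest), the component's namespace carries the required label and a `ServiceMonitor` points the in-cluster stack at its metrics endpoint:

```yaml
# Sketch only: placeholder names, adjust for your operator/operand.
# Assumes a Service with the matching label and a port named "metrics" exists.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-example-operator
  labels:
    openshift.io/cluster-monitoring: "true"   # required for Telemetry eligibility
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-operator
  namespace: openshift-example-operator
spec:
  endpoints:
    - port: metrics
      interval: 30s
  selector:
    matchLabels:
      app.kubernetes.io/name: example-operator
```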
34 |
35 | ## Sending metrics via Telemetry step-by-step
36 |
37 | The overall process is as follows:
38 | 1. Request approval from the monitoring team.
39 | 2. Configure recording rules using `PrometheusRule` objects.
40 | 3. Modify the configuration of the Telemeter client in the [Cluster Monitoring Operator](https://github.com/openshift/cluster-monitoring-operator/) repository to collect the new metrics.
41 | 4. Synchronize the Telemeter server's configuration from the Cluster Monitoring Operator project.
42 | 5. Wait for the Telemeter server's configuration to be rolled out to production.
43 |
44 | ### Request approval
45 |
46 | The first step is to identify which metrics you want to send via Telemetry and what the [cardinality](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#what-is-the-cardinality-of-a-metric) of the metrics is (i.e. how many time series there will be in total). Typically, you start with metrics that show how your component is being used. In practice, we recommend starting with no more than:
47 | * 1 to 3 metrics.
48 | * 1 to 10 timeseries per metric.
49 | * 10 timeseries in total.
50 |
51 | If you are above these limits, you have 2 choices:
52 | * (recommended) aggregate the metrics before sending, for instance by summing all values of a given metric (see the recording rule sketch after this list).
53 | * request an exception from the monitoring team. The exception requires approval from upper management, so make sure that your request is well motivated!
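
For the aggregation option, a recording rule can collapse a high-cardinality metric into a handful of time series before it is picked up for Telemetry. A minimal sketch, assuming a hypothetical `example_operator_feature_usage` metric (the metric, label, and rule names below are placeholders):

```yaml
# Sketch only: metric, label, and rule names are placeholders for illustration.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-operator-telemetry-rules
  namespace: openshift-example-operator
spec:
  groups:
    - name: example-operator.telemetry.rules
      rules:
        # Pre-aggregate per-pod series into one series per feature.
        - record: cluster:example_operator_feature_usage:sum
          expr: sum by (feature) (example_operator_feature_usage)
```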
54 |
55 | Finally, your metric **MUST NOT** contain any personally identifiable information (names, email addresses, information about user workloads).
56 |
57 | Use the following information to file 1 JIRA ticket per metric in the [MON project](https://issues.redhat.com//secure/CreateIssueDetails!init.jspa?pid=12323177&issuetype=3&labels=telemetry-review-request&summary=Send+metric+...+via+Telemetry&description=h1.%20Request%20for%20sending%20data%20via%20telemetry%0A%0AThe%20goal%20is%20to%20collect%20metrics%20about%20...%20because%20...%0A%0Ah2.%20%3CMetric%20name%3E%0A%0A%3CMetric%20name%3E%20represents%20...%0A%0ALabels%0A%2A%20%3Clabel%201%3E%2C%20possible%20values%20are%20...%0A%2A%20%3Clabel%202%3E%2C%20possible%20values%20are%20...%0A%0AThe%20cardinality%20of%20the%20metric%20is%20at%20most%20%3CX%3E.%0A%0AComponent%20exposing%20the%20metric%3A%20https%3A%2F%2Fgithub.com%2F%3Corg%3E%2F%3Cproject%3E%0Ah2.&priority=4):
58 |
59 | * Type: `Task`
60 | * Title: `Send metric via Telemetry`
61 | * Label: `telemetry-review-request`
62 | * Description template:
63 |
64 | ```
65 | h1. Request for sending data via telemetry
66 |
67 | The goal is to collect metrics about ... because ...
68 |
69 | <Metric name> represents ...
70 |
71 | Labels
72 | *