├── 01-intro-to-prometheus.md
├── 02-write-your-first-query.md
├── 03-filtering-with-labels.md
├── 04-instant-and-range-vectors.md
├── 05-counters.md
├── 06-operators.md
├── 07-aggregating-by-label.md
├── 08-functions.md
├── 09-histograms.md
├── 10-summaries.md
├── 11-set-operators-and-vector-matches.md
├── LICENSE
└── README.md
/01-intro-to-prometheus.md:
--------------------------------------------------------------------------------
1 | # Intro to Prometheus
2 |
3 | ## Goal
4 | - Get a basic understanding of what Prometheus is
5 | - Be confident with basic querying so you can use it to resolve an incident or create a dashboard
6 |
7 | Note: we aren’t going to focus on setting up the monitoring itself; we assume Prometheus is already set up and scraping your apps.
8 |
9 | ## What is Prometheus
10 |
11 | https://prometheus.io/
12 |
13 | - monitoring tool
14 | - Similar/competitor products include Graphite and InfluxDB
15 | - time series DB
16 | - uses /metrics pages (see the sample exposition snippet after this list)
17 | - here is one of the metrics on the metrics page - https://observe-paas-prometheus-exporter.cloudapps.digital/metrics
18 | - pull model for metrics rather than push model
19 | - handles alerting rules but is not responsible for delivery of alerts (that is Alertmanager's job)
20 | - only basic graphing built in (use Grafana for dashboarding)
21 |
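A /metrics page is just plain text in the Prometheus exposition format. As a small illustrative sketch (the metric name and values here are made up), an entry on such a page looks like:

```
# HELP http_requests_handled_total The total number of HTTP requests handled.
# TYPE http_requests_handled_total counter
http_requests_handled_total{status_range="2xx"} 1027
http_requests_handled_total{status_range="5xx"} 3
```

Prometheus scrapes pages like this on a schedule and stores each sample in its time series database.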
22 |
23 | ## What Reliability Engineering Observe provides
24 | - Prometheus, Alertmanager and Grafana.
25 | - Integration with Zendesk for tickets and Pagerduty for alerts.
26 | - Instructions for how to set up and use can be found in the Reliability Engineering docs.
27 | - It's being used for monitoring and alerting by RE Observe, Registers, data.gov.uk and Notify
28 |
--------------------------------------------------------------------------------
/02-write-your-first-query.md:
--------------------------------------------------------------------------------
1 | # Your first query
2 |
3 | https://prom-3.monitoring.gds-reliability.engineering/graph
4 |
5 | ## User interface
6 |
7 | - 'Help' takes you to the Prometheus docs
8 | - 'Alerts' has all the Prometheus alerts
9 | - 'Status > Targets' has all the targets that are being scraped by Prometheus
10 | - 'Graph' has a console and graph view for your metrics
11 |
12 | ## First metric
13 |
14 | `up` is a special metric added by Prometheus which indicates whether the /metrics page for a target could be scraped. It is a `gauge` metric, meaning its value can go up and down. 1 represents a successful scrape of a target, 0 a failed scrape.
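Querying `up` returns one time series per scrape target. As an illustrative sketch (these label values are hypothetical), the console output looks something like:

```
up{instance="prom-1:9090", job="prometheus"}  1
up{instance="paas-exporter.example:443", job="observe-paas-prometheus-exporter"}  0
```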
15 |
16 | ## Exercise
17 |
18 | 1. Query `up` in the Prometheus user interface.
19 | 2. Swap between the console and the graph tabs.
20 | 3. Extend the interval to 24h in the graph tab.
21 |
22 | ## Docs
23 |
24 | https://prometheus.io/docs/prometheus/latest/querying/
25 |
--------------------------------------------------------------------------------
/03-filtering-with-labels.md:
--------------------------------------------------------------------------------
1 | # Filtering with labels
2 |
3 | The query `up` returns many time series (you can see how many in the top right of the console). You can filter using labels, and you can use more than one label at a time.
4 |
5 | `up{job="observe-paas-prometheus-exporter"}`
6 |
7 | You can use a regex, for example to get all time series for the metric `up` whose `job` label ends with `exporter`:
8 |
9 | `up{job=~".*exporter"}`
10 |
11 | You can also use `!=`.
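For example, `up{job!="prometheus"}` returns every `up` series except those whose `job` label is exactly `prometheus`. There is also `!~` for negative regex matching.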
12 |
13 | ## Exercise
14 |
15 | Write a query to return the current value for the `memory_utilization` for the PaaS `app` named `gds-way`.
16 |
17 |
18 | ANSWER
19 |
20 | ```memory_utilization{app="gds-way"}```
21 |
22 |
23 |
24 |
25 | ------
26 |
27 | Write a query to return the current value for the `cpu` for the first instance of the PaaS app named `notify-api` running in the PaaS `production` space. (hint, use `exported_instance` not `instance`)
28 |
29 |
30 | ANSWER
31 |
32 | ```cpu{app="notify-api", space="production", exported_instance="0"}```
33 |
34 |
35 |
36 |
37 | ------
38 |
39 | BONUS: Write a query to return the current `memory_utilization` for all apps that have a name beginning with `registers` and are not running in the `sandbox` PaaS space.
40 |
41 |
42 | ANSWER
43 |
44 | ```memory_utilization{app=~"registers.*", space!="sandbox"}```
45 |
46 |
47 |
48 |
49 |
50 |
--------------------------------------------------------------------------------
/04-instant-and-range-vectors.md:
--------------------------------------------------------------------------------
1 | # Instant and range vectors
2 |
3 | * Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp
4 | * Range vector - a set of time series containing a range of data points over time for each time series
5 |
6 | It's important to understand the difference between instant and range vectors because PromQL functions require you to provide the correct type as input.
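For example, `avg_over_time()` (covered in a later section) takes a range vector: `avg_over_time(up{job="prometheus"}[2m])` is valid, whereas `avg_over_time(up{job="prometheus"})` is rejected because the argument is an instant vector.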
7 |
8 | ## Exercise
9 |
10 | Make sure you are on the `Console` tab.
11 |
12 | Query `up{job="prometheus"}`. This is an instant vector.
13 |
14 | Query `up{job="prometheus"}[2m]`. This is a range vector.
15 |
16 | Compare the two results. Can you work out how often this target is scraped? What happens if you try to graph this query?
17 |
18 | Write a query to return `up{job="prometheus"}` for a 2 hour range.
19 |
20 |
21 | ANSWER
22 |
23 | ```up{job="prometheus"}[2h]```
24 |
25 |
26 |
27 |
--------------------------------------------------------------------------------
/05-counters.md:
--------------------------------------------------------------------------------
1 | # Counters
2 |
3 | Counters are a different type of metric (compared to gauges).
4 |
5 | A counter is a cumulative metric whose value can stay the same, increase or be reset to zero. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
6 |
7 | Graph the 2xx requests for Grafana using `requests{app="grafana-paas", status_range="2xx", job="observe-paas-prometheus-exporter"}`. Extend the time period of your graph to spot the counter resets.
8 |
9 | One example of when a counter reset might happen is if your application restarts due to a deploy and loses the value of the counter. To account for counter resets when querying/graphing you will need to use a function such as `increase` or `rate`.
10 |
11 | For example, you often want to know how many 2xx requests you've had over a certain time period. For this you should use the `increase` function.
12 |
13 | ## Exercise
14 |
15 | Use the console to give you the increase of 2xx `requests` for Grafana in the last minute using `increase(requests{app="grafana-paas", status_range="2xx"}[1m])`. Now graph this query.
16 |
17 | ------
18 |
19 | Use the console to give you the increase of 2xx `requests` for Grafana in the last hour and then graph the query. Compare this graph with the per minute graph.
20 |
21 |
22 | ANSWER
23 |
24 | ```increase(requests{app="grafana-paas", status_range="2xx"}[1h])```
25 |
26 |
27 |
28 |
29 |
30 |
--------------------------------------------------------------------------------
/06-operators.md:
--------------------------------------------------------------------------------
1 | # Operators
2 |
3 | ## Binary operators
4 |
5 | Binary operators exist in Prometheus such as `+`, `-`, `*`, `/`, `^`, `%`, for example `up * 3`. These can be useful for working out percentages.
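As a small sketch, division can also be used to convert units, e.g. dividing `disk_bytes` by 1024 twice to graph disk usage in MiB rather than bytes:

```disk_bytes / 1024 / 1024```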
6 |
7 | ## Aggregation operators
8 |
9 | Aggregation operators can be used to aggregate the elements of a single instant vector. See https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
10 |
11 | Common aggregators include `sum`, `min`, `max`, `avg`, `topk`, `bottomk`, `quantile`.
12 |
13 | For example, to find the total amount of PaaS disk space currently in use across all applications we are monitoring you can use `sum(disk_bytes)`.
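`quantile` is the one aggregator in the list above that the exercises don't cover. As a quick sketch, the following returns the value that 90% of the `memory_utilization` series are at or below:

```quantile(0.9, memory_utilization)```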
14 |
15 | ## Exercise
16 |
17 | What is the current maximum value for the `cpu` metric?
18 |
19 |
20 | ANSWER
21 |
22 | ```max(cpu)```
23 |
24 |
25 |
26 |
27 | ------
28 |
29 | What is the total number of 2xx `requests` that the `notify-api` app in the `production` space has processed in the last minute?
30 |
31 | Note: if aggregating counters you MUST apply `increase` (or `rate`) before aggregating, otherwise counter resets aren't accounted for.
32 |
33 |
34 | ANSWER
35 |
36 | ```sum(increase(requests{app="notify-api", status_range="2xx", space="production"}[1m]))```
37 |
38 |
39 |
40 |
41 | ------
42 |
43 | BONUS: What are the top 3 highest values for the `memory_utilization` metric?
44 |
45 |
46 | ANSWER
47 |
48 | ```topk(3, memory_utilization)```
49 |
50 |
51 |
52 |
53 | ------
54 |
55 | BONUS: What percentage of `requests` in the last 5 minutes to the `notify-api` app in the `production` space returned a 2xx status code?
56 |
57 |
58 | ANSWER
59 |
60 | ```sum(increase(requests{app="notify-api", status_range="2xx", space="production"}[5m])) / sum(increase(requests{app="notify-api", space="production"}[5m])) * 100```
61 |
62 |
63 |
64 |
--------------------------------------------------------------------------------
/07-aggregating-by-label.md:
--------------------------------------------------------------------------------
1 | # Aggregating by label
2 |
3 | `sum(disk_bytes)` gives you total disk_bytes in use
4 |
5 | `sum(disk_bytes) by (org)` gives you the total disk_bytes in use grouped by PaaS organisation
6 |
7 | You can sum on any label and even multiple labels.
8 |
9 | `sum(disk_bytes) by (org, space)`
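The complement of `by` is `without`, which keeps every label except the ones listed. As a sketch (assuming `exported_instance` is the per-instance label, as in the earlier exercises):

```sum(disk_bytes) without (exported_instance)```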
10 |
11 | ## Exercise
12 |
13 | What is the average `memory_utilization` for each of the apps running in the `openregister` org in the `prod` space?
14 |
15 |
16 | ANSWER
17 |
18 | ```avg(memory_utilization{org="openregister", space="prod"}) by (app)```
19 |
20 | or
21 |
22 | ```avg(memory_utilization{org="openregister", space="prod"}) without (exported_instance)```
23 |
24 |
25 |
26 |
27 | ------
28 |
29 | BONUS: Return how many conduit apps are running in the PaaS (indicated by their app name including `conduit`). Hint: use a proxy metric such as `cpu` to indicate that an app is running.
30 |
31 |
32 | ANSWER
33 |
34 | ```count(cpu{app=~".*conduit.*"})``` or similar
35 |
36 |
37 |
38 |
39 |
40 |
--------------------------------------------------------------------------------
/08-functions.md:
--------------------------------------------------------------------------------
1 | # Functions
2 |
3 | Prometheus provides many helpful functions for your queries. See the Prometheus docs for the full list, but here are a few you may find most useful.
4 |
5 | ## rate
6 |
7 | `rate(v range-vector)` calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.
8 |
9 |
10 | ## avg_over_time
11 |
12 | `avg_over_time(v range-vector)`: the average value of all points in the specified interval (also known as moving average).
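For example (reusing the metric and app from an earlier exercise), the average memory utilization of the `gds-way` app over the last hour would be:

```avg_over_time(memory_utilization{app="gds-way"}[1h])```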
13 |
14 |
15 | ## Exercise
16 |
17 | In the previous exercise you worked out the average `memory_utilization` for each of the apps running in the `openregister` org in the `prod` space. Now return that instant vector sorted by value.
18 |
19 |
20 | ANSWER
21 |
22 | ```sort(avg(memory_utilization{org="openregister", space="prod"}) by (app))```
23 |
24 | or
25 |
26 | ```sort(avg(memory_utilization{org="openregister", space="prod"}) without (exported_instance))```
27 |
28 |
29 |
30 |
31 | ------
32 |
33 | What is the per-second rate of 2xx `requests` to the `grafana-paas` app based on the last 5 minutes of data?
34 |
35 |
36 | ANSWER
37 |
38 | ```rate(requests{app="grafana-paas", status_range="2xx"}[5m])```
39 |
40 |
41 |
42 |
--------------------------------------------------------------------------------
/09-histograms.md:
--------------------------------------------------------------------------------
1 | # Histograms
2 |
3 | A histogram records observations (usually things like request durations or response sizes) and counts them in buckets.
4 |
5 | A histogram with a base metric name of `<basename>` exposes multiple time series during a scrape:
6 |
7 | - cumulative counters for the observation buckets, exposed as `<basename>_bucket{le="<upper inclusive bound>"}`
8 | - the count of events that have been observed, exposed as `<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)
9 | - the total sum of all observed values, exposed as `<basename>_sum`
10 |
11 | Histogram buckets are cumulative: each bucket counts all observations less than or equal to its upper bound, so `<basename>_bucket{le="1"}` includes everything already counted in `<basename>_bucket{le="0.25"}`.
12 |
13 | The `<basename>_sum` metric is useful in combination with the `<basename>_count` metric to calculate averages over all of the recorded observations in a given time range.
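As a sketch (assuming the matching `_sum` series is exposed alongside the `_count` series used in the exercises below), the average request duration over the last 5 minutes would be:

```rate(http_server_request_duration_seconds_sum{job="pay-control-plane"}[5m]) / rate(http_server_request_duration_seconds_count{job="pay-control-plane"}[5m])```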
14 |
15 | ## Exercises
16 |
17 | Query the histogram buckets `http_server_request_duration_seconds_bucket` (the raw cumulative counts since the last counter reset) for the `pay-control-plane` job with `controller="cluster#index"` and status `code="200"`. Use this histogram for the next three exercises.
18 |
19 |
20 |
21 | How many requests took less than or equal to 1s?
22 |
23 |
24 | ANSWER
25 |
26 | ```http_server_request_duration_seconds_bucket{job="pay-control-plane", code="200",controller="cluster#index", le="1"}```
27 |
28 |
29 |
30 |
31 | ------
32 |
33 | How many requests were there in total?
34 |
35 |
36 | ANSWER
37 |
38 | ```http_server_request_duration_seconds_bucket{job="pay-control-plane", code="200",controller="cluster#index", le="+Inf"}```
39 |
40 | or
41 |
42 | ```http_server_request_duration_seconds_count{job="pay-control-plane", code="200",controller="cluster#index"}```
43 |
44 |
45 |
46 |
47 | ------
48 |
49 | How many requests took between 0.25s and 2.5s?
50 |
51 | Hint: you will find this Stack Overflow post useful - https://stackoverflow.com/questions/45005524/prometheus-promql-subtract-two-gauge-metrics
52 |
53 |
54 | ANSWER
55 |
56 | ```http_server_request_duration_seconds_bucket{job="pay-control-plane", code="200",controller="cluster#index", le="2.5"} - ignoring(le) http_server_request_duration_seconds_bucket{job="pay-control-plane", code="200",controller="cluster#index", le="0.25"}```
57 |
58 |
59 |
60 |
61 | ------
62 |
63 | BONUS: What percentage of requests in the last hour took less than 2.5s for the `grafana-paas` app as measured by the `response_time_bucket` metric for the `observe-paas-prometheus-exporter` job?
64 |
65 |
66 | ANSWER
67 |
68 | ```sum(increase(response_time_bucket{job="observe-paas-prometheus-exporter", app="grafana-paas", le="2.5"}[1h])) / sum(increase(response_time_count{job="observe-paas-prometheus-exporter", app="grafana-paas"}[1h])) * 100```
69 |
70 | or
71 |
72 | ```sum(rate(response_time_bucket{job="observe-paas-prometheus-exporter", app="grafana-paas", le="2.5"}[1h])) / sum(rate(response_time_count{job="observe-paas-prometheus-exporter", app="grafana-paas"}[1h])) * 100```
73 |
74 | or
75 |
76 | ```sum(rate(response_time_bucket{job="observe-paas-prometheus-exporter", app="grafana-paas", le="2.5"}[1h])) / sum(rate(response_time_bucket{job="observe-paas-prometheus-exporter", app="grafana-paas", le="+Inf"}[1h])) * 100```
77 |
78 |
79 |
80 |
--------------------------------------------------------------------------------
/10-summaries.md:
--------------------------------------------------------------------------------
1 | # Summaries
2 |
3 | A summary also records observations (usually things like request durations or response sizes); however, unlike a histogram, it does not put them into buckets but instead calculates quantiles directly.
4 |
5 | A summary with a base metric name of `<basename>` exposes multiple time series during a scrape:
6 |
7 | - streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as `<basename>{quantile="<φ>"}`
8 | - the count of events that have been observed, exposed as `<basename>_count`
9 | - the total sum of all observed values, exposed as `<basename>_sum`
10 |
11 | ## Exercises
12 |
13 | What was the 0.5 quantile for `prometheus_engine_query_duration_seconds{slice="inner_eval"}`?
14 |
15 |
16 | ANSWER
17 |
18 | ```prometheus_engine_query_duration_seconds{slice="inner_eval",quantile="0.5"}```
19 |
20 |
21 |
22 |
23 | ------
24 |
25 | In what time did 99% of `prometheus_engine_query_duration_seconds{slice="prepare_time"}` complete?
26 |
27 |
28 | ANSWER
29 |
30 | ```prometheus_engine_query_duration_seconds{slice="prepare_time", quantile="0.99"}```
31 |
32 |
33 |
34 |
--------------------------------------------------------------------------------
/11-set-operators-and-vector-matches.md:
--------------------------------------------------------------------------------
1 | TODO
2 |
3 | - and/or/unless
4 | - vector matches
5 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 David McDonald
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # prometheus-workshop
2 |
3 | This has been archived because:
4 |
5 | - with the retiring of the GDS Observe platform, most exercises will no longer work and would need new examples written
6 | - the workshop has not been run in a long time
7 |
8 | A 1.5 hour intro workshop to Prometheus for GDS, focussed on PromQL.
9 |
10 | ## Format
11 |
12 | Workshop to go over the fundamentals of Prometheus. Particularly relevant to teams using Prometheus provided by Reliability Engineering.
13 |
14 | It will be hands-on. You will need to be on the VPN to access Prometheus.
15 |
16 | Anyone can run this workshop. It seems to work well with up to 10 people so questions can be asked and help can be given. The workshop can also be done completely independently by just following the exercises.
17 |
--------------------------------------------------------------------------------