├── 01-intro-to-prometheus.md
├── 02-write-your-first-query.md
├── 03-filtering-with-labels.md
├── 04-instant-and-range-vectors.md
├── 05-counters.md
├── 06-operators.md
├── 07-aggregating-by-label.md
├── 08-functions.md
├── 09-histograms.md
├── 10-summaries.md
├── 11-set-operators-and-vector-matches.md
├── LICENSE
└── README.md

/01-intro-to-prometheus.md:
--------------------------------------------------------------------------------
1 | # Intro to Prometheus
2 |
3 | ## Goal
4 | - Get a basic understanding of what Prometheus is
5 | - Be confident with basic querying so you could use it to resolve an incident/create a dashboard
6 |
7 | Note: we aren’t going to focus on setting up the monitoring; we assume Prometheus is already set up and scraping your apps.
8 |
9 | ## What is Prometheus
10 |
11 | https://prometheus.io/
12 |
13 | - monitoring tool
14 | - Similar/competitor products would be Graphite, InfluxDB
15 | - time series DB
16 | - uses /metrics pages
17 | - here is an example metrics page - https://observe-paas-prometheus-exporter.cloudapps.digital/metrics
18 | - pull model for metrics rather than push model
19 | - handles alerts but not responsible for delivery of alerts (that is Alertmanager)
20 | - only basic dashboarding (use Grafana for dashboarding)
21 |
22 |
23 | ## What Reliability Engineering Observe provide
24 | - Prometheus, Alertmanager and Grafana.
25 | - Integration with Zendesk for tickets and Pagerduty for alerts.
26 | - Instructions for how to set up and use can be found in the Reliability Engineering docs.
27 | - It's being used for monitoring and alerting by RE Observe, Registers, data.gov.uk and Notify
28 |
--------------------------------------------------------------------------------
/02-write-your-first-query.md:
--------------------------------------------------------------------------------
1 | # Your first query
2 |
3 | https://prom-3.monitoring.gds-reliability.engineering/graph
4 |
5 | ## User interface
6 |
7 | - 'Help' takes you to the Prometheus docs
8 | - 'Alerts' has all the Prometheus alerts
9 | - 'Status > Targets' has all the targets that are being scraped by Prometheus
10 | - 'Graph' has a console and graph view for your metrics
11 |
12 | ## First metric
13 |
14 | `up` is a special metric added by Prometheus which represents whether a /metrics page for a target can be scraped. It is a `gauge` metric, meaning its value can go up and down. 1 represents a good scrape of a target, 0 a failed scrape of a target.
15 |
16 | ## Exercise
17 |
18 | 1. Query `up` in the Prometheus user interface.
19 | 2. Swap between the console and the graph tabs.
20 | 3. Extend the interval to 24h in the graph tab.
21 |
22 | ## Docs
23 |
24 | https://prometheus.io/docs/prometheus/latest/querying/
25 |
--------------------------------------------------------------------------------
/03-filtering-with-labels.md:
--------------------------------------------------------------------------------
1 | # Filtering with labels
2 |
3 | The query `up` returns many timeseries. You can see how many in the top right. You can filter using labels. You can use more than one label.
4 |
5 | `up{job="observe-paas-prometheus-exporter"}`
6 |
7 | You can use a regex, for example to get all timeseries for the metric `up` whose job label ends with `exporter`:
8 |
9 | `up{job=~".*exporter"}`
10 |
11 | You can also use `!=`.
12 |
13 | ## Exercise
14 |
15 | Write a query to return the current value for the `memory_utilization` for the PaaS `app` named `gds-way`.
16 |
17 |
18 | ANSWER

19 |
20 | ```memory_utilization{app="gds-way"}```
21 |
22 |

23 |
24 |
25 | ------
26 |
27 | Write a query to return the current value for the `cpu` for the first instance of the PaaS app named `notify-api` running in the PaaS `production` space. (hint, use `exported_instance` not `instance`)
28 |
29 |
30 | ANSWER

31 |
32 | ```cpu{app="notify-api", space="production", exported_instance="0"}```
33 |
34 |

35 |
36 |
37 | ------
38 |
39 | BONUS: Write a query to return the current `memory_utilization` for all apps that have a name beginning with `registers` and are not running in the `sandbox` PaaS space.
40 |
41 |
42 | ANSWER

43 |
44 | ```memory_utilization{app=~"registers.*", space!="sandbox"}```
45 |
46 |

47 |
48 |
49 |
50 |
--------------------------------------------------------------------------------
/04-instant-and-range-vectors.md:
--------------------------------------------------------------------------------
1 | # Instant and range vectors
2 |
3 | * Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp
4 | * Range vector - a set of time series containing a range of data points over time for each time series
5 |
6 | It's important to understand the difference between instant and range vectors because PromQL functions can require you to provide the correct type as input.
7 |
8 | ## Exercise
9 |
10 | Make sure you are on the `Console` tab.
11 |
12 | Query `up{job="prometheus"}`. This is an instant vector.
13 |
14 | Query `up{job="prometheus"}[2m]`. This is a range vector.
15 |
16 | Compare the difference. Can you work out how often this target is scraped? What happens if you try and graph this query?
17 |
18 | Write a query to return `up{job="prometheus"}` for a 2 hour range.
19 |
20 |
21 | ANSWER

22 |
23 | ```up{job="prometheus"}[2h]```
24 |
25 |

26 |
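
One way to approach the scrape-interval question in this exercise: the console shows each sample in a range vector together with its timestamp, so the gap between consecutive timestamps is the scrape interval. The following is a minimal Python sketch of that arithmetic. The function name and the timestamps are made up for illustration and are not taken from a real Prometheus.

```python
from statistics import median


def scrape_interval_seconds(timestamps):
    """Estimate the scrape interval as the median gap between consecutive sample timestamps."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return median(gaps)


# Hypothetical timestamps (Unix seconds) as they might appear for `up{job="prometheus"}[2m]`.
samples = [1530000000.0, 1530000030.0, 1530000060.0, 1530000090.0]
print(scrape_interval_seconds(samples))  # 30.0
```

The median is used rather than the mean so that a single missed scrape does not skew the estimate.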
27 |
--------------------------------------------------------------------------------
/05-counters.md:
--------------------------------------------------------------------------------
1 | # Counters
2 |
3 | Counters are a different type of metric (compared to gauges).
4 |
5 | A counter is a cumulative metric whose value can stay the same, increase or be reset to zero. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
6 |
7 | Graph the 2xx requests for Grafana using `requests{app="grafana-paas", status_range="2xx", job="observe-paas-prometheus-exporter"}`. Extend the time period of your graph to spot the counter resets.
8 |
9 | One example of when a counter reset might happen would be if your application restarted due to a deploy and lost the value of the counter. To mitigate counter resets when querying/graphing you will need to use a function such as `increase` or `rate`.
10 |
11 | For example, you often want to know how many 2xx requests you've had over a certain time period. For this you should use the `increase` function.
12 |
13 | ## Exercise
14 |
15 | Use the console to give you the increase of 2xx `requests` for Grafana in the last minute using `increase(requests{app="grafana-paas", status_range="2xx"}[1m])`. Now graph this query.
16 |
17 | ------
18 |
19 | Use the console to give you the increase of 2xx `requests` for Grafana in the last hour and then graph the query. Compare this graph with the per minute graph.
20 |
21 |
22 | ANSWER

23 |
24 | ```increase(requests{app="grafana-paas", status_range="2xx"}[1h])```
25 |
26 |

27 |
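
To see why a reset-aware function is needed for counters, here is a rough Python sketch of the idea behind `increase`. It is a simplification under stated assumptions: the sample values are invented, and the real PromQL function also extrapolates to the edges of the time window, which this sketch does not attempt.

```python
def increase(samples):
    """Total increase across counter samples, treating any drop in value as a reset to zero."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Counter reset: the counter restarted from zero, so the whole
            # current value counts as new increase.
            total += current
    return total


# Hypothetical raw counter values; the app restarts between 120 and 5.
raw = [100, 110, 120, 5, 15]
print(raw[-1] - raw[0])  # naive subtraction: -85
print(increase(raw))     # reset-aware: 35.0
```

Subtracting the first sample from the last gives a negative number across the restart, while the reset-aware version counts the 20 requests before the restart and the 15 after it.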
28 |
29 |
30 |
--------------------------------------------------------------------------------
/06-operators.md:
--------------------------------------------------------------------------------
1 | # Operators
2 |
3 | ## Binary operators
4 |
5 | Binary operators exist in Prometheus such as `+`, `-`, `*`, `/`, `^`, `%`, for example `up * 3`. These can be useful for working out percentages.
6 |
7 | ## Aggregation operators
8 |
9 | Aggregation operators can be used to aggregate the elements of a single instant vector. See https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
10 |
11 | Common aggregators include `sum`, `min`, `max`, `avg`, `topk`, `bottomk`, `quantile`.
12 |
13 | For example, to find the total amount of PaaS disk space currently in use for all applications we are monitoring you can use `sum(disk_bytes)`.
14 |
15 | ## Exercise
16 |
17 | What is the current maximum value for the `cpu` metric?
18 |
19 |
20 | ANSWER

21 |
22 | ```max(cpu)```
23 |
24 |

25 |
26 |
27 | ------
28 |
29 | What is the total number of 2xx `requests` that the `notify-api` app in the `production` space has processed in the last minute?
30 |
31 | Note: if aggregating counters you MUST use `increase` before aggregating, otherwise counter resets aren't accounted for.
32 |
33 |
34 | ANSWER

35 |
36 | ```sum(increase(requests{app="notify-api", status_range="2xx", space="production"}[1m]))```
37 |
38 |

39 |
40 |
41 | ------
42 |
43 | BONUS: What are the top 3 highest values for the `memory_utilization` metric?
44 |
45 |
46 | ANSWER

47 |
48 | ```topk(3, memory_utilization)```
49 |
50 |

51 |
52 |
53 | ------
54 |
55 | BONUS: What percentage of `requests` in the last 5 minutes to the `notify-api` app in the `production` space returned a 2xx status code?
56 |
57 |
58 | ANSWER

59 |
60 | ```sum(increase(requests{app="notify-api", status_range="2xx", space="production"}[5m])) / sum(increase(requests{app="notify-api", space="production"}[5m])) * 100```
61 |
62 |

63 |
64 |
--------------------------------------------------------------------------------
/07-aggregating-by-label.md:
--------------------------------------------------------------------------------
1 | # Aggregating by label
2 |
3 | `sum(disk_bytes)` gives you total disk_bytes in use
4 |
5 | `sum(disk_bytes) by (org)` gives you the total disk_bytes in use grouped by PaaS organisation
6 |
7 | You can sum on any label and even multiple labels.
8 |
9 | `sum(disk_bytes) by (org, space)`
10 |
11 | ## Exercise
12 |
13 | What is the average `memory_utilization` for each of the apps running in the `openregister` org in the `prod` space?
14 |
15 |
16 | ANSWER

17 |
18 | ```avg(memory_utilization{org="openregister", space="prod"}) by (app)```
19 |
20 | or
21 |
22 | ```avg(memory_utilization{org="openregister", space="prod"}) without (exported_instance)```
23 |
24 |

25 |
26 |
27 | ------
28 |
29 | BONUS: Return the value of how many conduit apps are running in the PaaS (indicated by their app name including `conduit`). Hint - use a proxy metric such as `cpu` to indicate an app is running.
30 |
31 |
32 | ANSWER

33 |
34 | ```count(cpu{app=~".*conduit.*"})``` or similar
35 |
36 |

37 |
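
As a rough mental model of what `sum(...) by (...)` does, the Python sketch below groups an instant vector by the chosen labels and adds up the values in each group. The `sum_by` helper, the label values and the numbers are all hypothetical; only the metric name `disk_bytes` comes from this workshop.

```python
from collections import defaultdict


def sum_by(series, labels):
    """Sum sample values, grouping by the given label names (like `sum(...) by (...)`)."""
    totals = defaultdict(float)
    for label_set, value in series:
        key = tuple(label_set.get(name, "") for name in labels)
        totals[key] += value
    return dict(totals)


# Hypothetical instant vector for `disk_bytes`: (labels, value) pairs.
disk_bytes = [
    ({"org": "org-a", "space": "prod", "app": "web"}, 100.0),
    ({"org": "org-a", "space": "prod", "app": "worker"}, 50.0),
    ({"org": "org-a", "space": "staging", "app": "web"}, 25.0),
    ({"org": "org-b", "space": "prod", "app": "api"}, 10.0),
]

print(sum_by(disk_bytes, []))       # {(): 185.0}
print(sum_by(disk_bytes, ["org"]))  # {('org-a',): 175.0, ('org-b',): 10.0}
print(sum_by(disk_bytes, ["org", "space"]))
```

With no grouping labels everything collapses into one series; each extra label in the `by` clause splits the result into more groups.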
38 |
39 |
40 |
--------------------------------------------------------------------------------
/08-functions.md:
--------------------------------------------------------------------------------
1 | # Functions
2 |
3 | Prometheus provides many helpful functions for your queries. See the Prometheus docs for all the functions but here are a few which you may find most useful.
4 |
5 | ## rate
6 |
7 | `rate(v range-vector)` calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.
8 |
9 |
10 | ## avg_over_time
11 |
12 | `avg_over_time(v range-vector)`: the average value of all points in the specified interval (also known as moving average).
13 |
14 |
15 | ## Exercise
16 |
17 | In the previous exercise you worked out the average `memory_utilization` for each of the apps running in the `openregister` org in the `prod` space. Now return that instant vector sorted by value.
18 |
19 |
20 | ANSWER

21 |
22 | ```sort(avg(memory_utilization{org="openregister", space="prod"}) by (app))```
23 |
24 | or
25 |
26 | ```sort(avg(memory_utilization{org="openregister", space="prod"}) without (exported_instance))```
27 |
28 |

29 |
30 |
31 | ------
32 |
33 | What is the per-second rate of 2xx `requests` to the `grafana-paas` app based on the last 5 minutes of data?
34 |
35 |
36 | ANSWER

37 |
38 | ```rate(requests{app="grafana-paas", status_range="2xx"}[5m])```
39 |
40 |

41 |
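
The two functions described in this file can be approximated in a few lines of Python, which may help when reasoning about what a query should return. This is only a sketch: the sample data is invented, and the real `rate` also extrapolates to the window boundaries, which is left out here.

```python
def rate(samples):
    """Per-second average increase over (timestamp, value) counter samples, handling resets."""
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset, so count from zero.
        increase += (current - previous) if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def avg_over_time(samples):
    """Mean of all sample values in the window."""
    return sum(value for _, value in samples) / len(samples)


# Hypothetical (timestamp_seconds, value) samples, 30s apart.
counter = [(0, 100), (30, 130), (60, 160), (90, 10), (120, 40)]  # reset at t=90
gauge = [(0, 0.5), (30, 0.75), (60, 1.0), (90, 0.75)]

print(rate(counter))         # (30 + 30 + 10 + 30) / 120 seconds
print(avg_over_time(gauge))  # 0.75
```

`rate` is meant for counters and `avg_over_time` for gauges, which is why the two examples use differently shaped data.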
42 |
--------------------------------------------------------------------------------
/09-histograms.md:
--------------------------------------------------------------------------------
1 | # Histograms
2 |
3 | A histogram records observations (usually things like request durations or response sizes) and counts them in buckets.
4 |
5 | A histogram with a base metric name of `<basename>` exposes multiple time series during a scrape:
6 |
7 | - cumulative counters for the observation buckets, exposed as `<basename>_bucket{le="<upper inclusive bound>"}`
8 | - the count of events that have been observed, exposed as `<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)
9 | - the total sum of all observed values, exposed as `<basename>_sum`
10 |
11 | Histogram buckets are cumulative: each bucket counts every observation less than or equal to its `le` bound.
12 |
13 | The `_sum` metric is useful in combination with the `_count` metric to calculate averages over all of the recorded observations in a given time range.
14 |
15 | ## Exercises
16 |
17 | Find the histogram of `http_server_request_duration_seconds_bucket` since the last counter resets for the `pay-control-plane` job with `controller="cluster#index"` and status `code="200"`. Use this histogram for the next three exercises.
18 |
19 |
20 |
21 | How many requests took less than or equal to 1s?
22 |
23 |
24 | ANSWER

25 |
26 | ```http_server_request_duration_seconds_bucket{job="pay-control-plane", code="200",controller="cluster#index", le="1"}```
27 |
28 |

29 |
30 |
31 | ------
32 |
33 | How many requests were there in total?
34 |
35 |
36 | ANSWER

37 |
38 | ```http_server_request_duration_seconds_bucket{job="pay-control-plane", code="200",controller="cluster#index", le="+Inf"}```
39 |
40 | or
41 |
42 | ```http_server_request_duration_seconds_count{job="pay-control-plane", code="200",controller="cluster#index"}```
43 |
44 |

45 |
46 |
47 | ------
48 |
49 | How many requests took between 0.25s and 2.5s?
50 |
51 | Hint, you will find this stack overflow post useful - https://stackoverflow.com/questions/45005524/prometheus-promql-subtract-two-gauge-metrics
52 |
53 |
54 | ANSWER

55 |
56 | ```http_server_request_duration_seconds_bucket{job="pay-control-plane", code="200",controller="cluster#index", le="2.5"} - ignoring(le) http_server_request_duration_seconds_bucket{job="pay-control-plane", code="200",controller="cluster#index", le="0.25"}```
57 |
58 |

59 |
60 |
61 | ------
62 |
63 | BONUS: What percentage of requests in the last hour took less than 2.5s for the `grafana-paas` app as measured by the `response_time_bucket` metric for the `observe-paas-prometheus-exporter` job?
64 |
65 |
66 | ANSWER

67 |
68 | ```sum(increase(response_time_bucket{job="observe-paas-prometheus-exporter", app="grafana-paas", le="2.5"}[1h])) / sum(increase(response_time_count{job="observe-paas-prometheus-exporter", app="grafana-paas"}[1h])) * 100```
69 |
70 | or
71 |
72 | ```sum(rate(response_time_bucket{job="observe-paas-prometheus-exporter", app="grafana-paas", le="2.5"}[1h])) / sum(rate(response_time_count{job="observe-paas-prometheus-exporter", app="grafana-paas"}[1h])) * 100```
73 |
74 | or
75 |
76 | ```sum(rate(response_time_bucket{job="observe-paas-prometheus-exporter", app="grafana-paas", le="2.5"}[1h])) / sum(rate(response_time_bucket{job="observe-paas-prometheus-exporter", app="grafana-paas", le="+Inf"}[1h])) * 100```
77 |
78 |

79 |
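
The histogram exercises in this file all rely on the buckets being cumulative. The Python sketch below works through the same three calculations (a count between two bounds, a percentage at or below a bound, and an average from sum and count) on invented numbers. The bucket bounds, counts and helper names are assumptions chosen for illustration only.

```python
INF = float("inf")

# Hypothetical cumulative bucket counts: `le` bound -> number of observations <= bound.
buckets = {0.25: 40, 1.0: 70, 2.5: 90, INF: 100}
observed_sum = 85.0            # what the `_sum` series would hold
observed_count = buckets[INF]  # the `_count` series equals the `+Inf` bucket


def count_between(buckets, lower, upper):
    """Observations greater than `lower` and at most `upper`, by subtracting cumulative buckets."""
    return buckets[upper] - buckets[lower]


def percent_at_or_below(buckets, bound):
    """Share of all observations that fall at or below `bound`, as a percentage."""
    return 100 * buckets[bound] / buckets[INF]


print(count_between(buckets, 0.25, 2.5))  # 50
print(percent_at_or_below(buckets, 2.5))  # 90.0
print(observed_sum / observed_count)      # 0.85
```

Because each bucket already includes everything in the smaller buckets, a range is a subtraction rather than a sum of buckets.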
80 |
--------------------------------------------------------------------------------
/10-summaries.md:
--------------------------------------------------------------------------------
1 | # Summaries
2 |
3 | A summary also records observations (usually things like request durations or response sizes); however, unlike a histogram, it does not put them into buckets but instead calculates quantiles.
4 |
5 | A summary with a base metric name of `<basename>` exposes multiple time series during a scrape:
6 |
7 | - streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as `<basename>{quantile="<φ>"}`
8 | - the count of events that have been observed, exposed as `<basename>_count`
9 | - the total sum of all observed values, exposed as `<basename>_sum`
10 |
11 | ## Exercises
12 |
13 | What was the 0.5 quantile for `prometheus_engine_query_duration_seconds{slice="inner_eval"}`?
14 |
15 |
16 | ANSWER

17 |
18 | ```prometheus_engine_query_duration_seconds{slice="inner_eval",quantile="0.5"}```
19 |
20 |

21 |
22 |
23 | ------
24 |
25 | In what time did 99% of `prometheus_engine_query_duration_seconds{slice="prepare_time"}` complete?
26 |
27 |
28 | ANSWER

29 |
30 | ```prometheus_engine_query_duration_seconds{slice="prepare_time", quantile="0.99"}```
31 |
32 |

33 |
34 |
--------------------------------------------------------------------------------
/11-set-operators-and-vector-matches.md:
--------------------------------------------------------------------------------
1 | TODO
2 |
3 | - and/or/unless
4 | - vector matches
5 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 David McDonald
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # prometheus-workshop
2 |
3 | This has been archived as:
4 |
5 | - due to the retiring of the GDS Observe platform, most exercises will no longer work and would need new examples written
6 | - the workshop has not been run in a long time
7 |
8 | 1.5 hour intro workshop to Prometheus focussed on PromQL for GDS
9 |
10 | ## format
11 |
12 | Workshop to go over the fundamentals of Prometheus. Particularly relevant to teams using Prometheus provided by Reliability Engineering.
13 |
14 | It will be hands on. You will need to be on the VPN to access Prometheus.
15 |
16 | Anyone can run this workshop. It seems to work well with up to 10 people so questions can be asked and help can be given. The workshop can also be done completely independently by just following the exercises.
17 |
--------------------------------------------------------------------------------