├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── IMG ├── 1.png ├── 1.webp ├── Prometheus-metaimage.png ├── grafana-security-login-authentication.png ├── graphic-3-.png └── img.png ├── Prometheus-lab ├── README.md └── k8s-yaml │ ├── Alertmanagerconfig.yaml │ ├── Deployment.yaml │ ├── PrometheusRule.yaml │ ├── Service-monitor.yaml │ └── Service.yaml ├── README.md ├── promQl.md ├── prometheus_setup.md └── promql-img ├── counter_example.png ├── gauge_example.png └── heatmap_histogram.png /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # **Contributor Covenant Code of Conduct** 2 | 3 | ## **Our Pledge** 4 | 5 | We as members, contributors, and leaders pledge to make participation in our project and community a **harassment-free experience** for everyone, regardless of age, body size, disability, ethnicity, gender identity, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. 6 | 7 | We pledge to act and interact in ways that contribute to an **open, welcoming, diverse, inclusive, and healthy** community. 8 | 9 | ## **Our Standards** 10 | 11 | > [!IMPORTANT] 12 | > 13 | > **Examples of behavior that contributes to a positive environment include:** 14 | > 15 | > - Being respectful and inclusive to others 16 | > - Using welcoming and inclusive language 17 | > - Giving and gracefully accepting constructive feedback 18 | > - Showing empathy towards other community members 19 | > 20 | > **Examples of unacceptable behavior include:** 21 | > 22 | > - Harassment, intimidation, or discrimination in any form 23 | > - Publishing private information of others without consent 24 | > - Use of inappropriate language, insults, or derogatory comments 25 | > - Trolling, personal attacks, or political/religious discussions 26 | 27 | ## **Our Responsibilities** 28 | 29 | > [!NOTE] 30 | > 31 | > Project maintainers are responsible for **clarifying and enforcing** the standards of acceptable behavior. They have the right to **remove, edit, or reject** comments, commits, code, issues, and other contributions that do not align with this Code of Conduct. 32 | 33 | ## **Enforcement** 34 | 35 | > [!CAUTION] 36 | > 37 | > Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at [your-email@example.com]. 38 | > Maintainers will review and investigate complaints and take appropriate action. 39 | 40 | ## **Attribution** 41 | 42 | This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), version 2.1. 43 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # 📜 **CONTRIBUTING.md** 2 | 3 | Thank you for considering contributing to **Learning Prometheus**! 🚀 4 | Your contributions are **highly appreciated** and help make this project better for everyone. 5 | 6 | --- 7 | 8 | ## 🛠 **How to Contribute?** 9 | 10 | ### 📌 **1. Fork the Repository** 11 | 12 | Click the **Fork** button on the top-right corner of this repository to create your own copy. 13 | 14 | ### 📌 **2. Clone Your Fork** 15 | 16 | Open your terminal and run: 17 | 18 | ```bash 19 | git clone https://github.com/your-username/Learning-Prometheus.git 20 | ``` 21 | 22 | Replace `your-username` with your GitHub username. 23 | 24 | ### 📌 **3. 
Navigate to the Project Directory** 25 | 26 | ```bash 27 | cd Learning-Prometheus 28 | ``` 29 | 30 | ### 📌 **4. Create a New Branch** 31 | 32 | Before making changes, create a new branch: 33 | 34 | ```bash 35 | git checkout -b feature-branch 36 | ``` 37 | 38 | Replace `feature-branch` with a relevant branch name. 39 | 40 | ### 📌 **5. Make Your Changes** 41 | 42 | Modify the code, update documentation, or add new Prometheus configurations. 43 | 44 | ### 📌 **6. Commit Your Changes** 45 | 46 | Follow best practices for writing meaningful commit messages: 47 | 48 | ```bash 49 | git commit -m "✨ Added PromQL query examples for better monitoring" 50 | ``` 51 | 52 | - Use **present-tense verbs** (e.g., "Add" instead of "Added") 53 | - Keep commit messages **concise yet descriptive** 54 | 55 | ### 📌 **7. Push the Changes** 56 | 57 | ```bash 58 | git push origin feature-branch 59 | ``` 60 | 61 | ### 📌 **8. Open a Pull Request (PR)** 62 | 63 | - Navigate to the **original repository** (NotHarshhaa/Learning-Prometheus). 64 | - Click on **New Pull Request**. 65 | - Select your forked repository and the branch you worked on. 66 | - Write a **clear PR description** explaining your changes. 67 | - Submit the PR for review. 🎉 68 | 69 | --- 70 | 71 | ## 📝 **Contribution Guidelines** 72 | 73 | ✔️ **Follow proper code formatting and best practices.** 74 | ✔️ **Add meaningful commit messages.** 75 | ✔️ **Write detailed PR descriptions explaining the changes.** 76 | ✔️ **Ensure new contributions are well-documented.** 77 | ✔️ **Test your code before submitting a PR.** 78 | ✔️ **For documentation updates, maintain proper formatting and clarity.** 79 | 80 | --- 81 | 82 | ## 🐞 **Reporting Issues** 83 | 84 | If you find a bug, have a feature request, or want to suggest an improvement: 85 | 86 | 1. **Check existing issues** to avoid duplicates. 87 | 2. Open a **new issue** with: 88 | - A **descriptive title** 89 | - Steps to **reproduce the bug** 90 | - Expected vs. actual behavior 91 | - Any **screenshots or logs** (if applicable) 92 | 93 | --- 94 | 95 | ## 🌎 **Community & Support** 96 | 97 | 👥 **Join our discussion and ask questions in our Telegram community:** 98 | 📢 [Join Telegram](https://t.me/prodevopsguy) 99 | 100 | 💡 **Follow me on GitHub for more DevOps content:** 101 | ⭐ [GitHub Profile](https://github.com/NotHarshhaa) 102 | 103 | --- 104 | 105 | ## 🙌**Acknowledgments** 106 | 107 | This project is maintained by **[Harshhaa](https://github.com/NotHarshhaa)**. 108 | Thank you for being part of the **DevOps & Prometheus** community! 💙 109 | 110 | --- 111 | 112 | ### ✅ **Now you are ready to contribute! Happy coding! 
🚀** 113 | -------------------------------------------------------------------------------- /IMG/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/1.png -------------------------------------------------------------------------------- /IMG/1.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/1.webp -------------------------------------------------------------------------------- /IMG/Prometheus-metaimage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/Prometheus-metaimage.png -------------------------------------------------------------------------------- /IMG/grafana-security-login-authentication.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/grafana-security-login-authentication.png -------------------------------------------------------------------------------- /IMG/graphic-3-.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/graphic-3-.png -------------------------------------------------------------------------------- /IMG/img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/img.png -------------------------------------------------------------------------------- /Prometheus-lab/README.md: -------------------------------------------------------------------------------- 1 | - [Real-World Prometheus Deployment: A Practical Guide for Kubernetes Monitoring](#real-world-prometheus-deployment-a-practical-guide-for-kubernetes-monitoring) 2 | - [Aim of the Project](#aim-of-the-project) 3 | - [Project Architecture](#project-architecture) 4 | - [Prerequisites](#prerequisites) 5 | - [Summary of What We Achieved](#summary-of-what-we-achieved) 6 | - [Understanding Kubernetes Resources](#understanding-kubernetes-resources) 7 | - [Deployment](#deployment) 8 | - [API Version and Kind](#api-version-and-kind) 9 | - [Metadata](#metadata) 10 | - [Specification (`spec`)](#specification-spec) 11 | - [Selector](#selector) 12 | - [Template](#template) 13 | - [Pod Specification (`spec` inside the template)](#pod-specification-spec-inside-the-template) 14 | - [Services](#services) 15 | - [API Version and Kind](#api-version-and-kind-1) 16 | - [Metadata](#metadata-1) 17 | - [Specification (`spec`)](#specification-spec-1) 18 | - [ServiceMonitor](#servicemonitor) 19 | - [API Version and Kind](#api-version-and-kind-2) 20 | - [Metadata](#metadata-2) 21 | - [Specification (`spec`)](#specification-spec-2) 22 | - [PrometheusRules](#prometheusrules) 23 | - [API Version and Kind](#api-version-and-kind-3) 24 | - [Metadata](#metadata-3) 25 | - [Specification (`spec`)](#specification-spec-3) 26 | - [AlertmanagerConfig](#alertmanagerconfig) 27 | - [API Version and Kind](#api-version-and-kind-4) 28 | - [Metadata](#metadata-4) 29 | - [Specification 
(`spec`)](#specification-spec-4) 30 | - [Author & Community](#author--community) 31 | 32 | --- 33 | 34 | # **Real-World Prometheus Deployment: A Practical Guide for Kubernetes Monitoring** 35 | 36 | ## **Aim of the Project** 37 | 38 | The primary goal of this **Prometheus Lab** project is to provide **hands-on experience** in setting up a **Prometheus monitoring system** on a **Kubernetes cluster**. 39 | 40 | By following this guide, you will: 41 | ✅ Deploy **Prometheus** for real-time monitoring. 42 | ✅ Understand **Kubernetes monitoring architecture**. 43 | ✅ Set up **Grafana** for data visualization. 44 | ✅ Configure **Alertmanager** for proactive notifications. 45 | 46 | --- 47 | 48 | ## **Project Architecture** 49 | 50 | Below is a high-level architecture of the Prometheus monitoring setup: 51 | 52 | ![](../IMG/graphic-3-.png) 53 | 54 | --- 55 | 56 | ## **Prerequisites** 57 | 58 | Before we begin, ensure you have the following tools installed: 59 | 60 | - **`kubectl`** → To interact with the Kubernetes cluster. 61 | - **`Helm`** → For deploying Prometheus using Helm charts. 62 | - **`k3d`** → A lightweight Kubernetes distribution for local testing. 63 | 64 | --- 65 | 66 | ## **📌 Step 1: Install `k3d` (Lightweight Kubernetes)** 67 | 68 | To create a **local Kubernetes cluster**, install `k3d` with: 69 | 70 | ```bash 71 | curl -s https://raw.githubusercontent.com/rancher/k3d/main/install.sh | bash 72 | ``` 73 | 74 | Verify the installation: 75 | 76 | ```bash 77 | k3d --version 78 | ``` 79 | 80 | --- 81 | 82 | ## **📌 Step 2: Clone the GitHub Repository** 83 | 84 | All the necessary **YAML manifests** and configurations can be found in my GitHub repository: 85 | 86 | 🔗 **GitHub Repo:** 87 | 88 | ```text 89 | https://github.com/panchanandevops/Learning-Prometheus.git 90 | ``` 91 | 92 | Clone the repository for easy access: 93 | 94 | ```bash 95 | git clone https://github.com/panchanandevops/Learning-Prometheus.git 96 | cd Learning-Prometheus 97 | ``` 98 | 99 | --- 100 | 101 | ## **📌 Step 3: Create a Namespace for Monitoring** 102 | 103 | All monitoring components should be deployed in a dedicated **namespace**. 104 | 105 | ```bash 106 | kubectl create namespace monitoring 107 | ``` 108 | 109 | Verify the namespace: 110 | 111 | ```bash 112 | kubectl get namespaces 113 | ``` 114 | 115 | --- 116 | 117 | ## **📌 Step 4: Add the Prometheus Helm Repository** 118 | 119 | We will use Helm to deploy Prometheus and related components. 120 | 121 | ```bash 122 | helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 123 | helm repo update 124 | ``` 125 | 126 | This ensures we fetch the **latest** chart versions. 127 | 128 | --- 129 | 130 | ## **📌 Step 5: Store Default Helm Values** 131 | 132 | Before installing Prometheus, save the default configuration for customization. 133 | 134 | ```bash 135 | helm show values prometheus-community/kube-prometheus-stack > values.yaml 136 | ``` 137 | 138 | This file (`values.yaml`) contains settings for Prometheus, Grafana, and Alertmanager. 
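As a small, hedged illustration, a trimmed `values.yaml` override might adjust only a couple of settings before installation (the exact key paths can vary between chart versions, so verify them against the file you just generated):

```yaml
# Illustrative override for kube-prometheus-stack (verify key paths in your generated values.yaml)
prometheus:
  prometheusSpec:
    retention: 10d            # keep metrics for 10 days instead of the chart default
grafana:
  adminPassword: "changeme"   # placeholder; set a strong password or reference an existing secret
```

You can then pass the customized file during Step 6 with `helm install prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`.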
139 | 140 | --- 141 | 142 | ## **📌 Step 6: Install Prometheus Stack using Helm** 143 | 144 | Now, install the **kube-prometheus-stack** Helm chart in the `monitoring` namespace: 145 | 146 | ```bash 147 | helm install prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring 148 | ``` 149 | 150 | 🚀 **This deploys:** 151 | 152 | - **Prometheus** (for metrics collection) 153 | - **Grafana** (for visualization) 154 | - **Alertmanager** (for alert handling) 155 | 156 | --- 157 | 158 | ## **📌 Step 7: Verify the Deployment** 159 | 160 | Check if the monitoring components are running: 161 | 162 | ```bash 163 | kubectl get pods -n monitoring 164 | ``` 165 | 166 | You should see multiple pods for **Prometheus, Grafana, and Alertmanager** in a `Running` state. 167 | 168 | --- 169 | 170 | ## **📌 Step 8: Access the Prometheus Dashboard** 171 | 172 | Prometheus exposes metrics and allows querying via its web UI. 173 | 174 | To access it locally, run: 175 | 176 | ```bash 177 | kubectl port-forward svc/prometheus-stack-prometheus -n monitoring 9090:9090 178 | ``` 179 | 180 | Now, open **[http://localhost:9090](http://localhost:9090)** in your browser. 181 | 182 | ![](../IMG/1.png) 183 | 184 | --- 185 | 186 | ## **📌 Step 9: Access the Grafana Dashboard** 187 | 188 | Grafana provides a beautiful UI to visualize the metrics collected by Prometheus. 189 | 190 | To access it locally, run: 191 | 192 | ```bash 193 | kubectl port-forward svc/prometheus-stack-grafana -n monitoring 8080:80 194 | ``` 195 | 196 | Now, open **[http://localhost:8080](http://localhost:8080)** in your browser. 197 | 198 | ![](../IMG/grafana-security-login-authentication.png) 199 | 200 | --- 201 | 202 | ## **📌 Step 10: Login to Grafana** 203 | 204 | Grafana uses **default credentials**: 205 | 206 | - **Username:** `admin` 207 | - **Password:** Retrieve the password using: 208 | 209 | ```bash 210 | kubectl get secret prometheus-stack-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 --decode ; echo 211 | ``` 212 | 213 | - Copy the password and **log in** to Grafana. 214 | 215 | --- 216 | 217 | ## **📌 Step 11: Configure `values.yaml` for AlertmanagerConfig** 218 | 219 | By default, Prometheus does not automatically pick up **AlertmanagerConfig** CRDs. 220 | 221 | To enable it, **edit `values.yaml`** and search for `alertmanagerConfigSelector`. 222 | 223 | Replace that section with: 224 | 225 | ```yaml 226 | alertmanagerConfigSelector: 227 | matchLabels: 228 | release: prometheus 229 | ``` 230 | 231 | This ensures **custom alerting rules** are applied. 232 | 233 | --- 234 | 235 | ## **📌 Step 12: Apply Kubernetes YAML Manifests** 236 | 237 | Once the setup is complete, apply all the necessary Kubernetes resources: 238 | 239 | ```bash 240 | kubectl apply -f /k8s-yaml/ 241 | ``` 242 | 243 | This will configure: 244 | ✅ **ServiceMonitor** (for scraping custom metrics). 245 | ✅ **PrometheusRules** (for setting up alert conditions). 246 | ✅ **AlertmanagerConfig** (for sending notifications). 
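As a quick sanity check (a hedged sketch — the resource kinds come from the Prometheus Operator CRDs installed by the chart, and the object names come from the manifests in `k8s-yaml/`), confirm the custom resources were created:

```bash
# List the Prometheus Operator custom resources across all namespaces
kubectl get servicemonitors,prometheusrules,alertmanagerconfigs -A
```

If the objects show up and carry the `release: prometheus` label, the Operator should discover them on its next reconciliation.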
247 | 248 | --- 249 | 250 | ## **Summary of What We Achieved** 251 | 252 | | **Step** | **Action** | 253 | |----------|------------| 254 | | 🛠 **Setup** | Installed `k3d`, `kubectl`, `Helm` | 255 | | 📥 **Downloaded** | Cloned GitHub repo | 256 | | 📦 **Created Namespace** | `monitoring` | 257 | | 🔹 **Added Helm Repo** | `prometheus-community` | 258 | | 📜 **Saved Config** | Stored `values.yaml` | 259 | | 🚀 **Deployed Stack** | Installed `kube-prometheus-stack` | 260 | | 📊 **Accessed Dashboards** | Prometheus & Grafana | 261 | | ⚙ **Configured Alerts** | Modified `values.yaml` | 262 | | 📌 **Applied Manifests** | `kubectl apply -f k8s-yaml/` | 263 | 264 | 🚀 **Now you have a fully functional Prometheus monitoring setup on Kubernetes!** 265 | 266 | --- 267 | 268 | # **Understanding Kubernetes Resources** 269 | 270 | ## **Deployment** 271 | 272 | A **Deployment** in Kubernetes is used to ensure that a set of identical **pods** (containers running your application) are always running. It allows you to easily scale your application, update it without downtime, and recover from failures. 273 | 274 | Now, let’s break down the **Deployment YAML file** step by step. 275 | 276 | --- 277 | 278 | ### **API Version and Kind** 279 | 280 | ```yaml 281 | apiVersion: apps/v1 282 | kind: Deployment 283 | ``` 284 | 285 | - `apiVersion: apps/v1` → This specifies the **API version** of Kubernetes being used. 286 | - `kind: Deployment` → This tells Kubernetes that we are creating a **Deployment** resource. 287 | 288 | A **Deployment** helps in managing a set of pods by ensuring they stay available and can be updated smoothly. 289 | 290 | --- 291 | 292 | ### **Metadata** 293 | 294 | ```yaml 295 | metadata: 296 | name: my-deployment 297 | labels: 298 | app: api 299 | ``` 300 | 301 | - `name: my-deployment` → The name of the Deployment. It must be unique within the namespace. 302 | - `labels:` 303 | - `app: api` → A label assigned to this Deployment. Labels help **group, filter, and identify** Kubernetes resources. 304 | 305 | 📌 **Why labels?** 306 | Labels allow you to easily select and manage Kubernetes objects. For example, we can list all pods that belong to this Deployment using: 307 | 308 | ```bash 309 | kubectl get pods -l app=api 310 | ``` 311 | 312 | --- 313 | 314 | ### **Specification (`spec`)** 315 | 316 | This section defines **how** the Deployment should behave. 317 | 318 | --- 319 | 320 | #### **Selector** 321 | 322 | The **selector** tells Kubernetes **which pods** this Deployment should manage. 323 | 324 | ```yaml 325 | selector: 326 | matchLabels: 327 | app: api 328 | ``` 329 | 330 | - This means the Deployment will look for pods with the label **`app: api`**. 331 | - Only these labeled pods will be controlled by this Deployment. 332 | 333 | 📌 **Why is this needed?** 334 | Because a Kubernetes cluster may have **many** Deployments and pods, the **selector** ensures that only the right pods are managed by this Deployment. 335 | 336 | --- 337 | 338 | #### **Template** 339 | 340 | The **template** defines how new pods should be created when the Deployment starts or scales up. 341 | 342 | ```yaml 343 | template: 344 | metadata: 345 | labels: 346 | app: api 347 | spec: 348 | ``` 349 | 350 | - `metadata:` 351 | - `labels: app: api` → The pod will have the same label as the Deployment. 352 | - `spec:` → This is where we define what should **run inside the pod** (the containers). 
353 | 354 | 📌 **Why is this needed?** 355 | When Kubernetes creates new pods using this Deployment, it **ensures** that every pod gets the same labels and configurations. 356 | 357 | --- 358 | 359 | #### **Pod Specification (`spec` inside the template)** 360 | 361 | Now, let’s define the actual **container** that runs inside the pod. 362 | 363 | ```yaml 364 | containers: 365 | - name: my-container 366 | image: panchanandevops/myexpress:v0.1.0 367 | resources: 368 | limits: 369 | memory: "128Mi" 370 | cpu: "500m" 371 | ports: 372 | - containerPort: 3000 373 | ``` 374 | 375 | - **`containers:`** → A pod can run **one or more** containers. In this case, we have one container named **`my-container`**. 376 | - **`image: panchanandevops/myexpress:v0.1.0`** → This is the **Docker image** that will be pulled from Docker Hub or a private registry. 377 | - **Resource Limits (`resources:`)** 378 | - `memory: "128Mi"` → The container can use a maximum of **128 MiB of RAM**. 379 | - `cpu: "500m"` → The container can use up to **0.5 CPU cores** (500 milliCPU). 380 | - **Port (`ports:`)** 381 | - `containerPort: 3000` → This means the container **listens** for requests on port `3000`. 382 | 383 | 📌 **Why are resource limits important?** 384 | Setting **resource limits** prevents a single container from consuming all system resources, ensuring **fair resource distribution** among all containers in the cluster. 385 | 386 | --- 387 | 388 | ### **Services** 389 | 390 | A **Service** in Kubernetes is used to expose a set of pods as a **network service**. Even if pods are created and destroyed, the Service ensures that requests always reach the correct backend pods. 391 | 392 | Now, let’s break down the **Service YAML file** step by step. 393 | 394 | --- 395 | 396 | ### **API Version and Kind** 397 | 398 | ```yaml 399 | apiVersion: v1 400 | kind: Service 401 | ``` 402 | 403 | - `apiVersion: v1` → Specifies the API version used for defining the Service. 404 | - `kind: Service` → Tells Kubernetes that we are creating a **Service** resource. 405 | 406 | 📌 **Why do we need a Service?** 407 | Pods have **dynamic IP addresses**, which means their IPs can change when they restart. A **Service provides a stable IP and DNS name** to ensure that traffic always reaches the right pods, even if they get recreated. 408 | 409 | --- 410 | 411 | ### **Metadata** 412 | 413 | ```yaml 414 | metadata: 415 | name: my-service 416 | labels: 417 | job: node-api 418 | app: api 419 | ``` 420 | 421 | - `name: my-service` → The unique name of the Service. 422 | - `labels:` 423 | - `job: node-api` → This label is used to categorize the Service. 424 | - `app: api` → This label helps in grouping and managing resources. 425 | 426 | 📌 **Why use labels?** 427 | Labels help in organizing and selecting Kubernetes resources. For example, we can find all services related to `app: api` using: 428 | 429 | ```bash 430 | kubectl get services -l app=api 431 | ``` 432 | 433 | --- 434 | 435 | ### **Specification (`spec`)** 436 | 437 | The **spec** defines how the Service will behave. 438 | 439 | ```yaml 440 | spec: 441 | type: ClusterIP 442 | selector: 443 | app: api 444 | ports: 445 | - name: web 446 | protocol: TCP 447 | port: 3000 448 | targetPort: 3000 449 | ``` 450 | 451 | --- 452 | 453 | ### **Breaking Down the Service Spec** 454 | 455 | #### **1️⃣ Service Type (`type`)** 456 | 457 | ```yaml 458 | type: ClusterIP 459 | ``` 460 | 461 | - **ClusterIP (default)** → Makes the Service accessible **only within the cluster**. 
462 | - Other types of Services: 463 | - **NodePort** → Exposes the Service on a port on each node. 464 | - **LoadBalancer** → Provides an external IP via a cloud provider's load balancer. 465 | - **ExternalName** → Maps the Service to an external DNS name. 466 | 467 | 📌 **Why use ClusterIP?** 468 | If the Service is meant for **internal communication** (e.g., backend APIs talking to each other), ClusterIP is the best choice. 469 | 470 | --- 471 | 472 | #### **2️⃣ Selector (`selector`)** 473 | 474 | ```yaml 475 | selector: 476 | app: api 477 | ``` 478 | 479 | - The **selector** ensures that this Service sends traffic to pods that have the label `app: api`. 480 | - Only these pods will receive requests from this Service. 481 | 482 | 📌 **Why is this needed?** 483 | Because Kubernetes may have **many pods**, we need a way to **match** the correct ones for the Service to route traffic. 484 | 485 | --- 486 | 487 | #### **3️⃣ Ports (`ports`)** 488 | 489 | ```yaml 490 | ports: 491 | - name: web 492 | protocol: TCP 493 | port: 3000 494 | targetPort: 3000 495 | ``` 496 | 497 | - `name: web` → A **name** for the port (useful for debugging and monitoring). 498 | - `protocol: TCP` → Specifies the network protocol used (default is TCP). 499 | - `port: 3000` → The port **exposed by the Service** (used by other services to connect). 500 | - `targetPort: 3000` → The **port inside the pod** where the application is running. 501 | 502 | 📌 **Why is `targetPort` needed?** 503 | Sometimes, the Service port (`port: 3000`) and the pod's container port (`targetPort: 3000`) **can be different**. Kubernetes maps the incoming request from the Service port to the correct container port. 504 | 505 | --- 506 | 507 | ## **ServiceMonitor** 508 | 509 | A **ServiceMonitor** is a **custom resource** used by Prometheus **to discover and scrape metrics from Kubernetes services**. Instead of manually configuring Prometheus to collect metrics from services, we define a **ServiceMonitor** that dynamically discovers the right endpoints. 510 | 511 | Now, let’s break down the **ServiceMonitor YAML file** step by step. 512 | 513 | --- 514 | 515 | ### **API Version and Kind** 516 | 517 | ```yaml 518 | apiVersion: monitoring.coreos.com/v1 519 | kind: ServiceMonitor 520 | ``` 521 | 522 | - `apiVersion: monitoring.coreos.com/v1` → This API is specific to **Prometheus Operator**, which manages monitoring in Kubernetes. 523 | - `kind: ServiceMonitor` → Defines this resource as a **ServiceMonitor**, which tells Prometheus where to collect metrics from. 524 | 525 | 📌 **Why do we need a ServiceMonitor?** 526 | Prometheus does not automatically know which services to monitor. A **ServiceMonitor automatically finds and scrapes metrics from matching services** in Kubernetes. 527 | 528 | --- 529 | 530 | ### **Metadata** 531 | 532 | ```yaml 533 | metadata: 534 | name: api-service-monitor 535 | labels: 536 | release: prometheus 537 | app: prometheus 538 | ``` 539 | 540 | - `name: api-service-monitor` → The name of the ServiceMonitor. 541 | - `labels:` 542 | - `release: prometheus` → Associates this monitor with a specific Prometheus instance. 543 | - `app: prometheus` → Indicates that this ServiceMonitor is part of the Prometheus monitoring setup. 544 | 545 | 📌 **Why use labels here?** 546 | 547 | - Prometheus Operator uses labels to **discover ServiceMonitors** that it should scrape. 548 | - If the Prometheus instance is deployed with `release: prometheus`, it will only pick up ServiceMonitors with the **same label**. 
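If you are unsure which labels your Prometheus instance expects, one hedged way to check (assuming the `monitoring` namespace and the Helm release from the earlier steps, and that only one Prometheus object exists) is to read the selector directly from the Prometheus custom resource:

```bash
# Show which ServiceMonitor labels this Prometheus instance will match
kubectl -n monitoring get prometheus -o jsonpath='{.items[0].spec.serviceMonitorSelector}'; echo
```

Whatever `matchLabels` this prints are the labels your ServiceMonitor must carry.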
549 | 550 | --- 551 | 552 | ### **Specification (`spec`)** 553 | 554 | The **spec** defines how Prometheus should scrape metrics from services. 555 | 556 | ```yaml 557 | spec: 558 | jobLabel: job 559 | selector: 560 | matchLabels: 561 | app: api 562 | endpoints: 563 | - port: web 564 | path: /swagger-stats/metrics 565 | ``` 566 | 567 | --- 568 | 569 | ### **Breaking Down the ServiceMonitor Spec** 570 | 571 | #### **1️⃣ Job Label (`jobLabel`)** 572 | 573 | ```yaml 574 | jobLabel: job 575 | ``` 576 | 577 | - Specifies that the **job name** for Prometheus should be taken from the `job` label in the Service. 578 | 579 | 📌 **Why is this needed?** 580 | In Prometheus, each **scraped target** (like a service) is associated with a **job name**. This helps in **grouping metrics** for easier analysis. 581 | 582 | --- 583 | 584 | #### **2️⃣ Selector (`selector`)** 585 | 586 | ```yaml 587 | selector: 588 | matchLabels: 589 | app: api 590 | ``` 591 | 592 | - The **selector** ensures that the ServiceMonitor only scrapes **services** that have the label `app: api`. 593 | - It **filters** which services should be monitored. 594 | 595 | 📌 **How does this work?** 596 | If we have a **Kubernetes Service** defined like this: 597 | 598 | ```yaml 599 | metadata: 600 | labels: 601 | app: api 602 | ``` 603 | 604 | Then, the ServiceMonitor will find this service and scrape its metrics. 605 | 606 | --- 607 | 608 | #### **3️⃣ Endpoints (`endpoints`)** 609 | 610 | ```yaml 611 | endpoints: 612 | - port: web 613 | path: /swagger-stats/metrics 614 | ``` 615 | 616 | - `port: web` → Specifies **which port** of the service should be used for scraping metrics. 617 | - `path: /swagger-stats/metrics` → Specifies the **URL path** where Prometheus can fetch metrics. 618 | 619 | 📌 **Why do we need `endpoints`?** 620 | A service may have multiple ports, but only **one of them exposes Prometheus metrics**. This section tells Prometheus **exactly where to look**. 621 | 622 | --- 623 | 624 | ## **PrometheusRules** 625 | 626 | A **PrometheusRule** is a **custom resource** that defines **alerting and recording rules** for Prometheus. It helps in setting up automated **alerts** based on specific conditions in your metrics. 627 | 628 | Now, let’s break down the **PrometheusRules YAML file** step by step. 629 | 630 | --- 631 | 632 | ### **API Version and Kind** 633 | 634 | ```yaml 635 | apiVersion: monitoring.coreos.com/v1 636 | kind: PrometheusRule 637 | ``` 638 | 639 | - `apiVersion: monitoring.coreos.com/v1` → This API is part of the **Prometheus Operator**, which manages monitoring rules in Kubernetes. 640 | - `kind: PrometheusRule` → Defines this resource as a **PrometheusRule**, which contains alerting and recording rules. 641 | 642 | 📌 **Why do we need PrometheusRules?** 643 | 644 | - To **automatically trigger alerts** when specific conditions are met. 645 | - To **define recording rules** that precompute expensive queries for better performance. 646 | 647 | --- 648 | 649 | ### **Metadata** 650 | 651 | ```yaml 652 | metadata: 653 | name: api-prometheus-rule 654 | labels: 655 | release: prometheus 656 | ``` 657 | 658 | - `name: api-prometheus-rule` → The name of this rule set. 659 | - `labels:` 660 | - `release: prometheus` → Associates this rule with the Prometheus instance. 661 | 662 | 📌 **Why use labels here?** 663 | Prometheus Operator looks for **PrometheusRule** objects that match a Prometheus instance using **labels**. If your Prometheus setup is using `release: prometheus`, it will only load rules with the **same label**. 
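To verify that a PrometheusRule was actually loaded (a hedged check, assuming the port-forward from Step 8 is still running and `jq` is installed), you can query Prometheus' rules API:

```bash
# List the names of all rule groups currently loaded by Prometheus
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
```

If your group name appears in the output, the rule was discovered and is being evaluated.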
664 | 665 | --- 666 | 667 | ### **Specification (`spec`):** 668 | 669 | The `spec` section defines **alerting rules** that tell Prometheus when to trigger alerts. 670 | 671 | ```yaml 672 | spec: 673 | groups: 674 | - name: api 675 | rules: 676 | - alert: down 677 | expr: up == 0 678 | for: 0m 679 | labels: 680 | severity: Critical 681 | annotations: 682 | summary: Prometheus target missing {{$labels.instance}} 683 | ``` 684 | 685 | --- 686 | 687 | ### **Breaking Down the PrometheusRules Spec** 688 | 689 | #### **1️⃣ Groups (`groups`)** 690 | 691 | ```yaml 692 | groups: 693 | - name: api 694 | ``` 695 | 696 | - A **group** is a collection of rules. 697 | - `name: api` → The **name of the rule group**, which helps in organizing alerts. 698 | 699 | 📌 **Why use rule groups?** 700 | 701 | - **Groups allow efficient rule evaluation** by executing all rules in the same group at once. 702 | - Helps in **categorizing** rules based on different services (e.g., `api`, `database`, `network`). 703 | 704 | --- 705 | 706 | #### **2️⃣ Rules (`rules`)** 707 | 708 | Each group contains **one or more rules** that define alerting conditions. 709 | 710 | ```yaml 711 | rules: 712 | - alert: down 713 | expr: up == 0 714 | for: 0m 715 | labels: 716 | severity: Critical 717 | annotations: 718 | summary: Prometheus target missing {{$labels.instance}} 719 | ``` 720 | 721 | ##### **🔹 Alert Name** 722 | 723 | ```yaml 724 | - alert: down 725 | ``` 726 | 727 | - `alert: down` → This is the **name of the alert** that will be triggered when the condition is met. 728 | 729 | --- 730 | 731 | ##### **🔹 Alert Expression (`expr`)** 732 | 733 | ```yaml 734 | expr: up == 0 735 | ``` 736 | 737 | - The **expression** determines when the alert should trigger. 738 | - `up == 0` → The **`up`** metric in Prometheus indicates whether a target is reachable. 739 | - `1` → The target is **healthy**. 740 | - `0` → The target is **down**. 741 | - This rule **triggers an alert if the target is down**. 742 | 743 | 📌 **How does this work?** 744 | If an application crashes or a service becomes unreachable, the **`up` metric becomes 0**, and this alert fires. 🚨 745 | 746 | --- 747 | 748 | ##### **🔹 Alert Duration (`for`)** 749 | 750 | ```yaml 751 | for: 0m 752 | ``` 753 | 754 | - Specifies how long the condition must hold **before triggering the alert**. 755 | - `0m` → The alert fires **immediately** when the condition is met. 756 | - You can set a delay (e.g., `5m`) to prevent **flapping alerts** (temporary issues that resolve quickly). 757 | 758 | 📌 **Example:** 759 | 760 | - `for: 5m` → Only triggers if `up == 0` for **5 continuous minutes**. 761 | - `for: 0m` → Triggers **instantly** when `up == 0`. 762 | 763 | --- 764 | 765 | ##### **🔹 Labels (`labels`)** 766 | 767 | ```yaml 768 | labels: 769 | severity: Critical 770 | ``` 771 | 772 | - Labels **add metadata** to alerts. 773 | - `severity: Critical` → Marks this alert as **Critical**, helping categorize alerts by urgency. 774 | 775 | 📌 **Why use labels?** 776 | 777 | - Allows **grouping and filtering alerts** in monitoring dashboards like Grafana. 778 | - Helps **alert managers** route notifications (e.g., send `Critical` alerts via SMS and `Warning` alerts via email). 779 | 780 | --- 781 | 782 | ##### **🔹 Annotations (`annotations`)** 783 | 784 | ```yaml 785 | annotations: 786 | summary: Prometheus target missing {{$labels.instance}} 787 | ``` 788 | 789 | - **Annotations provide extra information** about the alert. 
790 | - `summary: Prometheus target missing {{$labels.instance}}` 791 | - `{{$labels.instance}}` → Inserts the instance name (e.g., `my-service:9090`). 792 | - The final alert message might look like: 793 | - **"Prometheus target missing my-service:9090"** 794 | 795 | 📌 **Why use annotations?** 796 | 797 | - Helps **provide context** in alerting tools like **Alertmanager, Slack, or PagerDuty**. 798 | - Reduces the need for manual investigation when an alert fires. 799 | 800 | --- 801 | 802 | ## **AlertmanagerConfig** 803 | 804 | An **AlertmanagerConfig** is a **custom resource** that helps define how **Alertmanager** handles alerts sent by **Prometheus**. It **routes alerts** to different receivers (e.g., email, Slack, PagerDuty) based on their **severity, labels, or conditions**. 805 | 806 | Now, let’s break down the **AlertmanagerConfig YAML file** step by step. 807 | 808 | --- 809 | 810 | ### **API Version and Kind** 811 | 812 | ```yaml 813 | apiVersion: monitoring.coreos.com/v1 814 | kind: AlertmanagerConfig 815 | ``` 816 | 817 | - `apiVersion: monitoring.coreos.com/v1` → Uses the **Prometheus Operator API** for managing alerting configurations. 818 | - `kind: AlertmanagerConfig` → Defines this resource as an **AlertmanagerConfig**, which specifies routing rules for alerts. 819 | 820 | 📌 **Why do we need AlertmanagerConfig?** 821 | 822 | - It allows **custom alert routing** to different teams based on alert **severity** or **labels**. 823 | - Helps **avoid alert fatigue** by grouping and delaying alerts instead of spamming notifications. 824 | 825 | --- 826 | 827 | ### **Metadata** 828 | 829 | ```yaml 830 | metadata: 831 | name: alertmanager-config 832 | labels: 833 | release: prometheus 834 | ``` 835 | 836 | - `name: alertmanager-config` → The name of this Alertmanager configuration. 837 | - `labels:` 838 | - `release: prometheus` → Associates this configuration with a **Prometheus** instance. 839 | 840 | 📌 **Why use labels here?** 841 | Labels ensure that **only the correct Prometheus instance** picks up this configuration. 842 | 843 | --- 844 | 845 | ### **Specification (`spec`)**: 846 | 847 | The `spec` section defines **how alerts are grouped, delayed, and routed** to different receivers (e.g., email, Slack, webhook). 848 | 849 | --- 850 | 851 | ### **1️⃣ Route Configuration** 852 | 853 | ```yaml 854 | spec: 855 | route: 856 | groupBy: ["severity"] 857 | groupWait: 30s 858 | groupInterval: 5m 859 | repeatInterval: 12h 860 | receiver: "team-notifications" 861 | ``` 862 | 863 | This section **controls how alerts are grouped and routed** to receivers. 864 | 865 | #### **🔹 `groupBy`** 866 | 867 | ```yaml 868 | groupBy: ["severity"] 869 | ``` 870 | 871 | - Groups alerts based on their **severity** (e.g., Critical, Warning). 872 | - Instead of sending separate alerts for each **Critical** issue, it **bundles them together** into one notification. 873 | 874 | 📌 **Example:** 875 | 876 | - Instead of sending **5 separate Critical alerts**, it sends **1 grouped alert** with all details. 877 | 878 | --- 879 | 880 | #### **🔹 `groupWait`** 881 | 882 | ```yaml 883 | groupWait: 30s 884 | ``` 885 | 886 | - **Waits for 30 seconds** before sending the first alert. 887 | - Helps **reduce noise** by allowing similar alerts to be **grouped** before notifying. 888 | 889 | 📌 **Why use `groupWait`?** 890 | If an application crashes and **multiple alerts** fire at once, it **waits 30 seconds** to group them instead of sending them separately. 
891 | 892 | --- 893 | 894 | #### **🔹 `groupInterval`** 895 | 896 | ```yaml 897 | groupInterval: 5m 898 | ``` 899 | 900 | - After sending the first alert, it waits **5 minutes** before sending another batch of alerts. 901 | - Ensures that alerts for **the same issue** are **not repeatedly sent** in a short period. 902 | 903 | 📌 **Example:** 904 | If 10 alerts fire within 5 minutes, only **one alert is sent** every 5 minutes. 905 | 906 | --- 907 | 908 | #### **🔹 `repeatInterval`** 909 | 910 | ```yaml 911 | repeatInterval: 12h 912 | ``` 913 | 914 | - If an alert **remains active**, it sends **a reminder every 12 hours**. 915 | - Prevents excessive notifications while ensuring unresolved issues are **not ignored**. 916 | 917 | 📌 **Example:** 918 | If an API server is down for 2 days, **reminders are sent every 12 hours** until it's resolved. 919 | 920 | --- 921 | 922 | #### **🔹 `receiver`** 923 | 924 | ```yaml 925 | receiver: "team-notifications" 926 | ``` 927 | 928 | - Defines which **receiver** will handle this alert group. 929 | - Here, alerts are sent to a receiver named **"team-notifications"** (configured below). 930 | 931 | 📌 **Why use receivers?** 932 | It helps direct alerts to the right **team** or **notification method** (e.g., Slack for engineers, PagerDuty for on-call teams). 933 | 934 | --- 935 | 936 | ### **2️⃣ Receiver Configuration** 937 | 938 | ```yaml 939 | receivers: 940 | - name: "team-notifications" 941 | emailConfigs: 942 | - to: "team@example.com" 943 | sendResolved: true 944 | ``` 945 | 946 | This section **defines how alerts are delivered**. 947 | 948 | #### **🔹 Receiver Name** 949 | 950 | ```yaml 951 | - name: "team-notifications" 952 | ``` 953 | 954 | - This **matches the receiver name** in `route.receiver`, meaning alerts will be **sent here**. 955 | 956 | --- 957 | 958 | #### **🔹 Email Notification (`emailConfigs`)** 959 | 960 | ```yaml 961 | emailConfigs: 962 | - to: "team@example.com" 963 | sendResolved: true 964 | ``` 965 | 966 | - **`to: "team@example.com"`** → Sends alerts to this **email address**. 967 | - **`sendResolved: true`** → Sends **another email** when the issue is **fixed**. 968 | 969 | 📌 **Why use `sendResolved: true`?** 970 | 971 | - Without it, users only get an alert when an issue occurs. 972 | - With it, users **also get a notification when the issue is resolved**. 973 | 974 | --- 975 | 976 | ### **🔹 Other Notification Methods** 977 | 978 | Instead of **emailConfigs**, you can also use **Slack, PagerDuty, or Webhooks**: 979 | 980 | ✅ **Slack Notification Example** 981 | 982 | ```yaml 983 | slackConfigs: 984 | - channel: "#alerts" 985 | apiURL: "https://hooks.slack.com/services/XXX/YYY/ZZZ" 986 | sendResolved: true 987 | ``` 988 | 989 | ✅ **PagerDuty Notification Example** 990 | 991 | ```yaml 992 | pagerdutyConfigs: 993 | - serviceKey: "your-pagerduty-key" 994 | sendResolved: true 995 | ``` 996 | 997 | 📌 **Why use multiple receivers?** 998 | 999 | - Send **Critical alerts to PagerDuty**. 1000 | - Send **Warning alerts to Slack**. 1001 | - Send **All alerts to email**. 1002 | 1003 | --- 1004 | 1005 | ### **Summary** 1006 | 1007 | ✅ This **AlertmanagerConfig** will: 1008 | 1009 | 1. **Group alerts by severity** (`Critical`, `Warning`). 1010 | 2. **Delay the first alert by 30s** to prevent spam. 1011 | 3. **Send only one alert every 5 minutes** for ongoing issues. 1012 | 4. **Repeat unresolved alerts every 12 hours**. 1013 | 5. **Send alerts to `team@example.com` via email**. 1014 | 6. **Notify when an alert is resolved** (`sendResolved: true`). 
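To see the configuration Alertmanager actually ended up with (a hedged check — the Service name below follows this guide's naming convention, so confirm it first with `kubectl get svc -n monitoring`), port-forward the Alertmanager UI and inspect its **Status** page:

```bash
# Expose the Alertmanager UI locally on port 9093
kubectl port-forward svc/prometheus-stack-alertmanager -n monitoring 9093:9093
```

Then open **[http://localhost:9093](http://localhost:9093)** and check that the routes and receivers from your AlertmanagerConfig appear in the merged configuration.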
1015 | 1016 | --- 1017 | 1018 | ### **Real-World Example** 1019 | 1020 | Imagine you are running a **Kubernetes cluster**, and your **API server crashes**. 1021 | 1022 | 1. Prometheus detects the failure (`up == 0`). 1023 | 2. The **PrometheusRule** fires an alert. 1024 | 3. The **AlertmanagerConfig**: 1025 | - Waits 30s to group similar alerts. 1026 | - Sends an email to ****. 1027 | - If the issue is not resolved, it **reminds the team every 12 hours**. 1028 | 4. When the API server **recovers**, an email **confirmation is sent**. 1029 | 1030 | --- 1031 | 1032 | ## **Author & Community** 1033 | 1034 | This project is crafted by **[Harshhaa](https://github.com/NotHarshhaa)** 💡. 1035 | I’d love to hear your feedback! Feel free to share your thoughts. 1036 | 1037 | --- 1038 | 1039 | ### **Connect with me:** 1040 | 1041 | [![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://linkedin.com/in/harshhaa-vardhan-reddy) [![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/NotHarshhaa) [![Telegram](https://img.shields.io/badge/Telegram-26A5E4?style=for-the-badge&logo=telegram&logoColor=white)](https://t.me/prodevopsguy) [![Dev.to](https://img.shields.io/badge/Dev.to-0A0A0A?style=for-the-badge&logo=dev.to&logoColor=white)](https://dev.to/notharshhaa) [![Hashnode](https://img.shields.io/badge/Hashnode-2962FF?style=for-the-badge&logo=hashnode&logoColor=white)](https://hashnode.com/@prodevopsguy) 1042 | 1043 | --- 1044 | 1045 | ### 📢 **Stay Connected** 1046 | 1047 | ![Follow Me](https://imgur.com/2j7GSPs.png) 1048 | -------------------------------------------------------------------------------- /Prometheus-lab/k8s-yaml/Alertmanagerconfig.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: monitoring.coreos.com/v1 2 | kind: AlertmanagerConfig 3 | metadata: 4 | name: alertmanager-config 5 | labels: 6 | release: prometheus # Ensures correct association with Prometheus instance 7 | spec: 8 | route: 9 | groupBy: ["alertname", "severity"] # Group alerts by alert name and severity level 10 | groupWait: 30s # Wait time before sending grouped alerts 11 | groupInterval: 5m # Interval between alert groups 12 | repeatInterval: 12h # Frequency of repeated alerts for unresolved issues 13 | receiver: "default-receiver" # Default receiver for alerts 14 | 15 | receivers: 16 | - name: "email-team" # Email notification configuration 17 | emailConfigs: 18 | - to: "team@example.com" # Replace with actual team email 19 | from: "alerts@yourdomain.com" # Replace with your email sender 20 | smarthost: "smtp.yourmail.com:587" # Replace with SMTP server 21 | authUsername: "your-username" 22 | authPassword: 23 | name: email-secret 24 | key: password 25 | sendResolved: true # Send notifications when issues are resolved 26 | 27 | - name: "slack-notifications" # Slack notification configuration 28 | slackConfigs: 29 | - channel: "#alerts" # Replace with Slack channel name 30 | apiURL: 31 | name: slack-secret 32 | key: webhook-url 33 | sendResolved: true 34 | title: "{{ .CommonAnnotations.summary }}" 35 | text: "{{ .CommonAnnotations.description }}" 36 | 37 | - name: "webhook-receiver" # Webhook integration (e.g., PagerDuty, custom APIs) 38 | webhookConfigs: 39 | - url: "https://your-webhook-url.com/alert" 40 | sendResolved: true 41 | -------------------------------------------------------------------------------- 
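Note that the AlertmanagerConfig above pulls its SMTP password and Slack webhook URL from Kubernetes Secrets. A minimal, hedged sketch of creating them (the names and keys match the manifest, the values are placeholders, and the Secrets must live in the same namespace as the AlertmanagerConfig):

```bash
# Secret backing emailConfigs.authPassword
kubectl create secret generic email-secret --from-literal=password='your-smtp-password'

# Secret backing slackConfigs.apiURL
kubectl create secret generic slack-secret --from-literal=webhook-url='https://hooks.slack.com/services/XXX/YYY/ZZZ'
```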
/Prometheus-lab/k8s-yaml/Deployment.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: my-deployment # Name of the deployment 5 | labels: 6 | app: api # Label for identifying the application type 7 | spec: 8 | replicas: 3 # Ensures high availability by running 3 replicas 9 | strategy: 10 | type: RollingUpdate # Enables zero-downtime updates 11 | rollingUpdate: 12 | maxUnavailable: 1 # Only 1 pod can be unavailable during an update 13 | maxSurge: 1 # Allows 1 extra pod to be created during an update 14 | selector: 15 | matchLabels: 16 | app: api # Match labels to select pods for this deployment 17 | template: 18 | metadata: 19 | labels: 20 | app: api # Labels for pods created by this template 21 | spec: 22 | containers: 23 | - name: my-container # Name of the container 24 | image: panchanandevops/myexpress:v0.1.0 # Docker image for the container 25 | imagePullPolicy: Always # Ensures the latest image is pulled on every deployment 26 | resources: 27 | limits: 28 | memory: "256Mi" # Increased memory for better performance 29 | cpu: "500m" # CPU limit for the container 30 | requests: 31 | memory: "128Mi" # Requested memory for initial allocation 32 | cpu: "250m" # Requested CPU to ensure smooth operation 33 | ports: 34 | - containerPort: 3000 # Port on which the container will listen 35 | readinessProbe: # Ensures the pod is ready before traffic is sent 36 | httpGet: 37 | path: /health 38 | port: 3000 39 | initialDelaySeconds: 5 40 | periodSeconds: 10 41 | livenessProbe: # Checks if the application is still running 42 | httpGet: 43 | path: /health 44 | port: 3000 45 | initialDelaySeconds: 10 46 | periodSeconds: 20 47 | restartPolicy: Always # Ensures the container restarts on failure 48 | nodeSelector: 49 | kubernetes.io/os: linux # Ensures the deployment runs on Linux nodes 50 | -------------------------------------------------------------------------------- /Prometheus-lab/k8s-yaml/PrometheusRule.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: monitoring.coreos.com/v1 2 | kind: PrometheusRule 3 | metadata: 4 | name: api-prometheus-rule 5 | labels: 6 | release: prometheus # Label for release association 7 | spec: 8 | groups: 9 | - name: api-alerts # Improved naming convention for clarity 10 | rules: 11 | # Alert for when the target instance is down 12 | - alert: InstanceDown 13 | expr: up == 0 14 | for: 1m # Alert only triggers if the instance is down for 1 minute 15 | labels: 16 | severity: critical 17 | team: backend 18 | annotations: 19 | summary: "Instance {{ $labels.instance }} is down" 20 | description: "Prometheus target instance {{ $labels.instance }} has been unreachable for over 1 minute." 21 | 22 | # Alert for high CPU usage (above 80% for 5 minutes) 23 | - alert: HighCPUUsage 24 | expr: (100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80 25 | for: 5m 26 | labels: 27 | severity: warning 28 | team: backend 29 | annotations: 30 | summary: "High CPU Usage on {{ $labels.instance }}" 31 | description: "CPU usage on instance {{ $labels.instance }} has been above 80% for the last 5 minutes." 
32 | 33 | # Alert for low memory availability (less than 10% free for 5 minutes) 34 | - alert: LowMemory 35 | expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10 36 | for: 5m 37 | labels: 38 | severity: warning 39 | team: backend 40 | annotations: 41 | summary: "Low Available Memory on {{ $labels.instance }}" 42 | description: "Available memory on instance {{ $labels.instance }} is below 10% for over 5 minutes." 43 | 44 | # Alert for high disk usage (above 90% for 10 minutes) 45 | - alert: HighDiskUsage 46 | expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10 47 | for: 10m 48 | labels: 49 | severity: critical 50 | team: backend 51 | annotations: 52 | summary: "High Disk Usage on {{ $labels.instance }}" 53 | description: "Disk usage on instance {{ $labels.instance }} has exceeded 90% for the last 10 minutes." 54 | -------------------------------------------------------------------------------- /Prometheus-lab/k8s-yaml/Service-monitor.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: monitoring.coreos.com/v1 2 | kind: ServiceMonitor 3 | metadata: 4 | name: api-service-monitor 5 | labels: 6 | release: prometheus # Ensures association with Prometheus 7 | app: prometheus 8 | team: backend # Assigns the monitor to a specific team 9 | spec: 10 | jobLabel: api-monitor # More descriptive job label 11 | selector: 12 | matchLabels: 13 | app: api # Matches the app label for monitoring 14 | namespaceSelector: 15 | matchNames: 16 | - default # Ensures the correct namespace is targeted (modify as needed) 17 | endpoints: 18 | - port: web # Monitored port 19 | path: /swagger-stats/metrics # Path where Prometheus scrapes metrics 20 | interval: 15s # Scrape interval (adjust based on load) 21 | scrapeTimeout: 10s # Timeout for scraping metrics 22 | honorLabels: true # Preserves metric labels from the application 23 | relabelings: # Modifies metric labels dynamically 24 | - sourceLabels: [__meta_kubernetes_pod_node_name] 25 | targetLabel: node 26 | - sourceLabels: [__meta_kubernetes_namespace] 27 | targetLabel: namespace 28 | - sourceLabels: [__meta_kubernetes_pod_name] 29 | targetLabel: pod 30 | - sourceLabels: [__meta_kubernetes_pod_container_name] 31 | targetLabel: container 32 | - sourceLabels: [__meta_kubernetes_pod_label_app] 33 | targetLabel: app 34 | - sourceLabels: [__meta_kubernetes_pod_label_release] 35 | targetLabel: release 36 | - sourceLabels: [__meta_kubernetes_pod_label_team] 37 | targetLabel: team 38 | - sourceLabels: [__meta_kubernetes_pod_label_version] 39 | targetLabel: version 40 | - sourceLabels: [__meta_kubernetes_pod_label_component] 41 | targetLabel: component 42 | - sourceLabels: [__meta_kubernetes_pod_label_managed_by] 43 | targetLabel: managed_by 44 | - sourceLabels: [__meta_kubernetes_pod_label_created_by] 45 | targetLabel: created_by 46 | - sourceLabels: [__meta_kubernetes_pod_label_deployment] 47 | targetLabel: deployment 48 | - sourceLabels: [__meta_kubernetes_pod_label_service] 49 | targetLabel: service 50 | - sourceLabels: [__meta_kubernetes_pod_label_environment] 51 | targetLabel: environment 52 | - sourceLabels: [__meta_kubernetes_pod_label_tier] 53 | targetLabel: tier 54 | - sourceLabels: [__meta_kubernetes_pod_label_partition] 55 | targetLabel: partition 56 | - sourceLabels: [__meta_kubernetes_pod_label_track] 57 | targetLabel: track 58 | - sourceLabels: [__meta_kubernetes_pod_label_role] 59 | targetLabel: role 60 | - sourceLabels: 
[__meta_kubernetes_pod_label_zone] 61 | targetLabel: zone 62 | - sourceLabels: [__meta_kubernetes_pod_label_region] 63 | targetLabel: region 64 | - sourceLabels: [__meta_kubernetes_pod_label_cluster] 65 | targetLabel: cluster 66 | -------------------------------------------------------------------------------- /Prometheus-lab/k8s-yaml/Service.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Service 3 | metadata: 4 | name: my-service 5 | labels: 6 | job: node-api # Identifies the job associated with this service 7 | app: api 8 | environment: production # Helps differentiate between environments (dev/staging/prod) 9 | spec: 10 | type: ClusterIP # Internal service (change to NodePort or LoadBalancer if needed) 11 | selector: 12 | app: api # Matches pods labeled with 'app: api' 13 | ports: 14 | - name: http # Descriptive name for the port 15 | protocol: TCP 16 | port: 3000 # Exposed service port 17 | targetPort: 3000 # Corresponding container port 18 | sessionAffinity: None # Ensures no sticky sessions (modify if needed) 19 | sessionAffinityConfig: 20 | clientIP: 21 | timeoutSeconds: 10800 # 3 hours 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🚀 **Learning Prometheus: A Complete Guide for Kubernetes Monitoring** 2 | 3 | ![Prometheus](https://imgur.com/0lYXGvg.png) 4 | 5 | ## 🔍 **Master Prometheus for Real-Time Monitoring & Observability** 6 | 7 | ![Prometheus](https://imgur.com/EZe96QW.png) 8 | 9 | This repository is dedicated to learning, implementing, and deploying **Prometheus** for monitoring Kubernetes environments. Whether you're a beginner or an experienced DevOps engineer, this guide will help you master Prometheus with real-world use cases. 10 | 11 | --- 12 | 13 | ## 📌 **Repository Structure** 14 | 15 | ### 📂 **1. Prometheus-lab/** 16 | 17 | This directory contains hands-on labs and YAML manifest files for deploying Prometheus in Kubernetes. 18 | 19 | #### 📌 **k8s-yaml/** *(Kubernetes Deployment Manifests)* 20 | 21 | - `Alertmanagerconfig.yaml` - Configuration for Prometheus Alertmanager to handle alerts. 22 | - `Deployment.yaml` - Defines the Prometheus deployment in Kubernetes. 23 | - `PrometheusRule.yaml` - Alerting rules for Prometheus monitoring. 24 | - `Service-monitor.yaml` - ServiceMonitor definition for scraping Prometheus metrics. 25 | - `Service.yaml` - Kubernetes service to expose Prometheus. 26 | - `README.md` - Documentation on setting up Prometheus in Kubernetes. 27 | 28 | ### 📂 **2. promql-img/** 29 | 30 | - This folder contains images used to explain PromQL queries and dashboard visualizations. 31 | 32 | ### 📜 **3. promQl.md** 33 | 34 | - A guide to **PromQL (Prometheus Query Language)**, including syntax, functions, and real-world query examples. 35 | 36 | ### 📜 **4. prometheus_setup.md** 37 | 38 | - Step-by-step instructions for installing and setting up Prometheus. 39 | 40 | ### 📜 **5. README.md** *(This file)* 41 | 42 | - The main documentation file for understanding the structure and content of the repository. 
43 | 44 | --- 45 | 46 | ## 📖 **Detailed Learning Guide** 47 | 48 | 📌 **Read the full tutorial here:** 49 | 🔗 **[Real-world Prometheus Deployment: A Practical Guide for Kubernetes Monitoring](https://blog.prodevopsguy.xyz/real-world-prometheus-deployment-a-practical-guide-for-kubernetes-monitoring)** 50 | 51 | --- 52 | 53 | ## 🚀 **What You'll Learn?** 54 | 55 | ✅ **Prometheus Fundamentals:** Understand Prometheus architecture, data collection, and querying. 56 | ✅ **Kubernetes Monitoring:** Learn how to integrate Prometheus with Kubernetes for system metrics and application observability. 57 | ✅ **PromQL (Prometheus Query Language):** Master querying techniques for efficient monitoring and alerting. 58 | ✅ **Grafana Integration:** Visualize Prometheus metrics using Grafana dashboards. 59 | ✅ **Alerting & Notifications:** Set up alert rules and integrate with Slack, Email, and other services. 60 | ✅ **Custom Exporters:** Learn to create and configure custom exporters for collecting application-specific metrics. 61 | ✅ **Scaling Prometheus:** Implement high-availability and federation strategies. 62 | 63 | --- 64 | 65 | ## **Code of Conduct** 66 | 67 | > [!CAUTION] 68 | > 69 | > We are committed to fostering a welcoming and respectful environment for all contributors. Please take a moment to review our [Code of Conduct](./CODE_OF_CONDUCT.md) before participating in this community. 70 | 71 | --- 72 | 73 | ## **Contribute and Collaborate** 74 | 75 | > [!TIP] 76 | > This repository thrives on community contributions and collaboration. Here’s how you can get involved: 77 | > 78 | > - **Fork the Repository:** Create your own copy of the repository to work on. 79 | > - **Submit Pull Requests:** Contribute your projects or improvements to existing projects by submitting pull requests. 80 | > - **Engage with Others:** Participate in discussions, provide feedback on others’ projects, and collaborate to create better solutions. 81 | > - **Share Your Knowledge:** If you’ve developed a new project or learned something valuable, share it with the community. Your contributions can help others in their learning journey. 82 | 83 | --- 84 | 85 | ## **Join the Community** 86 | 87 | > [!IMPORTANT] 88 | > We encourage you to be an active part of our community: 89 | > 90 | > - **Join Our Telegram Community:** Connect with fellow DevOps enthusiasts, ask questions, and share your progress in our [Telegram group](https://t.me/prodevopsguy). 91 | > - **Follow Me on GitHub:** Stay updated with new projects and content by [following me on GitHub](https://github.com/NotHarshhaa). 92 | 93 | --- 94 | 95 | ## **Hit the Star!** ⭐ 96 | 97 | **If you find this repository helpful and plan to use it for learning, please give it a star. Your support is appreciated!** 98 | 99 | --- 100 | 101 | ## 🛠️ **Author & Community** 102 | 103 | This project is crafted by **[Harshhaa](https://github.com/NotHarshhaa)** 💡. 104 | I’d love to hear your feedback! Feel free to share your thoughts. 
105 | 106 | --- 107 | 108 | ### 📧 **Connect with me:** 109 | 110 | [![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://linkedin.com/in/harshhaa-vardhan-reddy) [![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/NotHarshhaa) [![Telegram](https://img.shields.io/badge/Telegram-26A5E4?style=for-the-badge&logo=telegram&logoColor=white)](https://t.me/prodevopsguy) [![Dev.to](https://img.shields.io/badge/Dev.to-0A0A0A?style=for-the-badge&logo=dev.to&logoColor=white)](https://dev.to/notharshhaa) [![Hashnode](https://img.shields.io/badge/Hashnode-2962FF?style=for-the-badge&logo=hashnode&logoColor=white)](https://hashnode.com/@prodevopsguy) 111 | 112 | --- 113 | 114 | ### 📢 **Stay Connected** 115 | 116 | ![Follow Me](https://imgur.com/2j7GSPs.png) 117 | -------------------------------------------------------------------------------- /promQl.md: -------------------------------------------------------------------------------- 1 | - [Decoding PromQL: A Deep Dive into Prometheus Query Language](#decoding-promql-a-deep-dive-into-prometheus-query-language) 2 | - [Introduction to PromQL](#introduction-to-promql) 3 | - [Data Types of PromQL](#data-types-of-promql) 4 | - [Scalar:](#scalar) 5 | - [String:](#string) 6 | - [Instant Vector:](#instant-vector) 7 | - [Range Vector:](#range-vector) 8 | - [Operators in PromQL](#operators-in-promql) 9 | - [Aggregation Operators:](#aggregation-operators) 10 | - [Binary Operators:](#binary-operators) 11 | - [Range Operator:](#range-operator) 12 | - [Offset Operator:](#offset-operator) 13 | - [Types of Prometheus Metrics for Data Storage and Organization](#types-of-prometheus-metrics-for-data-storage-and-organization) 14 | - [Counter:](#counter) 15 | - [Gauge:](#gauge) 16 | - [Histogram:](#histogram) 17 | - [Summary:](#summary) 18 | - [Begin Your Monitoring Journey!](#begin-your-monitoring-journey) 19 | --- 20 | 21 | 22 | # Decoding PromQL: A Deep Dive into Prometheus Query Language 23 | --- 24 | ## Introduction to PromQL 25 | 26 | PromQL, short for Prometheus Query Language, is the dedicated language designed for **querying** and **extracting valuable insights** from the **time-series** data stored in Prometheus. As the backbone of Prometheus' querying capabilities, PromQL enables users to navigate and analyze metrics, providing a powerful tool for **monitoring** and **troubleshooting.** This section will provide a foundational understanding of PromQL, setting the stage for exploring its various aspects and applications in the realm of system observability. 27 | 28 | ## Data Types of PromQL 29 | 30 | PromQL has **scalar**, **instant vector**, **range vector**, **string**, and **boolean** data types for **querying** and **analyzing** Prometheus metrics. 31 | 32 | Lets explore each in detail. 33 | 34 | ### Scalar: 35 | 36 | The scalar data type in Prometheus Query Language (PromQL) represents a single numeric value at a specific point in time. Scalars are fundamental to expressing instantaneous measurements or metrics that don't vary over a range. 37 | 38 | **Key characteristics of the scalar data type:** 39 | 40 | 1. **Single Value:** Scalars represent a solitary numeric value at a specific instance in time, offering a snapshot of a metric's value without time series information. 41 | 42 | 2. 
**Direct Representation:** Scalars are direct representations of metrics' current states, making them suitable for instantaneous measurements such as the current CPU load, available memory, or a count at a specific time. 43 | 44 | 3. **Fundamental Building Block:** Scalars serve as fundamental components for arithmetic operations, comparisons, and as basic inputs for computations and analyses within PromQL. 45 | 46 | **Here are examples illustrating the characteristics of the scalar data type in PromQL:** 47 | 48 | 49 | 1. **Single Value:** 50 | 51 | ``` 52 | cpu_temperature 53 | ``` 54 | - Returns the current temperature of the CPU as a scalar value, representing the temperature at the latest timestamp. 55 | 56 | 57 | 2. **Direct Representation:** 58 | 59 | ``` 60 | available_memory_bytes 61 | ``` 62 | - The query returns the current available memory in bytes as a scalar, providing a direct representation of the available memory at the latest observation. 63 | 64 | 65 | 3. **Fundamental Building Block:** 66 | 67 | ``` 68 | cpu_usage + memory_usage 69 | ``` 70 | - In this example, the query adds the current CPU usage and memory usage, leveraging scalars as fundamental building blocks for arithmetic operations and analysis. 71 | 72 | ### String: 73 | 74 | The string data type in Prometheus represents a sequence of characters and is commonly used for labeling and metadata in the metric data model. 75 | 76 | **Key characteristics of the string data type in Prometheus:** 77 | 78 | 1. **Character Sequence:** Strings represent a sequence of characters, which can include letters, numbers, symbols, and whitespace. 79 | 80 | 2. **Labeling:** Strings are commonly used as labels in the metric data model, providing additional context or categorization for time series data. 81 | 82 | 3. **Metadata:** Strings serve as a means to convey metadata information, such as labels describing service names, instance identifiers, or any descriptive information associated with a metric. 83 | 84 | 4. **Quoting:** Strings in PromQL are enclosed in either single quotes (`'`) or double quotes (`"`), and they are used to define label values or string literals in queries. 85 | 86 | **Here are examples illustrating the characteristics of the string data type in Prometheus:** 87 | 88 | 1. **Character Sequence:** 89 | ```plaintext 90 | 'service_name' 91 | ``` 92 | - This string literal represents the character sequence 'service_name' and might be used as a label value. 93 | 94 | 2. **Labeling:** 95 | ```plaintext 96 | http_requests_total{environment="production"} 97 | ``` 98 | - In this query, the string "production" is used as a label value to filter for HTTP requests in the production environment. 99 | 100 | 3. **Metadata:** 101 | ```plaintext 102 | instance="webserver-01" 103 | ``` 104 | - This string label provides metadata about the specific instance 'webserver-01' associated with a metric. 105 | 106 | 4. **Quoting:** 107 | ```plaintext 108 | "metric_name" or 'metric_name' 109 | ``` 110 | - Strings in PromQL are enclosed in either single or double quotes, as shown in these examples defining the string literals "metric_name" or 'metric_name'. 111 | 112 | 5. **Usage:** 113 | ```plaintext 114 | up{job="api", environment='staging'} 115 | ``` 116 | - This query selects time series with the label "job" having the value "api" and the label "environment" having the value 'staging'. 
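
Building on the labeling examples above, string label values can also be matched with regular expressions using the `=~` and `!~` matchers (covered later under operators). A small illustrative sketch — the metric and label names here are assumptions, not taken from a specific exporter:

```plaintext
# Select series whose "environment" label matches either value.
http_requests_total{environment=~"production|staging"}

# Exclude jobs whose name starts with "test-".
up{job!~"test-.*"}
```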
117 | 118 | ### Instant Vector: 119 | 120 | The instant vector in Prometheus is a set of time series data, each associated with a single value at a specific point in time. Instant vectors are commonly used in PromQL queries to retrieve and analyze metric values at a specific timestamp. 121 | 122 | **Key characteristics of the instant vector:** 123 | 124 | 1. **Snapshot in Time:** Instant vectors represent a snapshot of metric values at a specific timestamp, providing a point-in-time view of the data. 125 | 126 | 2. **Single Value per Time Series:** Each time series within an instant vector contains a single value corresponding to the specified timestamp. 127 | 128 | 3. **Time Series Selection:** Queries using instant vectors can filter and aggregate time series based on labels, allowing for targeted analysis of specific metrics. 129 | 130 | 4. **Mathematical Operations:** Instant vectors can be involved in mathematical operations, enabling calculations, comparisons, and transformations within PromQL queries. 131 | 132 | 133 | **Here are examples illustrating the characteristics of the instant vector data type in Prometheus:** 134 | 135 | 136 | 1. **Current CPU Usage:** 137 | ```plaintext 138 | cpu_usage 139 | ``` 140 | - This query returns an instant vector representing the current CPU usage for all relevant time series. 141 | 142 | 2. **High Memory Utilization Instances:** 143 | ```plaintext 144 | node_memory_MemUsage_bytes > 80e9 145 | ``` 146 | - The query filters instant vectors to select time series where the memory usage exceeds 80 gigabytes. 147 | 148 | 3. **Rate of HTTP Requests:** 149 | ```plaintext 150 | rate(http_requests_total[5m]) 151 | ``` 152 | - This query calculates the per-second rate of HTTP requests over the last 5 minutes, returning an instant vector. 153 | 154 | 4. **Combined Network Traffic:** 155 | ```plaintext 156 | sum(rate(network_traffic_bytes{direction="in"}[1h])) + sum(rate(network_traffic_bytes{direction="out"}[1h])) 157 | ``` 158 | - The query computes the sum of the rates of incoming and outgoing network traffic over the last hour, providing an instant vector. 159 | 160 | ### Range Vector: 161 | 162 | The Range Vector in Prometheus is a set of time series data, each associated with a range of values over a specified time interval. Range Vectors are commonly used in PromQL queries to analyze and evaluate metrics over a duration, allowing for calculations, aggregations, and comparisons over time. 163 | 164 | Key characteristics of the Range Vector data type: 165 | 166 | 1. **Time Series Over a Range:** 167 | - The Range Vector provides a set of time series data, each representing a range of values over a specified time interval. 168 | 169 | 2. **Single Value per Time Series per Timestamp:** 170 | - Each time series within a Range Vector contains a set of values corresponding to multiple timestamps within the specified interval. 171 | 172 | 3. **Time Series Selection:** 173 | - Similar to instant vectors, Range Vectors can filter and aggregate time series based on labels for targeted analysis. 174 | 175 | 4. **Time Shifts:** 176 | - PromQL allows for time shifting operations on Range Vectors, such as using the offset modifier to shift the time range, enabling comparison of values at different points in time. 177 | 178 | 5. **Alerting Conditions:** 179 | - Range Vectors are commonly used in alerting conditions to detect and trigger alerts based on abnormal behavior or patterns observed over a defined duration. 

**Here are examples illustrating the characteristics of the Range Vector data type in Prometheus:**

1. **Sum of HTTP Request Rates over the Last 5 Minutes:**
   ```plaintext
   sum(rate(http_requests_total[5m]))
   ```
   - This query calculates the sum of the per-second rate of HTTP requests over the last 5 minutes.

2. **Average CPU Usage over the Last Hour:**
   ```plaintext
   avg_over_time(cpu_usage_percent[1h])
   ```
   - The query computes the average CPU usage percentage over the last hour. Note that `avg_over_time()` works on a range vector; plain `avg()` only aggregates instant vectors.

3. **Total Disk Space Used in Bytes, as of 30 Minutes Ago:**
   ```plaintext
   sum(node_filesystem_size_bytes - node_filesystem_free_bytes) offset 30m
   ```
   - This query determines the total disk space used by subtracting free space from total size, evaluated as it was 30 minutes ago via the `offset` modifier.

4. **Rate of Error Responses in the Past 15 Minutes:**
   ```plaintext
   rate(http_responses_error_total[15m])
   ```
   - The query calculates the per-second rate of error responses in HTTP requests over the past 15 minutes.

5. **Changes in Available Memory over the Last 10 Minutes:**
   ```plaintext
   changes(node_memory_MemAvailable_bytes[10m])
   ```
   - This query identifies the number of times the available-memory value changed over the last 10 minutes.

6. **90th Percentile Response Time for API Requests in the Last 20 Minutes:**
   ```plaintext
   histogram_quantile(0.9, sum by (le) (rate(api_request_duration_seconds_bucket[20m])))
   ```
   - The query computes the 90th percentile response time for API requests over the last 20 minutes from histogram buckets.

### Boolean:

The boolean data type in Prometheus represents true or false values and is commonly used in logical expressions and conditions within PromQL queries.

**Key characteristics of the boolean data type:**

1. **True or False:** Booleans can have two possible values: true or false, representing binary logic.

2. **Logical Operators:** Boolean conditions are combined with the logical/set operators `and`, `or`, and `unless` to construct conditional expressions in PromQL queries.

3. **Comparison Operators:** Boolean expressions often involve comparison operators like `==`, `!=`, `<`, `>`, `<=`, and `>=` to evaluate conditions based on metric values.

4. **Filtering Conditions:** Booleans are used to filter time series data based on specific conditions, allowing for selective analysis and alerting.

**Here are examples illustrating the characteristics of the Boolean data type in Prometheus:**

1. **Combining Conditions:**
   ```plaintext
   up == 1 and http_requests_total > 100
   ```
   - This query combines two conditions using the logical AND operator, checking if the `up` metric is equal to 1 and the `http_requests_total` metric is greater than 100.

2. **Negating Conditions:**
   ```plaintext
   up{job="api"} != 1
   ```
   - Here, the `!=` comparison keeps only the time series for the `api` job whose `up` value is not 1. PromQL has no standalone `not` operator; negation is expressed with `!=`, `!~`, or `unless`.

3. **Conditional Expression:**
   ```plaintext
   rate(http_requests_total[5m]) > 10 or (node_memory_MemFree_bytes / node_memory_MemTotal_bytes) < 0.2
   ```
   - This query uses a conditional expression, checking if the per-second rate of HTTP requests over the last 5 minutes is greater than 10 or if the ratio of free memory to total memory is less than 0.2.
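
One related behavior is worth keeping in mind: by default, comparison operators *filter* series (non-matching series are simply dropped) rather than returning a true/false value. Adding the `bool` modifier makes the comparison return 0 or 1 instead, which is useful for dashboards and for doing arithmetic on conditions. A minimal sketch using the standard `up` metric:

```plaintext
# Without bool: only series where up equals 1 are returned (filtering).
up == 1

# With bool: every series is returned, with a value of 1 (true) or 0 (false).
up == bool 1
```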

---

## Operators in PromQL

### Aggregation Operators:

Aggregation operators take an **instant vector** as **input** and **return** a new instant vector with aggregated values as their **output**:
`<aggregation>(<instant vector>) => <instant vector>`

PromQL includes various aggregation operators, along with closely related functions, that serve different purposes in summarizing and analyzing time series data. Here are some common types (a grouping example follows the list):

1. **Basic Aggregation:**
   - `sum()`: Calculates the total sum of values across time series.
   - `avg()`: Computes the average value across time series.
   - `min()`: Identifies the minimum value across time series.
   - `max()`: Identifies the maximum value across time series.
   - `count()`: Counts the number of time series matching a given condition.

2. **Rate and Increase** (strictly functions that take a range vector, commonly combined with aggregation):
   - `rate()`: Calculates the per-second rate of increase for counters.
   - `increase()`: Computes the total increase in a counter over a specified time range.

3. **Statistical Aggregation:**
   - `stddev()`: Computes the standard deviation of values across time series.
   - `quantile()`: Calculates specified quantiles (e.g., 90th percentile) of values.

4. **Top and Bottom K:**
   - `topk()`: Identifies the top k time series based on a specified metric.
   - `bottomk()`: Identifies the bottom k time series based on a specified metric.

5. **Time Series Aggregation** (functions that take a range vector):
   - `sum_over_time()`: Aggregates total sums over a specified time range.
   - `avg_over_time()`: Aggregates average values over a specified time range.
   - `min_over_time()`: Aggregates minimum values over a specified time range.
   - `max_over_time()`: Aggregates maximum values over a specified time range.
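
As referenced above, aggregation operators can be combined with `by` and `without` clauses to control which labels are kept while grouping. A brief sketch — the metric and label names here are illustrative assumptions rather than the output of a specific exporter:

```plaintext
# Per-instance request rate, keeping only the "instance" label.
sum by (instance) (rate(http_requests_total[5m]))

# The same idea expressed by dropping labels instead of keeping them.
sum without (method, status) (rate(http_requests_total[5m]))

# Top 3 instances by average CPU usage.
topk(3, avg by (instance) (cpu_usage_percent))
```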

### Binary Operators:

1. **Arithmetic Binary Operators:**
   - `+` (Addition), `-` (Subtraction), `*` (Multiplication), `/` (Division)

2. **Comparison Binary Operators:**
   - `==` (Equal), `!=` (Not Equal), `<` (Less Than), `>` (Greater Than), `<=` (Less Than or Equal), `>=` (Greater Than or Equal)

3. **Logical/Set Binary Operators:**
   - `and` (Intersection), `or` (Union), `unless` (Complement — keeps series on the left that have no match on the right)

4. **Label Matchers:**
   - `=~` (Regex Match), `!~` (Negative Regex Match) — strictly these are selector matchers used inside `{}`, not binary operators between vectors.

5. **Mathematical Binary Operators:**
   - `^` (Exponentiation), `%` (Modulo)

### Range Operator:

With the range operator, you specify a time duration in square brackets to select samples between now and that far back in time:

`[<time duration>]`

**Time Durations Format:**
Time durations are specified as a number, followed immediately by one of the following units:

- `ms` - milliseconds
- `s` - seconds
- `m` - minutes
- `h` - hours
- `d` - days (assuming a day always has 24h)
- `w` - weeks (assuming a week always has 7d)
- `y` - years (assuming a year always has 365d)

### Offset Operator:

The `offset` modifier is appended to a selector to evaluate it a fixed amount of time before the query's evaluation time.

1. **`offset` for Rate Shifting:**

   - Shifts the time range for rate comparison, allowing historical analysis.

   ```
   rate(http_requests_total[5m]) > rate(http_requests_total[5m] offset 1h)
   ```

2. **`offset` for Value Comparison:**
   - Compares metric values at different points in time, aiding in trend analysis.

   ```
   cpu_temperature > cpu_temperature offset 1d
   ```

3. **`offset` for Rate of Change:**
   - Uses offset to compare the current rate of change with the rate one hour earlier.

   ```
   rate(cpu_usage[1h]) - rate(cpu_usage[1h] offset 1h)
   ```

4. **`offset` for Historical Comparisons:**
   - Enables historical comparisons of metric values against a point in the past.

   ```
   http_requests_total > http_requests_total offset 7d
   ```


## Filter Data with the @ modifier in PromQL

The `@` modifier in PromQL is used to query the value of a time series at a specific timestamp. It allows retrieving the value of a metric at a precise point in time.

Let's understand it through an example:

```plaintext
http_requests_total @ 1632315600
```

In this example, `@ 1632315600` retrieves the value of the `http_requests_total` metric at the UNIX timestamp 1632315600 (representing a specific moment in time).

This enables querying historical or specific values of metrics at exact timestamps for analysis or comparison purposes.

---

## Types of Prometheus Metrics for Data Storage and Organization

Prometheus metrics primarily come in four types: **Counter**, **Gauge**, **Histogram**, and **Summary**.

Let's explore each in detail.

### Counter:

A Prometheus Counter is a metric type that represents a **cumulative value** that can **only monotonically increase** over time.

<div align="center">
389 | Counter Example 390 |
391 | 392 | 393 | **Key characteristics of a Prometheus Counter:** 394 | 395 | 1. **Monotonicity:** Counters always move in the positive direction, starting from zero and increasing over time. They do not decrease. 396 | 397 | 2. **Cumulative:** Counters represent cumulative values, making them suitable for tracking totals or counts of events that continuously accumulate. 398 | 399 | 3. **No Arbitrary Units:** Counters are dimensionless and have no specific unit attached to them. They represent a simple count or quantity. 400 | 401 | 4. **Common Use Cases:** Counters are often used for measuring the total number of occurrences of an event, such as request counts, error counts, or other cumulative metrics in a system. 402 | 403 | 5. **Querying for Rate:** Derivative operations in queries are commonly applied to counters to calculate rates of change over time, providing insights into the frequency of events. 404 | 405 | **Example Queries** 406 | 407 | 1. **Total Count:** 408 | ```plaintext 409 | http_requests_total 410 | ``` 411 | 412 | - Returns the total count of HTTP requests. 413 | 414 | 2. **Rate of Change (Requests Per Second):** 415 | ```plaintext 416 | rate(http_requests_total[1m]) 417 | ``` 418 | - Calculates the rate of change of HTTP requests per second over the last 1 minute. 419 | 420 | 3. **Error Rate as a Percentage:** 421 | ``` 422 | 100 * (http_errors_total / http_requests_total) 423 | ``` 424 | - Calculates the percentage of HTTP requests that resulted in errors. 425 | 426 | 4. **Increase in Count Since Last Hour:** 427 | ``` 428 | increase(http_requests_total[1h]) 429 | ``` 430 | - Shows the total increase in HTTP requests count over the last hour. 431 | 432 | ### Gauge: 433 | 434 | A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. 435 | 436 | Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests. 437 | 438 | 439 |
440 | Gauge Example 441 |
442 | 443 | 444 | **Key characteristics of a Gauge:** 445 | 446 | 1. **Non-Cumulative:** Gauges do not accumulate values over time; they represent the latest observed value at a specific point in time. 447 | 448 | 2. **Fluctuating Values:** Gauges can capture fluctuations in a metric, making them ideal for metrics that may vary, such as CPU usage, memory utilization, or the number of active connections. 449 | 450 | 3. **No Automatic Resets:** Gauges retain their last observed value until a new value is recorded. They do not reset automatically, allowing continuous monitoring of changing conditions. 451 | 452 | 4. **Arbitrary Units:** Gauges can have arbitrary units based on the metric they measure. For example, a gauge measuring temperature might have units in degrees Celsius or Fahrenheit. 453 | 454 | 455 | **Example Queries** 456 | 457 | 1. **Current CPU Usage:** 458 | 459 | ``` 460 | cpu_usage 461 | ``` 462 | - Returns the current CPU usage as a gauge value. 463 | 464 | 465 | 2. **Average Memory Utilization over 5 Minutes:** 466 | 467 | ``` 468 | avg_over_time(memory_usage[5m]) 469 | ``` 470 | - Calculates the average memory utilization as a gauge value over the last 5 minutes. 471 | 472 | 3. **Number of Active Connections:** 473 | 474 | ``` 475 | http_connections 476 | ``` 477 | - Retrieves the current count of active HTTP connections as a gauge value. 478 | 479 | 4. **Disk Space Utilization:** 480 | 481 | ``` 482 | 100 - (disk_free / disk_total) * 100 483 | ``` 484 | - Calculates the disk space utilization percentage as a gauge value. 485 | 486 | ### Histogram: 487 | 488 | A Prometheus Histogram is a metric type used to sample and observe the distribution of values in a dataset. It is particularly useful for measuring the spread of data, such as response times or request latencies. Histograms automatically bucketize data into configurable ranges (buckets) and provide aggregated information about the data distribution, including count, sum, and quantiles. 489 | 490 | 491 |
492 | Heatmap Histogram 493 |
494 | 495 | 496 | **Key characteristics of a Prometheus Histogram:** 497 | 498 | 1. **Bucketization:** Histograms automatically group observed values into predefined buckets based on their magnitudes. Each bucket represents a range of values. 499 | 500 | 2. **Dynamic Ranges:** Histograms allow dynamic adjustments of bucket ranges, making them adaptable to changes in the data distribution. 501 | 502 | 3. **Non-Cumulative:** Unlike counters, histograms do not accumulate values over time. They provide a snapshot of the data distribution at the time of observation. 503 | 504 | 4. **Bucket Labels:** Each bucket is labeled with an upper bound (`le` label) representing the maximum value that falls into that bucket. 505 | 506 | 5. **Querying Percentiles:** Prometheus provides functions like **`histogram_quantile`** to query specific percentiles of a histogram, allowing users to analyze the distribution of values. 507 | 508 | 509 | **Example Queries** 510 | 511 | 1. **Average Duration:** 512 | 513 | ``` 514 | rate(http_request_duration_seconds_sum[1m]) / rate(http_request_duration_seconds_count[1m]) 515 | ``` 516 | - Calculates the average duration of HTTP requests over the last 1 minute. 517 | 518 | 2. **90th Percentile Response Time:** 519 | 520 | ``` 521 | histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[1m])) by (le)) 522 | ``` 523 | - Retrieves the 90th percentile response time of HTTP requests over the last 1 minute. 524 | 525 | 3. **Bucket Counts:** 526 | 527 | ``` 528 | http_request_duration_seconds_count 529 | ``` 530 | - Returns the count of HTTP requests in each bucket of the duration histogram. 531 | 532 | 4. **Sum of Request Durations in Top 3 Buckets:** 533 | 534 | ``` 535 | sum(http_request_duration_seconds_bucket{le=~"0.1|0.5|1.0"}) 536 | ``` 537 | - Calculates the sum of request durations in the top 3 buckets (0.1s, 0.5s, 1.0s) of the duration histogram. 538 | 539 | ### Summary: 540 | 541 | A Prometheus Summary is a metric type designed to measure and track the distribution of observed values over time, particularly for quantiles and other percentile-based analyses. Similar to histograms, summaries provide insights into the variability and spread of data, but they do so by calculating quantiles over a sliding time window. Summaries are useful for monitoring metrics with changing distributions, such as request latencies. 542 | 543 | 544 | **Key characteristics of a Summary:** 545 | 546 | 1. **Dynamic Ranges:** Summaries allow dynamic adjustments of quantile ranges, making them adaptable to changes in the data distribution. 547 | 548 | 2. **No Explicit Bucketization:** Unlike histograms, summaries do not use predefined buckets. Instead, they directly calculate quantiles from the observed values. 549 | 550 | 3. **Count and Sum:** Summaries track the count of observations and the sum of observed values over time, providing aggregated information for the entire dataset. 551 | 552 | 4. **Non-Cumulative:** Similar to histograms, summaries do not accumulate values over time. They offer a snapshot of the data distribution within the specified time window. 553 | 554 | 5. **Querying Percentiles:** Prometheus provides functions like `quantile` to query specific percentiles of a summary, enabling users to analyze the distribution of values. 555 | 556 | **Example Queries** 557 | 558 | 1. 
**Average Duration over Last 5 Minutes:** 559 | 560 | ``` 561 | rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) 562 | ``` 563 | - Calculates the average duration of HTTP requests over the last 5 minutes. 564 | 565 | 2. **90th Percentile Response Time:** 566 | 567 | ``` 568 | quantile(0.9, http_request_duration_seconds) 569 | ``` 570 | - Retrieves the 90th percentile response time of HTTP requests. 571 | 572 | 3. **Count of Requests in the Last Hour:** 573 | 574 | ``` 575 | http_request_duration_seconds_count[1h] 576 | ``` 577 | - Returns the count of HTTP requests observed in the last hour. 578 | 579 | 4. **Sum of Response Durations in the Last 10 Minutes:** 580 | 581 | ``` 582 | sum(rate(http_request_duration_seconds_sum[10m])) 583 | ``` 584 | - Calculates the sum of response durations for HTTP requests over the last 10 minutes. 585 | 586 | --- 587 | 588 | ## Begin Your Monitoring Journey! 589 | 590 | Get ready to navigate your metrics with **confidence!** Stay tuned for more insights, tips, and tricks to keep your monitoring game strong. Keep exploring, keep learning, and keep monitoring! **Happy monitoring!** 📊👀😊 591 | 592 | 593 | 594 | -------------------------------------------------------------------------------- /prometheus_setup.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 3 | - [Table of Contents](#table-of-contents) 4 | - [What Is Prometheus?](#what-is-prometheus) 5 | - [Why Do We Need Prometheus?](#why-do-we-need-prometheus) 6 | - [Prometheus Architecture: In K8S](#prometheus-architecture-in-k8s) 7 | - [Deploying Prometheus on Kubernetes](#deploying-prometheus-on-kubernetes) 8 | - [Create a Namespace for Monitoring](#create-a-namespace-for-monitoring) 9 | - [Add Helm Repository](#add-helm-repository) 10 | - [Install kube-prometheus-stack Helm Chart in monitoring Namespace](#install-kube-prometheus-stack-helm-chart-in-monitoring-namespace) 11 | - [Verify Deployment](#verify-deployment) 12 | - [Access Prometheus Dashboard](#access-prometheus-dashboard) 13 | - [Access Grafana Dashboard](#access-grafana-dashboard) 14 | - [Login with the default credentials:](#login-with-the-default-credentials) 15 | - [Begin Your Monitoring Journey! 🚀](#begin-your-monitoring-journey-) 16 | 17 | 18 | ## What Is Prometheus? 19 | 20 | Prometheus, an open-source monitoring toolkit, excels in dynamic system monitoring with a versatile data model and efficient time-series collection. Notable for its built-in alerting, adaptability, and strong community support, Prometheus empowers users to proactively manage and optimize system performance. 21 | 22 | ### Why Do We Need Prometheus? 23 | 24 | Lets understand Through a **Real-World** Example. 25 | 26 | **Scenario: Managing a Real-Time Messaging App** 27 | 28 | Imagine overseeing a real-time messaging app connecting millions worldwide. The app includes services like user authentication, message processing, and notifications. As the user base grows, ensuring smooth communication becomes a top priority. 29 | 30 | **Challenges:** 31 | 32 | 1. **Interconnected Services:** 33 | 34 | - Messaging involves many services working together. Understanding how each service affects communication is crucial but complicated. 35 | 36 | 2. **Variable Workloads:** 37 | 38 | - Messaging apps deal with fluctuating workloads, especially during peak times. Predicting the exact resources needed for optimal performance is tricky, requiring a flexible approach to scaling. 
39 | 40 | 3. **Latency and Optimization:** 41 | - Fast message delivery is vital for a great user experience. Pinpointing services causing latency issues demands detailed insights often lacking in traditional monitoring tools. 42 | 43 | **How Prometheus Helps:** 44 | 45 | 1. **Dynamic Service Discovery:** 46 | 47 | - Prometheus automatically discovers and monitors new services as the app scales. No manual setup is needed, ensuring all parts are effectively monitored. 48 | 49 | 2. **Flexible Monitoring:** 50 | 51 | - Prometheus adapts to changing workloads by collecting time-series data. This helps in closely monitoring performance and making smart decisions on resource allocation and scaling. 52 | 53 | 3. **Alerts for Latency:** 54 | - Using Prometheus's alerting, you can set rules to catch latency issues in specific services. Proactive alerts allow the team to address potential problems before users notice. 55 | 56 | ## Prometheus Architecture: In K8S 57 | 58 |
59 | Image 60 |
61 | 62 | 1. **Prometheus Server:** 63 | 64 | - Runs as a dedicated Pod in the Kubernetes cluster. 65 | - Scrapes and collects metrics from configured endpoints or services. 66 | - Utilizes Kubernetes ServiceMonitors or service discovery for dynamic service monitoring. 67 | 68 | 2. **Time-Series Database (TSDB):** 69 | 70 | - Serves as the repository for time-series data collected by Prometheus. 71 | - Configurable retention policies for efficient data storage. 72 | - Can use persistent volumes for data storage across Prometheus restarts. 73 | 74 | 3. **Alertmanager:** 75 | 76 | - Often deployed as a separate Pod alongside Prometheus. 77 | - Manages and dispatches alerts based on predefined rules and conditions. 78 | - Receives alerts from Prometheus for forwarding to various channels. 79 | 80 | 4. **Exporters:** 81 | 82 | - Agents or sidecar containers exposing metrics from Kubernetes pods or services. 83 | - Types include Node Exporter, kube-state-metrics, and others for collecting specific metrics. 84 | 85 | 5. **Service Discovery:** 86 | 87 | - Kubernetes ServiceMonitors facilitate automatic service discovery and monitoring based on labels. 88 | 89 | 6. **Grafana Integration:** 90 | - Used with Prometheus for advanced metric visualization. 91 | - Offers pre-configured dashboards for rich and customizable visual representations. 92 | 93 | ## Deploying Prometheus on Kubernetes 94 | 95 | To set up Prometheus and its related components on your Kubernetes cluster, follow these steps: 96 | 97 | ### Create a Namespace for Monitoring 98 | 99 | ```bash 100 | kubectl create namespace monitoring 101 | ``` 102 | 103 | ### Add Helm Repository 104 | 105 | ```bash 106 | helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 107 | helm repo update 108 | ``` 109 | 110 | ### Install kube-prometheus-stack Helm Chart in monitoring Namespace 111 | 112 | ```bash 113 | helm install prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring 114 | ``` 115 | 116 | ### Verify Deployment 117 | 118 | Wait for the deployment to complete, and then check the status: 119 | 120 | ```bash 121 | kubectl get pods -n monitoring 122 | ``` 123 | 124 | ### Access Prometheus Dashboard 125 | 126 | ```bash 127 | kubectl port-forward svc/prometheus-stack-prometheus -n monitoring 9090:9090 128 | ``` 129 | 130 | Open your web browser and navigate to **`http://localhost:9090`** to access the **Prometheus dashboard.** 131 | 132 |
133 | Image 134 |
135 | 136 | 137 | Remember to keep the port-forwarding terminal open as long as you need to access the dashboard. 138 | 139 | ### Access Grafana Dashboard 140 | 141 | Use the following command to port forward to the Grafana service: 142 | 143 | ```bash 144 | kubectl port-forward svc/prometheus-stack-grafana -n monitoring 8080:80 145 | ``` 146 | 147 | Open your web browser and navigate to **`http://localhost:8080.`** 148 | 149 |
150 | Grafana Security Login Authentication 151 |
152 | 153 | 154 | ### Login with the default credentials: 155 | 156 | **Username:** admin 157 | **Password:** (Retrieve the password using the following command): 158 | 159 | ```bash 160 | kubectl get secret prometheus-stack-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 --decode ; echo 161 | ``` 162 | Understand the grafana UI by yourself. The following resources can be helpful. 163 | 164 | 1. [Grafana Documentation](https://grafana.com/docs/) 165 | 2. [ YOUTUBE LINK: Grafana Setup & Simple Dashboard ](https://www.youtube.com/watch?v=EGgtJUjky8w) 166 | 167 | ## Begin Your Monitoring Journey! 🚀 168 | 169 | Start exploring system observability with Prometheus and Grafana. Learn from the [Grafana Documentation](https://grafana.com/docs/), set up Prometheus easily on Kubernetes, and join active communities. Whether you're experienced or new, keep learning to master these tools. Improve your systems and enjoy monitoring!📊👀😊 170 | 171 | -------------------------------------------------------------------------------- /promql-img/counter_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/promql-img/counter_example.png -------------------------------------------------------------------------------- /promql-img/gauge_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/promql-img/gauge_example.png -------------------------------------------------------------------------------- /promql-img/heatmap_histogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/promql-img/heatmap_histogram.png --------------------------------------------------------------------------------