├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── IMG ├── 1.png ├── 1.webp ├── Prometheus-metaimage.png ├── grafana-security-login-authentication.png ├── graphic-3-.png └── img.png ├── Prometheus-lab ├── README.md └── k8s-yaml │ ├── Alertmanagerconfig.yaml │ ├── Deployment.yaml │ ├── PrometheusRule.yaml │ ├── Service-monitor.yaml │ └── Service.yaml ├── README.md ├── promQl.md ├── prometheus_setup.md └── promql-img ├── counter_example.png ├── gauge_example.png └── heatmap_histogram.png /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # **Contributor Covenant Code of Conduct** 2 | 3 | ## **Our Pledge** 4 | 5 | We as members, contributors, and leaders pledge to make participation in our project and community a **harassment-free experience** for everyone, regardless of age, body size, disability, ethnicity, gender identity, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. 6 | 7 | We pledge to act and interact in ways that contribute to an **open, welcoming, diverse, inclusive, and healthy** community. 8 | 9 | ## **Our Standards** 10 | 11 | > [!IMPORTANT] 12 | > 13 | > **Examples of behavior that contributes to a positive environment include:** 14 | > 15 | > - Being respectful and inclusive to others 16 | > - Using welcoming and inclusive language 17 | > - Giving and gracefully accepting constructive feedback 18 | > - Showing empathy towards other community members 19 | > 20 | > **Examples of unacceptable behavior include:** 21 | > 22 | > - Harassment, intimidation, or discrimination in any form 23 | > - Publishing private information of others without consent 24 | > - Use of inappropriate language, insults, or derogatory comments 25 | > - Trolling, personal attacks, or political/religious discussions 26 | 27 | ## **Our Responsibilities** 28 | 29 | > [!NOTE] 30 | > 31 | > Project maintainers are responsible for **clarifying and enforcing** the standards of acceptable behavior. They have the right to **remove, edit, or reject** comments, commits, code, issues, and other contributions that do not align with this Code of Conduct. 32 | 33 | ## **Enforcement** 34 | 35 | > [!CAUTION] 36 | > 37 | > Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at [your-email@example.com]. 38 | > Maintainers will review and investigate complaints and take appropriate action. 39 | 40 | ## **Attribution** 41 | 42 | This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), version 2.1. 43 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # 📜 **CONTRIBUTING.md** 2 | 3 | Thank you for considering contributing to **Learning Prometheus**! 🚀 4 | Your contributions are **highly appreciated** and help make this project better for everyone. 5 | 6 | --- 7 | 8 | ## 🛠 **How to Contribute?** 9 | 10 | ### 📌 **1. Fork the Repository** 11 | 12 | Click the **Fork** button on the top-right corner of this repository to create your own copy. 13 | 14 | ### 📌 **2. Clone Your Fork** 15 | 16 | Open your terminal and run: 17 | 18 | ```bash 19 | git clone https://github.com/your-username/Learning-Prometheus.git 20 | ``` 21 | 22 | Replace `your-username` with your GitHub username. 23 | 24 | ### 📌 **3. 
Navigate to the Project Directory** 25 | 26 | ```bash 27 | cd Learning-Prometheus 28 | ``` 29 | 30 | ### 📌 **4. Create a New Branch** 31 | 32 | Before making changes, create a new branch: 33 | 34 | ```bash 35 | git checkout -b feature-branch 36 | ``` 37 | 38 | Replace `feature-branch` with a relevant branch name. 39 | 40 | ### 📌 **5. Make Your Changes** 41 | 42 | Modify the code, update documentation, or add new Prometheus configurations. 43 | 44 | ### 📌 **6. Commit Your Changes** 45 | 46 | Follow best practices for writing meaningful commit messages: 47 | 48 | ```bash 49 | git commit -m "✨ Added PromQL query examples for better monitoring" 50 | ``` 51 | 52 | - Use **present-tense verbs** (e.g., "Add" instead of "Added") 53 | - Keep commit messages **concise yet descriptive** 54 | 55 | ### 📌 **7. Push the Changes** 56 | 57 | ```bash 58 | git push origin feature-branch 59 | ``` 60 | 61 | ### 📌 **8. Open a Pull Request (PR)** 62 | 63 | - Navigate to the **original repository** (NotHarshhaa/Learning-Prometheus). 64 | - Click on **New Pull Request**. 65 | - Select your forked repository and the branch you worked on. 66 | - Write a **clear PR description** explaining your changes. 67 | - Submit the PR for review. 🎉 68 | 69 | --- 70 | 71 | ## 📝 **Contribution Guidelines** 72 | 73 | ✔️ **Follow proper code formatting and best practices.** 74 | ✔️ **Add meaningful commit messages.** 75 | ✔️ **Write detailed PR descriptions explaining the changes.** 76 | ✔️ **Ensure new contributions are well-documented.** 77 | ✔️ **Test your code before submitting a PR.** 78 | ✔️ **For documentation updates, maintain proper formatting and clarity.** 79 | 80 | --- 81 | 82 | ## 🐞 **Reporting Issues** 83 | 84 | If you find a bug, have a feature request, or want to suggest an improvement: 85 | 86 | 1. **Check existing issues** to avoid duplicates. 87 | 2. Open a **new issue** with: 88 | - A **descriptive title** 89 | - Steps to **reproduce the bug** 90 | - Expected vs. actual behavior 91 | - Any **screenshots or logs** (if applicable) 92 | 93 | --- 94 | 95 | ## 🌎 **Community & Support** 96 | 97 | 👥 **Join our discussion and ask questions in our Telegram community:** 98 | 📢 [Join Telegram](https://t.me/prodevopsguy) 99 | 100 | 💡 **Follow me on GitHub for more DevOps content:** 101 | ⭐ [GitHub Profile](https://github.com/NotHarshhaa) 102 | 103 | --- 104 | 105 | ## 🙌**Acknowledgments** 106 | 107 | This project is maintained by **[Harshhaa](https://github.com/NotHarshhaa)**. 108 | Thank you for being part of the **DevOps & Prometheus** community! 💙 109 | 110 | --- 111 | 112 | ### ✅ **Now you are ready to contribute! Happy coding! 
🚀** 113 | -------------------------------------------------------------------------------- /IMG/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/1.png -------------------------------------------------------------------------------- /IMG/1.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/1.webp -------------------------------------------------------------------------------- /IMG/Prometheus-metaimage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/Prometheus-metaimage.png -------------------------------------------------------------------------------- /IMG/grafana-security-login-authentication.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/grafana-security-login-authentication.png -------------------------------------------------------------------------------- /IMG/graphic-3-.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/graphic-3-.png -------------------------------------------------------------------------------- /IMG/img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/IMG/img.png -------------------------------------------------------------------------------- /Prometheus-lab/README.md: -------------------------------------------------------------------------------- 1 | - [Real-World Prometheus Deployment: A Practical Guide for Kubernetes Monitoring](#real-world-prometheus-deployment-a-practical-guide-for-kubernetes-monitoring) 2 | - [Aim of the Project](#aim-of-the-project) 3 | - [Project Architecture](#project-architecture) 4 | - [Prerequisites](#prerequisites) 5 | - [Summary of What We Achieved](#summary-of-what-we-achieved) 6 | - [Understanding Kubernetes Resources](#understanding-kubernetes-resources) 7 | - [Deployment](#deployment) 8 | - [API Version and Kind](#api-version-and-kind) 9 | - [Metadata](#metadata) 10 | - [Specification (`spec`)](#specification-spec) 11 | - [Selector](#selector) 12 | - [Template](#template) 13 | - [Pod Specification (`spec` inside the template)](#pod-specification-spec-inside-the-template) 14 | - [Services](#services) 15 | - [API Version and Kind](#api-version-and-kind-1) 16 | - [Metadata](#metadata-1) 17 | - [Specification (`spec`)](#specification-spec-1) 18 | - [ServiceMonitor](#servicemonitor) 19 | - [API Version and Kind](#api-version-and-kind-2) 20 | - [Metadata](#metadata-2) 21 | - [Specification (`spec`)](#specification-spec-2) 22 | - [PrometheusRules](#prometheusrules) 23 | - [API Version and Kind](#api-version-and-kind-3) 24 | - [Metadata](#metadata-3) 25 | - [Specification (`spec`)](#specification-spec-3) 26 | - [AlertmanagerConfig](#alertmanagerconfig) 27 | - [API Version and Kind](#api-version-and-kind-4) 28 | - [Metadata](#metadata-4) 29 | - [Specification 
(`spec`)](#specification-spec-4) 30 | - [Author & Community](#author--community) 31 | 32 | --- 33 | 34 | # **Real-World Prometheus Deployment: A Practical Guide for Kubernetes Monitoring** 35 | 36 | ## **Aim of the Project** 37 | 38 | The primary goal of this **Prometheus Lab** project is to provide **hands-on experience** in setting up a **Prometheus monitoring system** on a **Kubernetes cluster**. 39 | 40 | By following this guide, you will: 41 | ✅ Deploy **Prometheus** for real-time monitoring. 42 | ✅ Understand **Kubernetes monitoring architecture**. 43 | ✅ Set up **Grafana** for data visualization. 44 | ✅ Configure **Alertmanager** for proactive notifications. 45 | 46 | --- 47 | 48 | ## **Project Architecture** 49 | 50 | Below is a high-level architecture of the Prometheus monitoring setup: 51 | 52 | ![](../IMG/graphic-3-.png) 53 | 54 | --- 55 | 56 | ## **Prerequisites** 57 | 58 | Before we begin, ensure you have the following tools installed: 59 | 60 | - **`kubectl`** → To interact with the Kubernetes cluster. 61 | - **`Helm`** → For deploying Prometheus using Helm charts. 62 | - **`k3d`** → A lightweight Kubernetes distribution for local testing. 63 | 64 | --- 65 | 66 | ## **📌 Step 1: Install `k3d` (Lightweight Kubernetes)** 67 | 68 | To create a **local Kubernetes cluster**, install `k3d` with: 69 | 70 | ```bash 71 | curl -s https://raw.githubusercontent.com/rancher/k3d/main/install.sh | bash 72 | ``` 73 | 74 | Verify the installation: 75 | 76 | ```bash 77 | k3d --version 78 | ``` 79 | 80 | --- 81 | 82 | ## **📌 Step 2: Clone the GitHub Repository** 83 | 84 | All the necessary **YAML manifests** and configurations can be found in my GitHub repository: 85 | 86 | 🔗 **GitHub Repo:** 87 | 88 | ```text 89 | https://github.com/panchanandevops/Learning-Prometheus.git 90 | ``` 91 | 92 | Clone the repository for easy access: 93 | 94 | ```bash 95 | git clone https://github.com/panchanandevops/Learning-Prometheus.git 96 | cd Learning-Prometheus 97 | ``` 98 | 99 | --- 100 | 101 | ## **📌 Step 3: Create a Namespace for Monitoring** 102 | 103 | All monitoring components should be deployed in a dedicated **namespace**. 104 | 105 | ```bash 106 | kubectl create namespace monitoring 107 | ``` 108 | 109 | Verify the namespace: 110 | 111 | ```bash 112 | kubectl get namespaces 113 | ``` 114 | 115 | --- 116 | 117 | ## **📌 Step 4: Add the Prometheus Helm Repository** 118 | 119 | We will use Helm to deploy Prometheus and related components. 120 | 121 | ```bash 122 | helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 123 | helm repo update 124 | ``` 125 | 126 | This ensures we fetch the **latest** chart versions. 127 | 128 | --- 129 | 130 | ## **📌 Step 5: Store Default Helm Values** 131 | 132 | Before installing Prometheus, save the default configuration for customization. 133 | 134 | ```bash 135 | helm show values prometheus-community/kube-prometheus-stack > values.yaml 136 | ``` 137 | 138 | This file (`values.yaml`) contains settings for Prometheus, Grafana, and Alertmanager. 
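As a small, hedged illustration, a trimmed `values.yaml` override might adjust only a couple of settings before installation (the exact key paths can vary between chart versions, so verify them against the file you just generated):

```yaml
# Illustrative override for kube-prometheus-stack (verify key paths in your generated values.yaml)
prometheus:
  prometheusSpec:
    retention: 10d            # keep metrics for 10 days instead of the chart default
grafana:
  adminPassword: "changeme"   # placeholder; set a strong password or reference an existing secret
```

You can then pass the customized file during Step 6 with `helm install prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`.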
139 | 140 | --- 141 | 142 | ## **📌 Step 6: Install Prometheus Stack using Helm** 143 | 144 | Now, install the **kube-prometheus-stack** Helm chart in the `monitoring` namespace: 145 | 146 | ```bash 147 | helm install prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring 148 | ``` 149 | 150 | 🚀 **This deploys:** 151 | 152 | - **Prometheus** (for metrics collection) 153 | - **Grafana** (for visualization) 154 | - **Alertmanager** (for alert handling) 155 | 156 | --- 157 | 158 | ## **📌 Step 7: Verify the Deployment** 159 | 160 | Check if the monitoring components are running: 161 | 162 | ```bash 163 | kubectl get pods -n monitoring 164 | ``` 165 | 166 | You should see multiple pods for **Prometheus, Grafana, and Alertmanager** in a `Running` state. 167 | 168 | --- 169 | 170 | ## **📌 Step 8: Access the Prometheus Dashboard** 171 | 172 | Prometheus exposes metrics and allows querying via its web UI. 173 | 174 | To access it locally, run: 175 | 176 | ```bash 177 | kubectl port-forward svc/prometheus-stack-prometheus -n monitoring 9090:9090 178 | ``` 179 | 180 | Now, open **[http://localhost:9090](http://localhost:9090)** in your browser. 181 | 182 | ![](../IMG/1.png) 183 | 184 | --- 185 | 186 | ## **📌 Step 9: Access the Grafana Dashboard** 187 | 188 | Grafana provides a beautiful UI to visualize the metrics collected by Prometheus. 189 | 190 | To access it locally, run: 191 | 192 | ```bash 193 | kubectl port-forward svc/prometheus-stack-grafana -n monitoring 8080:80 194 | ``` 195 | 196 | Now, open **[http://localhost:8080](http://localhost:8080)** in your browser. 197 | 198 | ![](../IMG/grafana-security-login-authentication.png) 199 | 200 | --- 201 | 202 | ## **📌 Step 10: Login to Grafana** 203 | 204 | Grafana uses **default credentials**: 205 | 206 | - **Username:** `admin` 207 | - **Password:** Retrieve the password using: 208 | 209 | ```bash 210 | kubectl get secret prometheus-stack-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 --decode ; echo 211 | ``` 212 | 213 | - Copy the password and **log in** to Grafana. 214 | 215 | --- 216 | 217 | ## **📌 Step 11: Configure `values.yaml` for AlertmanagerConfig** 218 | 219 | By default, Prometheus does not automatically pick up **AlertmanagerConfig** CRDs. 220 | 221 | To enable it, **edit `values.yaml`** and search for `alertmanagerConfigSelector`. 222 | 223 | Replace that section with: 224 | 225 | ```yaml 226 | alertmanagerConfigSelector: 227 | matchLabels: 228 | release: prometheus 229 | ``` 230 | 231 | This ensures **custom alerting rules** are applied. 232 | 233 | --- 234 | 235 | ## **📌 Step 12: Apply Kubernetes YAML Manifests** 236 | 237 | Once the setup is complete, apply all the necessary Kubernetes resources: 238 | 239 | ```bash 240 | kubectl apply -f /k8s-yaml/ 241 | ``` 242 | 243 | This will configure: 244 | ✅ **ServiceMonitor** (for scraping custom metrics). 245 | ✅ **PrometheusRules** (for setting up alert conditions). 246 | ✅ **AlertmanagerConfig** (for sending notifications). 
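As a quick sanity check (a hedged sketch — the resource kinds come from the Prometheus Operator CRDs installed by the chart, and the object names come from the manifests in `k8s-yaml/`), confirm the custom resources were created:

```bash
# List the Prometheus Operator custom resources across all namespaces
kubectl get servicemonitors,prometheusrules,alertmanagerconfigs -A
```

If the objects show up and carry the `release: prometheus` label, the Operator should discover them on its next reconciliation.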
247 | 248 | --- 249 | 250 | ## **Summary of What We Achieved** 251 | 252 | | **Step** | **Action** | 253 | |----------|------------| 254 | | 🛠 **Setup** | Installed `k3d`, `kubectl`, `Helm` | 255 | | 📥 **Downloaded** | Cloned GitHub repo | 256 | | 📦 **Created Namespace** | `monitoring` | 257 | | 🔹 **Added Helm Repo** | `prometheus-community` | 258 | | 📜 **Saved Config** | Stored `values.yaml` | 259 | | 🚀 **Deployed Stack** | Installed `kube-prometheus-stack` | 260 | | 📊 **Accessed Dashboards** | Prometheus & Grafana | 261 | | ⚙ **Configured Alerts** | Modified `values.yaml` | 262 | | 📌 **Applied Manifests** | `kubectl apply -f k8s-yaml/` | 263 | 264 | 🚀 **Now you have a fully functional Prometheus monitoring setup on Kubernetes!** 265 | 266 | --- 267 | 268 | # **Understanding Kubernetes Resources** 269 | 270 | ## **Deployment** 271 | 272 | A **Deployment** in Kubernetes is used to ensure that a set of identical **pods** (containers running your application) are always running. It allows you to easily scale your application, update it without downtime, and recover from failures. 273 | 274 | Now, let’s break down the **Deployment YAML file** step by step. 275 | 276 | --- 277 | 278 | ### **API Version and Kind** 279 | 280 | ```yaml 281 | apiVersion: apps/v1 282 | kind: Deployment 283 | ``` 284 | 285 | - `apiVersion: apps/v1` → This specifies the **API version** of Kubernetes being used. 286 | - `kind: Deployment` → This tells Kubernetes that we are creating a **Deployment** resource. 287 | 288 | A **Deployment** helps in managing a set of pods by ensuring they stay available and can be updated smoothly. 289 | 290 | --- 291 | 292 | ### **Metadata** 293 | 294 | ```yaml 295 | metadata: 296 | name: my-deployment 297 | labels: 298 | app: api 299 | ``` 300 | 301 | - `name: my-deployment` → The name of the Deployment. It must be unique within the namespace. 302 | - `labels:` 303 | - `app: api` → A label assigned to this Deployment. Labels help **group, filter, and identify** Kubernetes resources. 304 | 305 | 📌 **Why labels?** 306 | Labels allow you to easily select and manage Kubernetes objects. For example, we can list all pods that belong to this Deployment using: 307 | 308 | ```bash 309 | kubectl get pods -l app=api 310 | ``` 311 | 312 | --- 313 | 314 | ### **Specification (`spec`)** 315 | 316 | This section defines **how** the Deployment should behave. 317 | 318 | --- 319 | 320 | #### **Selector** 321 | 322 | The **selector** tells Kubernetes **which pods** this Deployment should manage. 323 | 324 | ```yaml 325 | selector: 326 | matchLabels: 327 | app: api 328 | ``` 329 | 330 | - This means the Deployment will look for pods with the label **`app: api`**. 331 | - Only these labeled pods will be controlled by this Deployment. 332 | 333 | 📌 **Why is this needed?** 334 | Because a Kubernetes cluster may have **many** Deployments and pods, the **selector** ensures that only the right pods are managed by this Deployment. 335 | 336 | --- 337 | 338 | #### **Template** 339 | 340 | The **template** defines how new pods should be created when the Deployment starts or scales up. 341 | 342 | ```yaml 343 | template: 344 | metadata: 345 | labels: 346 | app: api 347 | spec: 348 | ``` 349 | 350 | - `metadata:` 351 | - `labels: app: api` → The pod will have the same label as the Deployment. 352 | - `spec:` → This is where we define what should **run inside the pod** (the containers). 
353 | 354 | 📌 **Why is this needed?** 355 | When Kubernetes creates new pods using this Deployment, it **ensures** that every pod gets the same labels and configurations. 356 | 357 | --- 358 | 359 | #### **Pod Specification (`spec` inside the template)** 360 | 361 | Now, let’s define the actual **container** that runs inside the pod. 362 | 363 | ```yaml 364 | containers: 365 | - name: my-container 366 | image: panchanandevops/myexpress:v0.1.0 367 | resources: 368 | limits: 369 | memory: "128Mi" 370 | cpu: "500m" 371 | ports: 372 | - containerPort: 3000 373 | ``` 374 | 375 | - **`containers:`** → A pod can run **one or more** containers. In this case, we have one container named **`my-container`**. 376 | - **`image: panchanandevops/myexpress:v0.1.0`** → This is the **Docker image** that will be pulled from Docker Hub or a private registry. 377 | - **Resource Limits (`resources:`)** 378 | - `memory: "128Mi"` → The container can use a maximum of **128 MiB of RAM**. 379 | - `cpu: "500m"` → The container can use up to **0.5 CPU cores** (500 milliCPU). 380 | - **Port (`ports:`)** 381 | - `containerPort: 3000` → This means the container **listens** for requests on port `3000`. 382 | 383 | 📌 **Why are resource limits important?** 384 | Setting **resource limits** prevents a single container from consuming all system resources, ensuring **fair resource distribution** among all containers in the cluster. 385 | 386 | --- 387 | 388 | ### **Services** 389 | 390 | A **Service** in Kubernetes is used to expose a set of pods as a **network service**. Even if pods are created and destroyed, the Service ensures that requests always reach the correct backend pods. 391 | 392 | Now, let’s break down the **Service YAML file** step by step. 393 | 394 | --- 395 | 396 | ### **API Version and Kind** 397 | 398 | ```yaml 399 | apiVersion: v1 400 | kind: Service 401 | ``` 402 | 403 | - `apiVersion: v1` → Specifies the API version used for defining the Service. 404 | - `kind: Service` → Tells Kubernetes that we are creating a **Service** resource. 405 | 406 | 📌 **Why do we need a Service?** 407 | Pods have **dynamic IP addresses**, which means their IPs can change when they restart. A **Service provides a stable IP and DNS name** to ensure that traffic always reaches the right pods, even if they get recreated. 408 | 409 | --- 410 | 411 | ### **Metadata** 412 | 413 | ```yaml 414 | metadata: 415 | name: my-service 416 | labels: 417 | job: node-api 418 | app: api 419 | ``` 420 | 421 | - `name: my-service` → The unique name of the Service. 422 | - `labels:` 423 | - `job: node-api` → This label is used to categorize the Service. 424 | - `app: api` → This label helps in grouping and managing resources. 425 | 426 | 📌 **Why use labels?** 427 | Labels help in organizing and selecting Kubernetes resources. For example, we can find all services related to `app: api` using: 428 | 429 | ```bash 430 | kubectl get services -l app=api 431 | ``` 432 | 433 | --- 434 | 435 | ### **Specification (`spec`)** 436 | 437 | The **spec** defines how the Service will behave. 438 | 439 | ```yaml 440 | spec: 441 | type: ClusterIP 442 | selector: 443 | app: api 444 | ports: 445 | - name: web 446 | protocol: TCP 447 | port: 3000 448 | targetPort: 3000 449 | ``` 450 | 451 | --- 452 | 453 | ### **Breaking Down the Service Spec** 454 | 455 | #### **1️⃣ Service Type (`type`)** 456 | 457 | ```yaml 458 | type: ClusterIP 459 | ``` 460 | 461 | - **ClusterIP (default)** → Makes the Service accessible **only within the cluster**. 
462 | - Other types of Services: 463 | - **NodePort** → Exposes the Service on a port on each node. 464 | - **LoadBalancer** → Provides an external IP via a cloud provider's load balancer. 465 | - **ExternalName** → Maps the Service to an external DNS name. 466 | 467 | 📌 **Why use ClusterIP?** 468 | If the Service is meant for **internal communication** (e.g., backend APIs talking to each other), ClusterIP is the best choice. 469 | 470 | --- 471 | 472 | #### **2️⃣ Selector (`selector`)** 473 | 474 | ```yaml 475 | selector: 476 | app: api 477 | ``` 478 | 479 | - The **selector** ensures that this Service sends traffic to pods that have the label `app: api`. 480 | - Only these pods will receive requests from this Service. 481 | 482 | 📌 **Why is this needed?** 483 | Because Kubernetes may have **many pods**, we need a way to **match** the correct ones for the Service to route traffic. 484 | 485 | --- 486 | 487 | #### **3️⃣ Ports (`ports`)** 488 | 489 | ```yaml 490 | ports: 491 | - name: web 492 | protocol: TCP 493 | port: 3000 494 | targetPort: 3000 495 | ``` 496 | 497 | - `name: web` → A **name** for the port (useful for debugging and monitoring). 498 | - `protocol: TCP` → Specifies the network protocol used (default is TCP). 499 | - `port: 3000` → The port **exposed by the Service** (used by other services to connect). 500 | - `targetPort: 3000` → The **port inside the pod** where the application is running. 501 | 502 | 📌 **Why is `targetPort` needed?** 503 | Sometimes, the Service port (`port: 3000`) and the pod's container port (`targetPort: 3000`) **can be different**. Kubernetes maps the incoming request from the Service port to the correct container port. 504 | 505 | --- 506 | 507 | ## **ServiceMonitor** 508 | 509 | A **ServiceMonitor** is a **custom resource** used by Prometheus **to discover and scrape metrics from Kubernetes services**. Instead of manually configuring Prometheus to collect metrics from services, we define a **ServiceMonitor** that dynamically discovers the right endpoints. 510 | 511 | Now, let’s break down the **ServiceMonitor YAML file** step by step. 512 | 513 | --- 514 | 515 | ### **API Version and Kind** 516 | 517 | ```yaml 518 | apiVersion: monitoring.coreos.com/v1 519 | kind: ServiceMonitor 520 | ``` 521 | 522 | - `apiVersion: monitoring.coreos.com/v1` → This API is specific to **Prometheus Operator**, which manages monitoring in Kubernetes. 523 | - `kind: ServiceMonitor` → Defines this resource as a **ServiceMonitor**, which tells Prometheus where to collect metrics from. 524 | 525 | 📌 **Why do we need a ServiceMonitor?** 526 | Prometheus does not automatically know which services to monitor. A **ServiceMonitor automatically finds and scrapes metrics from matching services** in Kubernetes. 527 | 528 | --- 529 | 530 | ### **Metadata** 531 | 532 | ```yaml 533 | metadata: 534 | name: api-service-monitor 535 | labels: 536 | release: prometheus 537 | app: prometheus 538 | ``` 539 | 540 | - `name: api-service-monitor` → The name of the ServiceMonitor. 541 | - `labels:` 542 | - `release: prometheus` → Associates this monitor with a specific Prometheus instance. 543 | - `app: prometheus` → Indicates that this ServiceMonitor is part of the Prometheus monitoring setup. 544 | 545 | 📌 **Why use labels here?** 546 | 547 | - Prometheus Operator uses labels to **discover ServiceMonitors** that it should scrape. 548 | - If the Prometheus instance is deployed with `release: prometheus`, it will only pick up ServiceMonitors with the **same label**. 
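If you are unsure which labels your Prometheus instance expects, one hedged way to check (assuming the `monitoring` namespace and the Helm release from the earlier steps, and that only one Prometheus object exists) is to read the selector directly from the Prometheus custom resource:

```bash
# Show which ServiceMonitor labels this Prometheus instance will match
kubectl -n monitoring get prometheus -o jsonpath='{.items[0].spec.serviceMonitorSelector}'; echo
```

Whatever `matchLabels` this prints are the labels your ServiceMonitor must carry.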
549 | 550 | --- 551 | 552 | ### **Specification (`spec`)** 553 | 554 | The **spec** defines how Prometheus should scrape metrics from services. 555 | 556 | ```yaml 557 | spec: 558 | jobLabel: job 559 | selector: 560 | matchLabels: 561 | app: api 562 | endpoints: 563 | - port: web 564 | path: /swagger-stats/metrics 565 | ``` 566 | 567 | --- 568 | 569 | ### **Breaking Down the ServiceMonitor Spec** 570 | 571 | #### **1️⃣ Job Label (`jobLabel`)** 572 | 573 | ```yaml 574 | jobLabel: job 575 | ``` 576 | 577 | - Specifies that the **job name** for Prometheus should be taken from the `job` label in the Service. 578 | 579 | 📌 **Why is this needed?** 580 | In Prometheus, each **scraped target** (like a service) is associated with a **job name**. This helps in **grouping metrics** for easier analysis. 581 | 582 | --- 583 | 584 | #### **2️⃣ Selector (`selector`)** 585 | 586 | ```yaml 587 | selector: 588 | matchLabels: 589 | app: api 590 | ``` 591 | 592 | - The **selector** ensures that the ServiceMonitor only scrapes **services** that have the label `app: api`. 593 | - It **filters** which services should be monitored. 594 | 595 | 📌 **How does this work?** 596 | If we have a **Kubernetes Service** defined like this: 597 | 598 | ```yaml 599 | metadata: 600 | labels: 601 | app: api 602 | ``` 603 | 604 | Then, the ServiceMonitor will find this service and scrape its metrics. 605 | 606 | --- 607 | 608 | #### **3️⃣ Endpoints (`endpoints`)** 609 | 610 | ```yaml 611 | endpoints: 612 | - port: web 613 | path: /swagger-stats/metrics 614 | ``` 615 | 616 | - `port: web` → Specifies **which port** of the service should be used for scraping metrics. 617 | - `path: /swagger-stats/metrics` → Specifies the **URL path** where Prometheus can fetch metrics. 618 | 619 | 📌 **Why do we need `endpoints`?** 620 | A service may have multiple ports, but only **one of them exposes Prometheus metrics**. This section tells Prometheus **exactly where to look**. 621 | 622 | --- 623 | 624 | ## **PrometheusRules** 625 | 626 | A **PrometheusRule** is a **custom resource** that defines **alerting and recording rules** for Prometheus. It helps in setting up automated **alerts** based on specific conditions in your metrics. 627 | 628 | Now, let’s break down the **PrometheusRules YAML file** step by step. 629 | 630 | --- 631 | 632 | ### **API Version and Kind** 633 | 634 | ```yaml 635 | apiVersion: monitoring.coreos.com/v1 636 | kind: PrometheusRule 637 | ``` 638 | 639 | - `apiVersion: monitoring.coreos.com/v1` → This API is part of the **Prometheus Operator**, which manages monitoring rules in Kubernetes. 640 | - `kind: PrometheusRule` → Defines this resource as a **PrometheusRule**, which contains alerting and recording rules. 641 | 642 | 📌 **Why do we need PrometheusRules?** 643 | 644 | - To **automatically trigger alerts** when specific conditions are met. 645 | - To **define recording rules** that precompute expensive queries for better performance. 646 | 647 | --- 648 | 649 | ### **Metadata** 650 | 651 | ```yaml 652 | metadata: 653 | name: api-prometheus-rule 654 | labels: 655 | release: prometheus 656 | ``` 657 | 658 | - `name: api-prometheus-rule` → The name of this rule set. 659 | - `labels:` 660 | - `release: prometheus` → Associates this rule with the Prometheus instance. 661 | 662 | 📌 **Why use labels here?** 663 | Prometheus Operator looks for **PrometheusRule** objects that match a Prometheus instance using **labels**. If your Prometheus setup is using `release: prometheus`, it will only load rules with the **same label**. 
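To verify that a PrometheusRule was actually loaded (a hedged check, assuming the port-forward from Step 8 is still running and `jq` is installed), you can query Prometheus' rules API:

```bash
# List the names of all rule groups currently loaded by Prometheus
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
```

If your group name appears in the output, the rule was discovered and is being evaluated.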
664 | 665 | --- 666 | 667 | ### **Specification (`spec`):** 668 | 669 | The `spec` section defines **alerting rules** that tell Prometheus when to trigger alerts. 670 | 671 | ```yaml 672 | spec: 673 | groups: 674 | - name: api 675 | rules: 676 | - alert: down 677 | expr: up == 0 678 | for: 0m 679 | labels: 680 | severity: Critical 681 | annotations: 682 | summary: Prometheus target missing {{$labels.instance}} 683 | ``` 684 | 685 | --- 686 | 687 | ### **Breaking Down the PrometheusRules Spec** 688 | 689 | #### **1️⃣ Groups (`groups`)** 690 | 691 | ```yaml 692 | groups: 693 | - name: api 694 | ``` 695 | 696 | - A **group** is a collection of rules. 697 | - `name: api` → The **name of the rule group**, which helps in organizing alerts. 698 | 699 | 📌 **Why use rule groups?** 700 | 701 | - **Groups allow efficient rule evaluation** by executing all rules in the same group at once. 702 | - Helps in **categorizing** rules based on different services (e.g., `api`, `database`, `network`). 703 | 704 | --- 705 | 706 | #### **2️⃣ Rules (`rules`)** 707 | 708 | Each group contains **one or more rules** that define alerting conditions. 709 | 710 | ```yaml 711 | rules: 712 | - alert: down 713 | expr: up == 0 714 | for: 0m 715 | labels: 716 | severity: Critical 717 | annotations: 718 | summary: Prometheus target missing {{$labels.instance}} 719 | ``` 720 | 721 | ##### **🔹 Alert Name** 722 | 723 | ```yaml 724 | - alert: down 725 | ``` 726 | 727 | - `alert: down` → This is the **name of the alert** that will be triggered when the condition is met. 728 | 729 | --- 730 | 731 | ##### **🔹 Alert Expression (`expr`)** 732 | 733 | ```yaml 734 | expr: up == 0 735 | ``` 736 | 737 | - The **expression** determines when the alert should trigger. 738 | - `up == 0` → The **`up`** metric in Prometheus indicates whether a target is reachable. 739 | - `1` → The target is **healthy**. 740 | - `0` → The target is **down**. 741 | - This rule **triggers an alert if the target is down**. 742 | 743 | 📌 **How does this work?** 744 | If an application crashes or a service becomes unreachable, the **`up` metric becomes 0**, and this alert fires. 🚨 745 | 746 | --- 747 | 748 | ##### **🔹 Alert Duration (`for`)** 749 | 750 | ```yaml 751 | for: 0m 752 | ``` 753 | 754 | - Specifies how long the condition must hold **before triggering the alert**. 755 | - `0m` → The alert fires **immediately** when the condition is met. 756 | - You can set a delay (e.g., `5m`) to prevent **flapping alerts** (temporary issues that resolve quickly). 757 | 758 | 📌 **Example:** 759 | 760 | - `for: 5m` → Only triggers if `up == 0` for **5 continuous minutes**. 761 | - `for: 0m` → Triggers **instantly** when `up == 0`. 762 | 763 | --- 764 | 765 | ##### **🔹 Labels (`labels`)** 766 | 767 | ```yaml 768 | labels: 769 | severity: Critical 770 | ``` 771 | 772 | - Labels **add metadata** to alerts. 773 | - `severity: Critical` → Marks this alert as **Critical**, helping categorize alerts by urgency. 774 | 775 | 📌 **Why use labels?** 776 | 777 | - Allows **grouping and filtering alerts** in monitoring dashboards like Grafana. 778 | - Helps **alert managers** route notifications (e.g., send `Critical` alerts via SMS and `Warning` alerts via email). 779 | 780 | --- 781 | 782 | ##### **🔹 Annotations (`annotations`)** 783 | 784 | ```yaml 785 | annotations: 786 | summary: Prometheus target missing {{$labels.instance}} 787 | ``` 788 | 789 | - **Annotations provide extra information** about the alert. 
790 | - `summary: Prometheus target missing {{$labels.instance}}` 791 | - `{{$labels.instance}}` → Inserts the instance name (e.g., `my-service:9090`). 792 | - The final alert message might look like: 793 | - **"Prometheus target missing my-service:9090"** 794 | 795 | 📌 **Why use annotations?** 796 | 797 | - Helps **provide context** in alerting tools like **Alertmanager, Slack, or PagerDuty**. 798 | - Reduces the need for manual investigation when an alert fires. 799 | 800 | --- 801 | 802 | ## **AlertmanagerConfig** 803 | 804 | An **AlertmanagerConfig** is a **custom resource** that helps define how **Alertmanager** handles alerts sent by **Prometheus**. It **routes alerts** to different receivers (e.g., email, Slack, PagerDuty) based on their **severity, labels, or conditions**. 805 | 806 | Now, let’s break down the **AlertmanagerConfig YAML file** step by step. 807 | 808 | --- 809 | 810 | ### **API Version and Kind** 811 | 812 | ```yaml 813 | apiVersion: monitoring.coreos.com/v1 814 | kind: AlertmanagerConfig 815 | ``` 816 | 817 | - `apiVersion: monitoring.coreos.com/v1` → Uses the **Prometheus Operator API** for managing alerting configurations. 818 | - `kind: AlertmanagerConfig` → Defines this resource as an **AlertmanagerConfig**, which specifies routing rules for alerts. 819 | 820 | 📌 **Why do we need AlertmanagerConfig?** 821 | 822 | - It allows **custom alert routing** to different teams based on alert **severity** or **labels**. 823 | - Helps **avoid alert fatigue** by grouping and delaying alerts instead of spamming notifications. 824 | 825 | --- 826 | 827 | ### **Metadata** 828 | 829 | ```yaml 830 | metadata: 831 | name: alertmanager-config 832 | labels: 833 | release: prometheus 834 | ``` 835 | 836 | - `name: alertmanager-config` → The name of this Alertmanager configuration. 837 | - `labels:` 838 | - `release: prometheus` → Associates this configuration with a **Prometheus** instance. 839 | 840 | 📌 **Why use labels here?** 841 | Labels ensure that **only the correct Prometheus instance** picks up this configuration. 842 | 843 | --- 844 | 845 | ### **Specification (`spec`)**: 846 | 847 | The `spec` section defines **how alerts are grouped, delayed, and routed** to different receivers (e.g., email, Slack, webhook). 848 | 849 | --- 850 | 851 | ### **1️⃣ Route Configuration** 852 | 853 | ```yaml 854 | spec: 855 | route: 856 | groupBy: ["severity"] 857 | groupWait: 30s 858 | groupInterval: 5m 859 | repeatInterval: 12h 860 | receiver: "team-notifications" 861 | ``` 862 | 863 | This section **controls how alerts are grouped and routed** to receivers. 864 | 865 | #### **🔹 `groupBy`** 866 | 867 | ```yaml 868 | groupBy: ["severity"] 869 | ``` 870 | 871 | - Groups alerts based on their **severity** (e.g., Critical, Warning). 872 | - Instead of sending separate alerts for each **Critical** issue, it **bundles them together** into one notification. 873 | 874 | 📌 **Example:** 875 | 876 | - Instead of sending **5 separate Critical alerts**, it sends **1 grouped alert** with all details. 877 | 878 | --- 879 | 880 | #### **🔹 `groupWait`** 881 | 882 | ```yaml 883 | groupWait: 30s 884 | ``` 885 | 886 | - **Waits for 30 seconds** before sending the first alert. 887 | - Helps **reduce noise** by allowing similar alerts to be **grouped** before notifying. 888 | 889 | 📌 **Why use `groupWait`?** 890 | If an application crashes and **multiple alerts** fire at once, it **waits 30 seconds** to group them instead of sending them separately. 
891 | 892 | --- 893 | 894 | #### **🔹 `groupInterval`** 895 | 896 | ```yaml 897 | groupInterval: 5m 898 | ``` 899 | 900 | - After sending the first alert, it waits **5 minutes** before sending another batch of alerts. 901 | - Ensures that alerts for **the same issue** are **not repeatedly sent** in a short period. 902 | 903 | 📌 **Example:** 904 | If 10 alerts fire within 5 minutes, only **one alert is sent** every 5 minutes. 905 | 906 | --- 907 | 908 | #### **🔹 `repeatInterval`** 909 | 910 | ```yaml 911 | repeatInterval: 12h 912 | ``` 913 | 914 | - If an alert **remains active**, it sends **a reminder every 12 hours**. 915 | - Prevents excessive notifications while ensuring unresolved issues are **not ignored**. 916 | 917 | 📌 **Example:** 918 | If an API server is down for 2 days, **reminders are sent every 12 hours** until it's resolved. 919 | 920 | --- 921 | 922 | #### **🔹 `receiver`** 923 | 924 | ```yaml 925 | receiver: "team-notifications" 926 | ``` 927 | 928 | - Defines which **receiver** will handle this alert group. 929 | - Here, alerts are sent to a receiver named **"team-notifications"** (configured below). 930 | 931 | 📌 **Why use receivers?** 932 | It helps direct alerts to the right **team** or **notification method** (e.g., Slack for engineers, PagerDuty for on-call teams). 933 | 934 | --- 935 | 936 | ### **2️⃣ Receiver Configuration** 937 | 938 | ```yaml 939 | receivers: 940 | - name: "team-notifications" 941 | emailConfigs: 942 | - to: "team@example.com" 943 | sendResolved: true 944 | ``` 945 | 946 | This section **defines how alerts are delivered**. 947 | 948 | #### **🔹 Receiver Name** 949 | 950 | ```yaml 951 | - name: "team-notifications" 952 | ``` 953 | 954 | - This **matches the receiver name** in `route.receiver`, meaning alerts will be **sent here**. 955 | 956 | --- 957 | 958 | #### **🔹 Email Notification (`emailConfigs`)** 959 | 960 | ```yaml 961 | emailConfigs: 962 | - to: "team@example.com" 963 | sendResolved: true 964 | ``` 965 | 966 | - **`to: "team@example.com"`** → Sends alerts to this **email address**. 967 | - **`sendResolved: true`** → Sends **another email** when the issue is **fixed**. 968 | 969 | 📌 **Why use `sendResolved: true`?** 970 | 971 | - Without it, users only get an alert when an issue occurs. 972 | - With it, users **also get a notification when the issue is resolved**. 973 | 974 | --- 975 | 976 | ### **🔹 Other Notification Methods** 977 | 978 | Instead of **emailConfigs**, you can also use **Slack, PagerDuty, or Webhooks**: 979 | 980 | ✅ **Slack Notification Example** 981 | 982 | ```yaml 983 | slackConfigs: 984 | - channel: "#alerts" 985 | apiURL: "https://hooks.slack.com/services/XXX/YYY/ZZZ" 986 | sendResolved: true 987 | ``` 988 | 989 | ✅ **PagerDuty Notification Example** 990 | 991 | ```yaml 992 | pagerdutyConfigs: 993 | - serviceKey: "your-pagerduty-key" 994 | sendResolved: true 995 | ``` 996 | 997 | 📌 **Why use multiple receivers?** 998 | 999 | - Send **Critical alerts to PagerDuty**. 1000 | - Send **Warning alerts to Slack**. 1001 | - Send **All alerts to email**. 1002 | 1003 | --- 1004 | 1005 | ### **Summary** 1006 | 1007 | ✅ This **AlertmanagerConfig** will: 1008 | 1009 | 1. **Group alerts by severity** (`Critical`, `Warning`). 1010 | 2. **Delay the first alert by 30s** to prevent spam. 1011 | 3. **Send only one alert every 5 minutes** for ongoing issues. 1012 | 4. **Repeat unresolved alerts every 12 hours**. 1013 | 5. **Send alerts to `team@example.com` via email**. 1014 | 6. **Notify when an alert is resolved** (`sendResolved: true`). 
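To see the configuration Alertmanager actually ended up with (a hedged check — the Service name below follows this guide's naming convention, so confirm it first with `kubectl get svc -n monitoring`), port-forward the Alertmanager UI and inspect its **Status** page:

```bash
# Expose the Alertmanager UI locally on port 9093
kubectl port-forward svc/prometheus-stack-alertmanager -n monitoring 9093:9093
```

Then open **[http://localhost:9093](http://localhost:9093)** and check that the routes and receivers from your AlertmanagerConfig appear in the merged configuration.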
1015 | 1016 | --- 1017 | 1018 | ### **Real-World Example** 1019 | 1020 | Imagine you are running a **Kubernetes cluster**, and your **API server crashes**. 1021 | 1022 | 1. Prometheus detects the failure (`up == 0`). 1023 | 2. The **PrometheusRule** fires an alert. 1024 | 3. The **AlertmanagerConfig**: 1025 | - Waits 30s to group similar alerts. 1026 | - Sends an email to ****. 1027 | - If the issue is not resolved, it **reminds the team every 12 hours**. 1028 | 4. When the API server **recovers**, an email **confirmation is sent**. 1029 | 1030 | --- 1031 | 1032 | ## **Author & Community** 1033 | 1034 | This project is crafted by **[Harshhaa](https://github.com/NotHarshhaa)** 💡. 1035 | I’d love to hear your feedback! Feel free to share your thoughts. 1036 | 1037 | --- 1038 | 1039 | ### **Connect with me:** 1040 | 1041 | [![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://linkedin.com/in/harshhaa-vardhan-reddy) [![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/NotHarshhaa) [![Telegram](https://img.shields.io/badge/Telegram-26A5E4?style=for-the-badge&logo=telegram&logoColor=white)](https://t.me/prodevopsguy) [![Dev.to](https://img.shields.io/badge/Dev.to-0A0A0A?style=for-the-badge&logo=dev.to&logoColor=white)](https://dev.to/notharshhaa) [![Hashnode](https://img.shields.io/badge/Hashnode-2962FF?style=for-the-badge&logo=hashnode&logoColor=white)](https://hashnode.com/@prodevopsguy) 1042 | 1043 | --- 1044 | 1045 | ### 📢 **Stay Connected** 1046 | 1047 | ![Follow Me](https://imgur.com/2j7GSPs.png) 1048 | -------------------------------------------------------------------------------- /Prometheus-lab/k8s-yaml/Alertmanagerconfig.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: monitoring.coreos.com/v1 2 | kind: AlertmanagerConfig 3 | metadata: 4 | name: alertmanager-config 5 | labels: 6 | release: prometheus # Ensures correct association with Prometheus instance 7 | spec: 8 | route: 9 | groupBy: ["alertname", "severity"] # Group alerts by alert name and severity level 10 | groupWait: 30s # Wait time before sending grouped alerts 11 | groupInterval: 5m # Interval between alert groups 12 | repeatInterval: 12h # Frequency of repeated alerts for unresolved issues 13 | receiver: "default-receiver" # Default receiver for alerts 14 | 15 | receivers: 16 | - name: "email-team" # Email notification configuration 17 | emailConfigs: 18 | - to: "team@example.com" # Replace with actual team email 19 | from: "alerts@yourdomain.com" # Replace with your email sender 20 | smarthost: "smtp.yourmail.com:587" # Replace with SMTP server 21 | authUsername: "your-username" 22 | authPassword: 23 | name: email-secret 24 | key: password 25 | sendResolved: true # Send notifications when issues are resolved 26 | 27 | - name: "slack-notifications" # Slack notification configuration 28 | slackConfigs: 29 | - channel: "#alerts" # Replace with Slack channel name 30 | apiURL: 31 | name: slack-secret 32 | key: webhook-url 33 | sendResolved: true 34 | title: "{{ .CommonAnnotations.summary }}" 35 | text: "{{ .CommonAnnotations.description }}" 36 | 37 | - name: "webhook-receiver" # Webhook integration (e.g., PagerDuty, custom APIs) 38 | webhookConfigs: 39 | - url: "https://your-webhook-url.com/alert" 40 | sendResolved: true 41 | -------------------------------------------------------------------------------- 
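Note that the AlertmanagerConfig above pulls its SMTP password and Slack webhook URL from Kubernetes Secrets. A minimal, hedged sketch of creating them (the names and keys match the manifest, the values are placeholders, and the Secrets must live in the same namespace as the AlertmanagerConfig):

```bash
# Secret backing emailConfigs.authPassword
kubectl create secret generic email-secret --from-literal=password='your-smtp-password'

# Secret backing slackConfigs.apiURL
kubectl create secret generic slack-secret --from-literal=webhook-url='https://hooks.slack.com/services/XXX/YYY/ZZZ'
```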
/Prometheus-lab/k8s-yaml/Deployment.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: my-deployment # Name of the deployment 5 | labels: 6 | app: api # Label for identifying the application type 7 | spec: 8 | replicas: 3 # Ensures high availability by running 3 replicas 9 | strategy: 10 | type: RollingUpdate # Enables zero-downtime updates 11 | rollingUpdate: 12 | maxUnavailable: 1 # Only 1 pod can be unavailable during an update 13 | maxSurge: 1 # Allows 1 extra pod to be created during an update 14 | selector: 15 | matchLabels: 16 | app: api # Match labels to select pods for this deployment 17 | template: 18 | metadata: 19 | labels: 20 | app: api # Labels for pods created by this template 21 | spec: 22 | containers: 23 | - name: my-container # Name of the container 24 | image: panchanandevops/myexpress:v0.1.0 # Docker image for the container 25 | imagePullPolicy: Always # Ensures the latest image is pulled on every deployment 26 | resources: 27 | limits: 28 | memory: "256Mi" # Increased memory for better performance 29 | cpu: "500m" # CPU limit for the container 30 | requests: 31 | memory: "128Mi" # Requested memory for initial allocation 32 | cpu: "250m" # Requested CPU to ensure smooth operation 33 | ports: 34 | - containerPort: 3000 # Port on which the container will listen 35 | readinessProbe: # Ensures the pod is ready before traffic is sent 36 | httpGet: 37 | path: /health 38 | port: 3000 39 | initialDelaySeconds: 5 40 | periodSeconds: 10 41 | livenessProbe: # Checks if the application is still running 42 | httpGet: 43 | path: /health 44 | port: 3000 45 | initialDelaySeconds: 10 46 | periodSeconds: 20 47 | restartPolicy: Always # Ensures the container restarts on failure 48 | nodeSelector: 49 | kubernetes.io/os: linux # Ensures the deployment runs on Linux nodes 50 | -------------------------------------------------------------------------------- /Prometheus-lab/k8s-yaml/PrometheusRule.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: monitoring.coreos.com/v1 2 | kind: PrometheusRule 3 | metadata: 4 | name: api-prometheus-rule 5 | labels: 6 | release: prometheus # Label for release association 7 | spec: 8 | groups: 9 | - name: api-alerts # Improved naming convention for clarity 10 | rules: 11 | # Alert for when the target instance is down 12 | - alert: InstanceDown 13 | expr: up == 0 14 | for: 1m # Alert only triggers if the instance is down for 1 minute 15 | labels: 16 | severity: critical 17 | team: backend 18 | annotations: 19 | summary: "Instance {{ $labels.instance }} is down" 20 | description: "Prometheus target instance {{ $labels.instance }} has been unreachable for over 1 minute." 21 | 22 | # Alert for high CPU usage (above 80% for 5 minutes) 23 | - alert: HighCPUUsage 24 | expr: (100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80 25 | for: 5m 26 | labels: 27 | severity: warning 28 | team: backend 29 | annotations: 30 | summary: "High CPU Usage on {{ $labels.instance }}" 31 | description: "CPU usage on instance {{ $labels.instance }} has been above 80% for the last 5 minutes." 
32 | 33 | # Alert for low memory availability (less than 10% free for 5 minutes) 34 | - alert: LowMemory 35 | expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10 36 | for: 5m 37 | labels: 38 | severity: warning 39 | team: backend 40 | annotations: 41 | summary: "Low Available Memory on {{ $labels.instance }}" 42 | description: "Available memory on instance {{ $labels.instance }} is below 10% for over 5 minutes." 43 | 44 | # Alert for high disk usage (above 90% for 10 minutes) 45 | - alert: HighDiskUsage 46 | expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10 47 | for: 10m 48 | labels: 49 | severity: critical 50 | team: backend 51 | annotations: 52 | summary: "High Disk Usage on {{ $labels.instance }}" 53 | description: "Disk usage on instance {{ $labels.instance }} has exceeded 90% for the last 10 minutes." 54 | -------------------------------------------------------------------------------- /Prometheus-lab/k8s-yaml/Service-monitor.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: monitoring.coreos.com/v1 2 | kind: ServiceMonitor 3 | metadata: 4 | name: api-service-monitor 5 | labels: 6 | release: prometheus # Ensures association with Prometheus 7 | app: prometheus 8 | team: backend # Assigns the monitor to a specific team 9 | spec: 10 | jobLabel: api-monitor # More descriptive job label 11 | selector: 12 | matchLabels: 13 | app: api # Matches the app label for monitoring 14 | namespaceSelector: 15 | matchNames: 16 | - default # Ensures the correct namespace is targeted (modify as needed) 17 | endpoints: 18 | - port: web # Monitored port 19 | path: /swagger-stats/metrics # Path where Prometheus scrapes metrics 20 | interval: 15s # Scrape interval (adjust based on load) 21 | scrapeTimeout: 10s # Timeout for scraping metrics 22 | honorLabels: true # Preserves metric labels from the application 23 | relabelings: # Modifies metric labels dynamically 24 | - sourceLabels: [__meta_kubernetes_pod_node_name] 25 | targetLabel: node 26 | - sourceLabels: [__meta_kubernetes_namespace] 27 | targetLabel: namespace 28 | - sourceLabels: [__meta_kubernetes_pod_name] 29 | targetLabel: pod 30 | - sourceLabels: [__meta_kubernetes_pod_container_name] 31 | targetLabel: container 32 | - sourceLabels: [__meta_kubernetes_pod_label_app] 33 | targetLabel: app 34 | - sourceLabels: [__meta_kubernetes_pod_label_release] 35 | targetLabel: release 36 | - sourceLabels: [__meta_kubernetes_pod_label_team] 37 | targetLabel: team 38 | - sourceLabels: [__meta_kubernetes_pod_label_version] 39 | targetLabel: version 40 | - sourceLabels: [__meta_kubernetes_pod_label_component] 41 | targetLabel: component 42 | - sourceLabels: [__meta_kubernetes_pod_label_managed_by] 43 | targetLabel: managed_by 44 | - sourceLabels: [__meta_kubernetes_pod_label_created_by] 45 | targetLabel: created_by 46 | - sourceLabels: [__meta_kubernetes_pod_label_deployment] 47 | targetLabel: deployment 48 | - sourceLabels: [__meta_kubernetes_pod_label_service] 49 | targetLabel: service 50 | - sourceLabels: [__meta_kubernetes_pod_label_environment] 51 | targetLabel: environment 52 | - sourceLabels: [__meta_kubernetes_pod_label_tier] 53 | targetLabel: tier 54 | - sourceLabels: [__meta_kubernetes_pod_label_partition] 55 | targetLabel: partition 56 | - sourceLabels: [__meta_kubernetes_pod_label_track] 57 | targetLabel: track 58 | - sourceLabels: [__meta_kubernetes_pod_label_role] 59 | targetLabel: role 60 | - sourceLabels: 
[__meta_kubernetes_pod_label_zone] 61 | targetLabel: zone 62 | - sourceLabels: [__meta_kubernetes_pod_label_region] 63 | targetLabel: region 64 | - sourceLabels: [__meta_kubernetes_pod_label_cluster] 65 | targetLabel: cluster 66 | -------------------------------------------------------------------------------- /Prometheus-lab/k8s-yaml/Service.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Service 3 | metadata: 4 | name: my-service 5 | labels: 6 | job: node-api # Identifies the job associated with this service 7 | app: api 8 | environment: production # Helps differentiate between environments (dev/staging/prod) 9 | spec: 10 | type: ClusterIP # Internal service (change to NodePort or LoadBalancer if needed) 11 | selector: 12 | app: api # Matches pods labeled with 'app: api' 13 | ports: 14 | - name: http # Descriptive name for the port 15 | protocol: TCP 16 | port: 3000 # Exposed service port 17 | targetPort: 3000 # Corresponding container port 18 | sessionAffinity: None # Ensures no sticky sessions (modify if needed) 19 | sessionAffinityConfig: 20 | clientIP: 21 | timeoutSeconds: 10800 # 3 hours 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🚀 **Learning Prometheus: A Complete Guide for Kubernetes Monitoring** 2 | 3 | ![Prometheus](https://imgur.com/0lYXGvg.png) 4 | 5 | ## 🔍 **Master Prometheus for Real-Time Monitoring & Observability** 6 | 7 | ![Prometheus](https://imgur.com/EZe96QW.png) 8 | 9 | This repository is dedicated to learning, implementing, and deploying **Prometheus** for monitoring Kubernetes environments. Whether you're a beginner or an experienced DevOps engineer, this guide will help you master Prometheus with real-world use cases. 10 | 11 | --- 12 | 13 | ## 📌 **Repository Structure** 14 | 15 | ### 📂 **1. Prometheus-lab/** 16 | 17 | This directory contains hands-on labs and YAML manifest files for deploying Prometheus in Kubernetes. 18 | 19 | #### 📌 **k8s-yaml/** *(Kubernetes Deployment Manifests)* 20 | 21 | - `Alertmanagerconfig.yaml` - Configuration for Prometheus Alertmanager to handle alerts. 22 | - `Deployment.yaml` - Defines the Prometheus deployment in Kubernetes. 23 | - `PrometheusRule.yaml` - Alerting rules for Prometheus monitoring. 24 | - `Service-monitor.yaml` - ServiceMonitor definition for scraping Prometheus metrics. 25 | - `Service.yaml` - Kubernetes service to expose Prometheus. 26 | - `README.md` - Documentation on setting up Prometheus in Kubernetes. 27 | 28 | ### 📂 **2. promql-img/** 29 | 30 | - This folder contains images used to explain PromQL queries and dashboard visualizations. 31 | 32 | ### 📜 **3. promQl.md** 33 | 34 | - A guide to **PromQL (Prometheus Query Language)**, including syntax, functions, and real-world query examples. 35 | 36 | ### 📜 **4. prometheus_setup.md** 37 | 38 | - Step-by-step instructions for installing and setting up Prometheus. 39 | 40 | ### 📜 **5. README.md** *(This file)* 41 | 42 | - The main documentation file for understanding the structure and content of the repository. 
43 | 44 | --- 45 | 46 | ## 📖 **Detailed Learning Guide** 47 | 48 | 📌 **Read the full tutorial here:** 49 | 🔗 **[Real-world Prometheus Deployment: A Practical Guide for Kubernetes Monitoring](https://blog.prodevopsguy.xyz/real-world-prometheus-deployment-a-practical-guide-for-kubernetes-monitoring)** 50 | 51 | --- 52 | 53 | ## 🚀 **What You'll Learn?** 54 | 55 | ✅ **Prometheus Fundamentals:** Understand Prometheus architecture, data collection, and querying. 56 | ✅ **Kubernetes Monitoring:** Learn how to integrate Prometheus with Kubernetes for system metrics and application observability. 57 | ✅ **PromQL (Prometheus Query Language):** Master querying techniques for efficient monitoring and alerting. 58 | ✅ **Grafana Integration:** Visualize Prometheus metrics using Grafana dashboards. 59 | ✅ **Alerting & Notifications:** Set up alert rules and integrate with Slack, Email, and other services. 60 | ✅ **Custom Exporters:** Learn to create and configure custom exporters for collecting application-specific metrics. 61 | ✅ **Scaling Prometheus:** Implement high-availability and federation strategies. 62 | 63 | --- 64 | 65 | ## **Code of Conduct** 66 | 67 | > [!CAUTION] 68 | > 69 | > We are committed to fostering a welcoming and respectful environment for all contributors. Please take a moment to review our [Code of Conduct](./CODE_OF_CONDUCT.md) before participating in this community. 70 | 71 | --- 72 | 73 | ## **Contribute and Collaborate** 74 | 75 | > [!TIP] 76 | > This repository thrives on community contributions and collaboration. Here’s how you can get involved: 77 | > 78 | > - **Fork the Repository:** Create your own copy of the repository to work on. 79 | > - **Submit Pull Requests:** Contribute your projects or improvements to existing projects by submitting pull requests. 80 | > - **Engage with Others:** Participate in discussions, provide feedback on others’ projects, and collaborate to create better solutions. 81 | > - **Share Your Knowledge:** If you’ve developed a new project or learned something valuable, share it with the community. Your contributions can help others in their learning journey. 82 | 83 | --- 84 | 85 | ## **Join the Community** 86 | 87 | > [!IMPORTANT] 88 | > We encourage you to be an active part of our community: 89 | > 90 | > - **Join Our Telegram Community:** Connect with fellow DevOps enthusiasts, ask questions, and share your progress in our [Telegram group](https://t.me/prodevopsguy). 91 | > - **Follow Me on GitHub:** Stay updated with new projects and content by [following me on GitHub](https://github.com/NotHarshhaa). 92 | 93 | --- 94 | 95 | ## **Hit the Star!** ⭐ 96 | 97 | **If you find this repository helpful and plan to use it for learning, please give it a star. Your support is appreciated!** 98 | 99 | --- 100 | 101 | ## 🛠️ **Author & Community** 102 | 103 | This project is crafted by **[Harshhaa](https://github.com/NotHarshhaa)** 💡. 104 | I’d love to hear your feedback! Feel free to share your thoughts. 
105 | 106 | --- 107 | 108 | ### 📧 **Connect with me:** 109 | 110 | [![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://linkedin.com/in/harshhaa-vardhan-reddy) [![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/NotHarshhaa) [![Telegram](https://img.shields.io/badge/Telegram-26A5E4?style=for-the-badge&logo=telegram&logoColor=white)](https://t.me/prodevopsguy) [![Dev.to](https://img.shields.io/badge/Dev.to-0A0A0A?style=for-the-badge&logo=dev.to&logoColor=white)](https://dev.to/notharshhaa) [![Hashnode](https://img.shields.io/badge/Hashnode-2962FF?style=for-the-badge&logo=hashnode&logoColor=white)](https://hashnode.com/@prodevopsguy) 111 | 112 | --- 113 | 114 | ### 📢 **Stay Connected** 115 | 116 | ![Follow Me](https://imgur.com/2j7GSPs.png) 117 | -------------------------------------------------------------------------------- /promQl.md: -------------------------------------------------------------------------------- 1 | - [Decoding PromQL: A Deep Dive into Prometheus Query Language](#decoding-promql-a-deep-dive-into-prometheus-query-language) 2 | - [Introduction to PromQL](#introduction-to-promql) 3 | - [Data Types of PromQL](#data-types-of-promql) 4 | - [Scalar:](#scalar) 5 | - [String:](#string) 6 | - [Instant Vector:](#instant-vector) 7 | - [Range Vector:](#range-vector) 8 | - [Operators in PromQL](#operators-in-promql) 9 | - [Aggregation Operators:](#aggregation-operators) 10 | - [Binary Operators:](#binary-operators) 11 | - [Range Operator:](#range-operator) 12 | - [Offset Operator:](#offset-operator) 13 | - [Types of Prometheus Metrics for Data Storage and Organization](#types-of-prometheus-metrics-for-data-storage-and-organization) 14 | - [Counter:](#counter) 15 | - [Gauge:](#gauge) 16 | - [Histogram:](#histogram) 17 | - [Summary:](#summary) 18 | - [Begin Your Monitoring Journey!](#begin-your-monitoring-journey) 19 | --- 20 | 21 | 22 | # Decoding PromQL: A Deep Dive into Prometheus Query Language 23 | --- 24 | ## Introduction to PromQL 25 | 26 | PromQL, short for Prometheus Query Language, is the dedicated language designed for **querying** and **extracting valuable insights** from the **time-series** data stored in Prometheus. As the backbone of Prometheus' querying capabilities, PromQL enables users to navigate and analyze metrics, providing a powerful tool for **monitoring** and **troubleshooting.** This section will provide a foundational understanding of PromQL, setting the stage for exploring its various aspects and applications in the realm of system observability. 27 | 28 | ## Data Types of PromQL 29 | 30 | PromQL has **scalar**, **instant vector**, **range vector**, **string**, and **boolean** data types for **querying** and **analyzing** Prometheus metrics. 31 | 32 | Lets explore each in detail. 33 | 34 | ### Scalar: 35 | 36 | The scalar data type in Prometheus Query Language (PromQL) represents a single numeric value at a specific point in time. Scalars are fundamental to expressing instantaneous measurements or metrics that don't vary over a range. 37 | 38 | **Key characteristics of the scalar data type:** 39 | 40 | 1. **Single Value:** Scalars represent a solitary numeric value at a specific instance in time, offering a snapshot of a metric's value without time series information. 41 | 42 | 2. 
**Direct Representation:** Scalars are direct representations of metrics' current states, making them suitable for instantaneous measurements such as the current CPU load, available memory, or a count at a specific time. 43 | 44 | 3. **Fundamental Building Block:** Scalars serve as fundamental components for arithmetic operations, comparisons, and as basic inputs for computations and analyses within PromQL. 45 | 46 | **Here are examples illustrating the characteristics of the scalar data type in PromQL:** 47 | 48 | 49 | 1. **Single Value:** 50 | 51 | ``` 52 | cpu_temperature 53 | ``` 54 | - Returns the current temperature of the CPU as a scalar value, representing the temperature at the latest timestamp. 55 | 56 | 57 | 2. **Direct Representation:** 58 | 59 | ``` 60 | available_memory_bytes 61 | ``` 62 | - The query returns the current available memory in bytes as a scalar, providing a direct representation of the available memory at the latest observation. 63 | 64 | 65 | 3. **Fundamental Building Block:** 66 | 67 | ``` 68 | cpu_usage + memory_usage 69 | ``` 70 | - In this example, the query adds the current CPU usage and memory usage, leveraging scalars as fundamental building blocks for arithmetic operations and analysis. 71 | 72 | ### String: 73 | 74 | The string data type in Prometheus represents a sequence of characters and is commonly used for labeling and metadata in the metric data model. 75 | 76 | **Key characteristics of the string data type in Prometheus:** 77 | 78 | 1. **Character Sequence:** Strings represent a sequence of characters, which can include letters, numbers, symbols, and whitespace. 79 | 80 | 2. **Labeling:** Strings are commonly used as labels in the metric data model, providing additional context or categorization for time series data. 81 | 82 | 3. **Metadata:** Strings serve as a means to convey metadata information, such as labels describing service names, instance identifiers, or any descriptive information associated with a metric. 83 | 84 | 4. **Quoting:** Strings in PromQL are enclosed in either single quotes (`'`) or double quotes (`"`), and they are used to define label values or string literals in queries. 85 | 86 | **Here are examples illustrating the characteristics of the string data type in Prometheus:** 87 | 88 | 1. **Character Sequence:** 89 | ```plaintext 90 | 'service_name' 91 | ``` 92 | - This string literal represents the character sequence 'service_name' and might be used as a label value. 93 | 94 | 2. **Labeling:** 95 | ```plaintext 96 | http_requests_total{environment="production"} 97 | ``` 98 | - In this query, the string "production" is used as a label value to filter for HTTP requests in the production environment. 99 | 100 | 3. **Metadata:** 101 | ```plaintext 102 | instance="webserver-01" 103 | ``` 104 | - This string label provides metadata about the specific instance 'webserver-01' associated with a metric. 105 | 106 | 4. **Quoting:** 107 | ```plaintext 108 | "metric_name" or 'metric_name' 109 | ``` 110 | - Strings in PromQL are enclosed in either single or double quotes, as shown in these examples defining the string literals "metric_name" or 'metric_name'. 111 | 112 | 5. **Usage:** 113 | ```plaintext 114 | up{job="api", environment='staging'} 115 | ``` 116 | - This query selects time series with the label "job" having the value "api" and the label "environment" having the value 'staging'. 
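
Building on the labeling examples above, string label values can also be matched with regular expressions using the `=~` and `!~` matchers (covered later under operators). A small illustrative sketch — the metric and label names here are assumptions, not taken from a specific exporter:

```plaintext
# Select series whose "environment" label matches either value.
http_requests_total{environment=~"production|staging"}

# Exclude jobs whose name starts with "test-".
up{job!~"test-.*"}
```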
117 | 118 | ### Instant Vector: 119 | 120 | The instant vector in Prometheus is a set of time series data, each associated with a single value at a specific point in time. Instant vectors are commonly used in PromQL queries to retrieve and analyze metric values at a specific timestamp. 121 | 122 | **Key characteristics of the instant vector:** 123 | 124 | 1. **Snapshot in Time:** Instant vectors represent a snapshot of metric values at a specific timestamp, providing a point-in-time view of the data. 125 | 126 | 2. **Single Value per Time Series:** Each time series within an instant vector contains a single value corresponding to the specified timestamp. 127 | 128 | 3. **Time Series Selection:** Queries using instant vectors can filter and aggregate time series based on labels, allowing for targeted analysis of specific metrics. 129 | 130 | 4. **Mathematical Operations:** Instant vectors can be involved in mathematical operations, enabling calculations, comparisons, and transformations within PromQL queries. 131 | 132 | 133 | **Here are examples illustrating the characteristics of the instant vector data type in Prometheus:** 134 | 135 | 136 | 1. **Current CPU Usage:** 137 | ```plaintext 138 | cpu_usage 139 | ``` 140 | - This query returns an instant vector representing the current CPU usage for all relevant time series. 141 | 142 | 2. **High Memory Utilization Instances:** 143 | ```plaintext 144 | node_memory_MemUsage_bytes > 80e9 145 | ``` 146 | - The query filters instant vectors to select time series where the memory usage exceeds 80 gigabytes. 147 | 148 | 3. **Rate of HTTP Requests:** 149 | ```plaintext 150 | rate(http_requests_total[5m]) 151 | ``` 152 | - This query calculates the per-second rate of HTTP requests over the last 5 minutes, returning an instant vector. 153 | 154 | 4. **Combined Network Traffic:** 155 | ```plaintext 156 | sum(rate(network_traffic_bytes{direction="in"}[1h])) + sum(rate(network_traffic_bytes{direction="out"}[1h])) 157 | ``` 158 | - The query computes the sum of the rates of incoming and outgoing network traffic over the last hour, providing an instant vector. 159 | 160 | ### Range Vector: 161 | 162 | The Range Vector in Prometheus is a set of time series data, each associated with a range of values over a specified time interval. Range Vectors are commonly used in PromQL queries to analyze and evaluate metrics over a duration, allowing for calculations, aggregations, and comparisons over time. 163 | 164 | Key characteristics of the Range Vector data type: 165 | 166 | 1. **Time Series Over a Range:** 167 | - The Range Vector provides a set of time series data, each representing a range of values over a specified time interval. 168 | 169 | 2. **Single Value per Time Series per Timestamp:** 170 | - Each time series within a Range Vector contains a set of values corresponding to multiple timestamps within the specified interval. 171 | 172 | 3. **Time Series Selection:** 173 | - Similar to instant vectors, Range Vectors can filter and aggregate time series based on labels for targeted analysis. 174 | 175 | 4. **Time Shifts:** 176 | - PromQL allows for time shifting operations on Range Vectors, such as using the offset modifier to shift the time range, enabling comparison of values at different points in time. 177 | 178 | 5. **Alerting Conditions:** 179 | - Range Vectors are commonly used in alerting conditions to detect and trigger alerts based on abnormal behavior or patterns observed over a defined duration. 

**Here are examples illustrating the characteristics of the Range Vector data type in Prometheus:**

1. **Sum of HTTP Request Rates over the Last 5 Minutes:**
   ```plaintext
   sum(rate(http_requests_total[5m]))
   ```
   - This query calculates the sum of the per-second rate of HTTP requests over the last 5 minutes.

2. **Average CPU Usage over the Last Hour:**
   ```plaintext
   avg_over_time(cpu_usage_percent[1h])
   ```
   - The query computes the average CPU usage percentage over the last hour. Note that `avg_over_time()` works on a range vector; plain `avg()` only aggregates instant vectors.

3. **Total Disk Space Used in Bytes, as of 30 Minutes Ago:**
   ```plaintext
   sum(node_filesystem_size_bytes - node_filesystem_free_bytes) offset 30m
   ```
   - This query determines the total disk space used by subtracting free space from total size, evaluated as it was 30 minutes ago via the `offset` modifier.

4. **Rate of Error Responses in the Past 15 Minutes:**
   ```plaintext
   rate(http_responses_error_total[15m])
   ```
   - The query calculates the per-second rate of error responses in HTTP requests over the past 15 minutes.

5. **Changes in Available Memory over the Last 10 Minutes:**
   ```plaintext
   changes(node_memory_MemAvailable_bytes[10m])
   ```
   - This query identifies the number of times the available-memory value changed over the last 10 minutes.

6. **90th Percentile Response Time for API Requests in the Last 20 Minutes:**
   ```plaintext
   histogram_quantile(0.9, sum by (le) (rate(api_request_duration_seconds_bucket[20m])))
   ```
   - The query computes the 90th percentile response time for API requests over the last 20 minutes from histogram buckets.

### Boolean:

The boolean data type in Prometheus represents true or false values and is commonly used in logical expressions and conditions within PromQL queries.

**Key characteristics of the boolean data type:**

1. **True or False:** Booleans can have two possible values: true or false, representing binary logic.

2. **Logical Operators:** Boolean conditions are combined with the logical/set operators `and`, `or`, and `unless` to construct conditional expressions in PromQL queries.

3. **Comparison Operators:** Boolean expressions often involve comparison operators like `==`, `!=`, `<`, `>`, `<=`, and `>=` to evaluate conditions based on metric values.

4. **Filtering Conditions:** Booleans are used to filter time series data based on specific conditions, allowing for selective analysis and alerting.

**Here are examples illustrating the characteristics of the Boolean data type in Prometheus:**

1. **Combining Conditions:**
   ```plaintext
   up == 1 and http_requests_total > 100
   ```
   - This query combines two conditions using the logical AND operator, checking if the `up` metric is equal to 1 and the `http_requests_total` metric is greater than 100.

2. **Negating Conditions:**
   ```plaintext
   up{job="api"} != 1
   ```
   - Here, the `!=` comparison keeps only the time series for the `api` job whose `up` value is not 1. PromQL has no standalone `not` operator; negation is expressed with `!=`, `!~`, or `unless`.

3. **Conditional Expression:**
   ```plaintext
   rate(http_requests_total[5m]) > 10 or (node_memory_MemFree_bytes / node_memory_MemTotal_bytes) < 0.2
   ```
   - This query uses a conditional expression, checking if the per-second rate of HTTP requests over the last 5 minutes is greater than 10 or if the ratio of free memory to total memory is less than 0.2.
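
One related behavior is worth keeping in mind: by default, comparison operators *filter* series (non-matching series are simply dropped) rather than returning a true/false value. Adding the `bool` modifier makes the comparison return 0 or 1 instead, which is useful for dashboards and for doing arithmetic on conditions. A minimal sketch using the standard `up` metric:

```plaintext
# Without bool: only series where up equals 1 are returned (filtering).
up == 1

# With bool: every series is returned, with a value of 1 (true) or 0 (false).
up == bool 1
```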

---

## Operators in PromQL

### Aggregation Operators:

Aggregation operators take an **instant vector** as **input** and **return** a new instant vector with aggregated values as their **output**:
`<aggregation>(<instant vector>) => <instant vector>`

PromQL includes various aggregation operators, along with closely related functions, that serve different purposes in summarizing and analyzing time series data. Here are some common types (a grouping example follows the list):

1. **Basic Aggregation:**
   - `sum()`: Calculates the total sum of values across time series.
   - `avg()`: Computes the average value across time series.
   - `min()`: Identifies the minimum value across time series.
   - `max()`: Identifies the maximum value across time series.
   - `count()`: Counts the number of time series matching a given condition.

2. **Rate and Increase** (strictly functions that take a range vector, commonly combined with aggregation):
   - `rate()`: Calculates the per-second rate of increase for counters.
   - `increase()`: Computes the total increase in a counter over a specified time range.

3. **Statistical Aggregation:**
   - `stddev()`: Computes the standard deviation of values across time series.
   - `quantile()`: Calculates specified quantiles (e.g., 90th percentile) of values.

4. **Top and Bottom K:**
   - `topk()`: Identifies the top k time series based on a specified metric.
   - `bottomk()`: Identifies the bottom k time series based on a specified metric.

5. **Time Series Aggregation** (functions that take a range vector):
   - `sum_over_time()`: Aggregates total sums over a specified time range.
   - `avg_over_time()`: Aggregates average values over a specified time range.
   - `min_over_time()`: Aggregates minimum values over a specified time range.
   - `max_over_time()`: Aggregates maximum values over a specified time range.
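
As referenced above, aggregation operators can be combined with `by` and `without` clauses to control which labels are kept while grouping. A brief sketch — the metric and label names here are illustrative assumptions rather than the output of a specific exporter:

```plaintext
# Per-instance request rate, keeping only the "instance" label.
sum by (instance) (rate(http_requests_total[5m]))

# The same idea expressed by dropping labels instead of keeping them.
sum without (method, status) (rate(http_requests_total[5m]))

# Top 3 instances by average CPU usage.
topk(3, avg by (instance) (cpu_usage_percent))
```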

### Binary Operators:

1. **Arithmetic Binary Operators:**
   - `+` (Addition), `-` (Subtraction), `*` (Multiplication), `/` (Division)

2. **Comparison Binary Operators:**
   - `==` (Equal), `!=` (Not Equal), `<` (Less Than), `>` (Greater Than), `<=` (Less Than or Equal), `>=` (Greater Than or Equal)

3. **Logical/Set Binary Operators:**
   - `and` (Intersection), `or` (Union), `unless` (Complement — keeps series on the left that have no match on the right)

4. **Label Matchers:**
   - `=~` (Regex Match), `!~` (Negative Regex Match) — strictly these are selector matchers used inside `{}`, not binary operators between vectors.

5. **Mathematical Binary Operators:**
   - `^` (Exponentiation), `%` (Modulo)

### Range Operator:

With the range operator, you specify a time duration in square brackets to select samples between now and that far back in time:

`[<time duration>]`

**Time Durations Format:**
Time durations are specified as a number, followed immediately by one of the following units:

- `ms` - milliseconds
- `s` - seconds
- `m` - minutes
- `h` - hours
- `d` - days (assuming a day always has 24h)
- `w` - weeks (assuming a week always has 7d)
- `y` - years (assuming a year always has 365d)

### Offset Operator:

The `offset` modifier is appended to a selector to evaluate it a fixed amount of time before the query's evaluation time.

1. **`offset` for Rate Shifting:**

   - Shifts the time range for rate comparison, allowing historical analysis.

   ```
   rate(http_requests_total[5m]) > rate(http_requests_total[5m] offset 1h)
   ```

2. **`offset` for Value Comparison:**
   - Compares metric values at different points in time, aiding in trend analysis.

   ```
   cpu_temperature > cpu_temperature offset 1d
   ```

3. **`offset` for Rate of Change:**
   - Uses offset to compare the current rate of change with the rate one hour earlier.

   ```
   rate(cpu_usage[1h]) - rate(cpu_usage[1h] offset 1h)
   ```

4. **`offset` for Historical Comparisons:**
   - Enables historical comparisons of metric values against a point in the past.

   ```
   http_requests_total > http_requests_total offset 7d
   ```


## Filter Data with the @ modifier in PromQL

The `@` modifier in PromQL is used to query the value of a time series at a specific timestamp. It allows retrieving the value of a metric at a precise point in time.

Let's understand it through an example:

```plaintext
http_requests_total @ 1632315600
```

In this example, `@ 1632315600` retrieves the value of the `http_requests_total` metric at the UNIX timestamp 1632315600 (representing a specific moment in time).

This enables querying historical or specific values of metrics at exact timestamps for analysis or comparison purposes.

---

## Types of Prometheus Metrics for Data Storage and Organization

Prometheus metrics primarily come in four types: **Counter**, **Gauge**, **Histogram**, and **Summary**.

Let's explore each in detail.

### Counter:

A Prometheus Counter is a metric type that represents a **cumulative value** that can **only monotonically increase** over time.

<div align="center">
389 | Counter Example 390 |
391 | 392 | 393 | **Key characteristics of a Prometheus Counter:** 394 | 395 | 1. **Monotonicity:** Counters always move in the positive direction, starting from zero and increasing over time. They do not decrease. 396 | 397 | 2. **Cumulative:** Counters represent cumulative values, making them suitable for tracking totals or counts of events that continuously accumulate. 398 | 399 | 3. **No Arbitrary Units:** Counters are dimensionless and have no specific unit attached to them. They represent a simple count or quantity. 400 | 401 | 4. **Common Use Cases:** Counters are often used for measuring the total number of occurrences of an event, such as request counts, error counts, or other cumulative metrics in a system. 402 | 403 | 5. **Querying for Rate:** Derivative operations in queries are commonly applied to counters to calculate rates of change over time, providing insights into the frequency of events. 404 | 405 | **Example Queries** 406 | 407 | 1. **Total Count:** 408 | ```plaintext 409 | http_requests_total 410 | ``` 411 | 412 | - Returns the total count of HTTP requests. 413 | 414 | 2. **Rate of Change (Requests Per Second):** 415 | ```plaintext 416 | rate(http_requests_total[1m]) 417 | ``` 418 | - Calculates the rate of change of HTTP requests per second over the last 1 minute. 419 | 420 | 3. **Error Rate as a Percentage:** 421 | ``` 422 | 100 * (http_errors_total / http_requests_total) 423 | ``` 424 | - Calculates the percentage of HTTP requests that resulted in errors. 425 | 426 | 4. **Increase in Count Since Last Hour:** 427 | ``` 428 | increase(http_requests_total[1h]) 429 | ``` 430 | - Shows the total increase in HTTP requests count over the last hour. 431 | 432 | ### Gauge: 433 | 434 | A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. 435 | 436 | Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests. 437 | 438 | 439 |
440 | Gauge Example 441 |
442 | 443 | 444 | **Key characteristics of a Gauge:** 445 | 446 | 1. **Non-Cumulative:** Gauges do not accumulate values over time; they represent the latest observed value at a specific point in time. 447 | 448 | 2. **Fluctuating Values:** Gauges can capture fluctuations in a metric, making them ideal for metrics that may vary, such as CPU usage, memory utilization, or the number of active connections. 449 | 450 | 3. **No Automatic Resets:** Gauges retain their last observed value until a new value is recorded. They do not reset automatically, allowing continuous monitoring of changing conditions. 451 | 452 | 4. **Arbitrary Units:** Gauges can have arbitrary units based on the metric they measure. For example, a gauge measuring temperature might have units in degrees Celsius or Fahrenheit. 453 | 454 | 455 | **Example Queries** 456 | 457 | 1. **Current CPU Usage:** 458 | 459 | ``` 460 | cpu_usage 461 | ``` 462 | - Returns the current CPU usage as a gauge value. 463 | 464 | 465 | 2. **Average Memory Utilization over 5 Minutes:** 466 | 467 | ``` 468 | avg_over_time(memory_usage[5m]) 469 | ``` 470 | - Calculates the average memory utilization as a gauge value over the last 5 minutes. 471 | 472 | 3. **Number of Active Connections:** 473 | 474 | ``` 475 | http_connections 476 | ``` 477 | - Retrieves the current count of active HTTP connections as a gauge value. 478 | 479 | 4. **Disk Space Utilization:** 480 | 481 | ``` 482 | 100 - (disk_free / disk_total) * 100 483 | ``` 484 | - Calculates the disk space utilization percentage as a gauge value. 485 | 486 | ### Histogram: 487 | 488 | A Prometheus Histogram is a metric type used to sample and observe the distribution of values in a dataset. It is particularly useful for measuring the spread of data, such as response times or request latencies. Histograms automatically bucketize data into configurable ranges (buckets) and provide aggregated information about the data distribution, including count, sum, and quantiles. 489 | 490 | 491 |
492 | Heatmap Histogram 493 |
494 | 495 | 496 | **Key characteristics of a Prometheus Histogram:** 497 | 498 | 1. **Bucketization:** Histograms automatically group observed values into predefined buckets based on their magnitudes. Each bucket represents a range of values. 499 | 500 | 2. **Dynamic Ranges:** Histograms allow dynamic adjustments of bucket ranges, making them adaptable to changes in the data distribution. 501 | 502 | 3. **Non-Cumulative:** Unlike counters, histograms do not accumulate values over time. They provide a snapshot of the data distribution at the time of observation. 503 | 504 | 4. **Bucket Labels:** Each bucket is labeled with an upper bound (`le` label) representing the maximum value that falls into that bucket. 505 | 506 | 5. **Querying Percentiles:** Prometheus provides functions like **`histogram_quantile`** to query specific percentiles of a histogram, allowing users to analyze the distribution of values. 507 | 508 | 509 | **Example Queries** 510 | 511 | 1. **Average Duration:** 512 | 513 | ``` 514 | rate(http_request_duration_seconds_sum[1m]) / rate(http_request_duration_seconds_count[1m]) 515 | ``` 516 | - Calculates the average duration of HTTP requests over the last 1 minute. 517 | 518 | 2. **90th Percentile Response Time:** 519 | 520 | ``` 521 | histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[1m])) by (le)) 522 | ``` 523 | - Retrieves the 90th percentile response time of HTTP requests over the last 1 minute. 524 | 525 | 3. **Bucket Counts:** 526 | 527 | ``` 528 | http_request_duration_seconds_count 529 | ``` 530 | - Returns the count of HTTP requests in each bucket of the duration histogram. 531 | 532 | 4. **Sum of Request Durations in Top 3 Buckets:** 533 | 534 | ``` 535 | sum(http_request_duration_seconds_bucket{le=~"0.1|0.5|1.0"}) 536 | ``` 537 | - Calculates the sum of request durations in the top 3 buckets (0.1s, 0.5s, 1.0s) of the duration histogram. 538 | 539 | ### Summary: 540 | 541 | A Prometheus Summary is a metric type designed to measure and track the distribution of observed values over time, particularly for quantiles and other percentile-based analyses. Similar to histograms, summaries provide insights into the variability and spread of data, but they do so by calculating quantiles over a sliding time window. Summaries are useful for monitoring metrics with changing distributions, such as request latencies. 542 | 543 | 544 | **Key characteristics of a Summary:** 545 | 546 | 1. **Dynamic Ranges:** Summaries allow dynamic adjustments of quantile ranges, making them adaptable to changes in the data distribution. 547 | 548 | 2. **No Explicit Bucketization:** Unlike histograms, summaries do not use predefined buckets. Instead, they directly calculate quantiles from the observed values. 549 | 550 | 3. **Count and Sum:** Summaries track the count of observations and the sum of observed values over time, providing aggregated information for the entire dataset. 551 | 552 | 4. **Non-Cumulative:** Similar to histograms, summaries do not accumulate values over time. They offer a snapshot of the data distribution within the specified time window. 553 | 554 | 5. **Querying Percentiles:** Prometheus provides functions like `quantile` to query specific percentiles of a summary, enabling users to analyze the distribution of values. 555 | 556 | **Example Queries** 557 | 558 | 1. 
**Average Duration over Last 5 Minutes:** 559 | 560 | ``` 561 | rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) 562 | ``` 563 | - Calculates the average duration of HTTP requests over the last 5 minutes. 564 | 565 | 2. **90th Percentile Response Time:** 566 | 567 | ``` 568 | quantile(0.9, http_request_duration_seconds) 569 | ``` 570 | - Retrieves the 90th percentile response time of HTTP requests. 571 | 572 | 3. **Count of Requests in the Last Hour:** 573 | 574 | ``` 575 | http_request_duration_seconds_count[1h] 576 | ``` 577 | - Returns the count of HTTP requests observed in the last hour. 578 | 579 | 4. **Sum of Response Durations in the Last 10 Minutes:** 580 | 581 | ``` 582 | sum(rate(http_request_duration_seconds_sum[10m])) 583 | ``` 584 | - Calculates the sum of response durations for HTTP requests over the last 10 minutes. 585 | 586 | --- 587 | 588 | ## Begin Your Monitoring Journey! 589 | 590 | Get ready to navigate your metrics with **confidence!** Stay tuned for more insights, tips, and tricks to keep your monitoring game strong. Keep exploring, keep learning, and keep monitoring! **Happy monitoring!** 📊👀😊 591 | 592 | 593 | 594 | -------------------------------------------------------------------------------- /prometheus_setup.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 3 | - [Table of Contents](#table-of-contents) 4 | - [What Is Prometheus?](#what-is-prometheus) 5 | - [Why Do We Need Prometheus?](#why-do-we-need-prometheus) 6 | - [Prometheus Architecture: In K8S](#prometheus-architecture-in-k8s) 7 | - [Deploying Prometheus on Kubernetes](#deploying-prometheus-on-kubernetes) 8 | - [Create a Namespace for Monitoring](#create-a-namespace-for-monitoring) 9 | - [Add Helm Repository](#add-helm-repository) 10 | - [Install kube-prometheus-stack Helm Chart in monitoring Namespace](#install-kube-prometheus-stack-helm-chart-in-monitoring-namespace) 11 | - [Verify Deployment](#verify-deployment) 12 | - [Access Prometheus Dashboard](#access-prometheus-dashboard) 13 | - [Access Grafana Dashboard](#access-grafana-dashboard) 14 | - [Login with the default credentials:](#login-with-the-default-credentials) 15 | - [Begin Your Monitoring Journey! 🚀](#begin-your-monitoring-journey-) 16 | 17 | 18 | ## What Is Prometheus? 19 | 20 | Prometheus, an open-source monitoring toolkit, excels in dynamic system monitoring with a versatile data model and efficient time-series collection. Notable for its built-in alerting, adaptability, and strong community support, Prometheus empowers users to proactively manage and optimize system performance. 21 | 22 | ### Why Do We Need Prometheus? 23 | 24 | Lets understand Through a **Real-World** Example. 25 | 26 | **Scenario: Managing a Real-Time Messaging App** 27 | 28 | Imagine overseeing a real-time messaging app connecting millions worldwide. The app includes services like user authentication, message processing, and notifications. As the user base grows, ensuring smooth communication becomes a top priority. 29 | 30 | **Challenges:** 31 | 32 | 1. **Interconnected Services:** 33 | 34 | - Messaging involves many services working together. Understanding how each service affects communication is crucial but complicated. 35 | 36 | 2. **Variable Workloads:** 37 | 38 | - Messaging apps deal with fluctuating workloads, especially during peak times. Predicting the exact resources needed for optimal performance is tricky, requiring a flexible approach to scaling. 
39 | 40 | 3. **Latency and Optimization:** 41 | - Fast message delivery is vital for a great user experience. Pinpointing services causing latency issues demands detailed insights often lacking in traditional monitoring tools. 42 | 43 | **How Prometheus Helps:** 44 | 45 | 1. **Dynamic Service Discovery:** 46 | 47 | - Prometheus automatically discovers and monitors new services as the app scales. No manual setup is needed, ensuring all parts are effectively monitored. 48 | 49 | 2. **Flexible Monitoring:** 50 | 51 | - Prometheus adapts to changing workloads by collecting time-series data. This helps in closely monitoring performance and making smart decisions on resource allocation and scaling. 52 | 53 | 3. **Alerts for Latency:** 54 | - Using Prometheus's alerting, you can set rules to catch latency issues in specific services. Proactive alerts allow the team to address potential problems before users notice. 55 | 56 | ## Prometheus Architecture: In K8S 57 | 58 |
59 | Image 60 |
61 | 62 | 1. **Prometheus Server:** 63 | 64 | - Runs as a dedicated Pod in the Kubernetes cluster. 65 | - Scrapes and collects metrics from configured endpoints or services. 66 | - Utilizes Kubernetes ServiceMonitors or service discovery for dynamic service monitoring. 67 | 68 | 2. **Time-Series Database (TSDB):** 69 | 70 | - Serves as the repository for time-series data collected by Prometheus. 71 | - Configurable retention policies for efficient data storage. 72 | - Can use persistent volumes for data storage across Prometheus restarts. 73 | 74 | 3. **Alertmanager:** 75 | 76 | - Often deployed as a separate Pod alongside Prometheus. 77 | - Manages and dispatches alerts based on predefined rules and conditions. 78 | - Receives alerts from Prometheus for forwarding to various channels. 79 | 80 | 4. **Exporters:** 81 | 82 | - Agents or sidecar containers exposing metrics from Kubernetes pods or services. 83 | - Types include Node Exporter, kube-state-metrics, and others for collecting specific metrics. 84 | 85 | 5. **Service Discovery:** 86 | 87 | - Kubernetes ServiceMonitors facilitate automatic service discovery and monitoring based on labels. 88 | 89 | 6. **Grafana Integration:** 90 | - Used with Prometheus for advanced metric visualization. 91 | - Offers pre-configured dashboards for rich and customizable visual representations. 92 | 93 | ## Deploying Prometheus on Kubernetes 94 | 95 | To set up Prometheus and its related components on your Kubernetes cluster, follow these steps: 96 | 97 | ### Create a Namespace for Monitoring 98 | 99 | ```bash 100 | kubectl create namespace monitoring 101 | ``` 102 | 103 | ### Add Helm Repository 104 | 105 | ```bash 106 | helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 107 | helm repo update 108 | ``` 109 | 110 | ### Install kube-prometheus-stack Helm Chart in monitoring Namespace 111 | 112 | ```bash 113 | helm install prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring 114 | ``` 115 | 116 | ### Verify Deployment 117 | 118 | Wait for the deployment to complete, and then check the status: 119 | 120 | ```bash 121 | kubectl get pods -n monitoring 122 | ``` 123 | 124 | ### Access Prometheus Dashboard 125 | 126 | ```bash 127 | kubectl port-forward svc/prometheus-stack-prometheus -n monitoring 9090:9090 128 | ``` 129 | 130 | Open your web browser and navigate to **`http://localhost:9090`** to access the **Prometheus dashboard.** 131 | 132 |
133 | Image 134 |
135 | 136 | 137 | Remember to keep the port-forwarding terminal open as long as you need to access the dashboard. 138 | 139 | ### Access Grafana Dashboard 140 | 141 | Use the following command to port forward to the Grafana service: 142 | 143 | ```bash 144 | kubectl port-forward svc/prometheus-stack-grafana -n monitoring 8080:80 145 | ``` 146 | 147 | Open your web browser and navigate to **`http://localhost:8080.`** 148 | 149 |
150 | Grafana Security Login Authentication 151 |
152 | 153 | 154 | ### Login with the default credentials: 155 | 156 | **Username:** admin 157 | **Password:** (Retrieve the password using the following command): 158 | 159 | ```bash 160 | kubectl get secret prometheus-stack-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 --decode ; echo 161 | ``` 162 | Understand the grafana UI by yourself. The following resources can be helpful. 163 | 164 | 1. [Grafana Documentation](https://grafana.com/docs/) 165 | 2. [ YOUTUBE LINK: Grafana Setup & Simple Dashboard ](https://www.youtube.com/watch?v=EGgtJUjky8w) 166 | 167 | ## Begin Your Monitoring Journey! 🚀 168 | 169 | Start exploring system observability with Prometheus and Grafana. Learn from the [Grafana Documentation](https://grafana.com/docs/), set up Prometheus easily on Kubernetes, and join active communities. Whether you're experienced or new, keep learning to master these tools. Improve your systems and enjoy monitoring!📊👀😊 170 | 171 | -------------------------------------------------------------------------------- /promql-img/counter_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/promql-img/counter_example.png -------------------------------------------------------------------------------- /promql-img/gauge_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/promql-img/gauge_example.png -------------------------------------------------------------------------------- /promql-img/heatmap_histogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NotHarshhaa/Learning-Prometheus/c79e4344f3531c08ab27ced74bfa2b018d8fe4e0/promql-img/heatmap_histogram.png --------------------------------------------------------------------------------