### Steps for Downloading, Extracting, and Starting Prometheus, Node Exporter, Blackbox Exporter, and Alertmanager

#### Prerequisites
- Ensure `wget` and `tar` are installed on both VMs.
- Ensure you have the permissions required to download, extract, and run these binaries.
- The commands below pin specific release versions; substitute a different version number if you need another release.

#### VM-1 (Node Exporter)
1. **Download Node Exporter**
```bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
```

2. **Extract Node Exporter**
```bash
tar xvfz node_exporter-1.8.1.linux-amd64.tar.gz
```

3. **Start Node Exporter**
```bash
cd node_exporter-1.8.1.linux-amd64
./node_exporter &
```

#### VM-2 (Prometheus, Alertmanager, Blackbox Exporter)

##### Prometheus
1. **Download Prometheus**
```bash
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
```

2. **Extract Prometheus**
```bash
tar xvfz prometheus-2.52.0.linux-amd64.tar.gz
```

3. **Start Prometheus**
```bash
cd prometheus-2.52.0.linux-amd64
./prometheus --config.file=prometheus.yml &
```

##### Alertmanager
1. **Download Alertmanager**
```bash
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
```

2. **Extract Alertmanager**
```bash
tar xvfz alertmanager-0.27.0.linux-amd64.tar.gz
```

3. **Start Alertmanager**
```bash
cd alertmanager-0.27.0.linux-amd64
./alertmanager --config.file=alertmanager.yml &
```

##### Blackbox Exporter
1. **Download Blackbox Exporter**
```bash
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
```

2. **Extract Blackbox Exporter**
```bash
tar xvfz blackbox_exporter-0.25.0.linux-amd64.tar.gz
```

3. **Start Blackbox Exporter**
```bash
cd blackbox_exporter-0.25.0.linux-amd64
./blackbox_exporter &
```

### Notes
- The `&` at the end of each start command runs the process in the background. A backgrounded process exits when the shell session ends, so use `nohup` or a systemd service if it should keep running after logout.
- Configure `prometheus.yml` and `alertmanager.yml` correctly before starting the services.
- Adjust firewall and security-group settings so the necessary ports are reachable: typically 9090 (Prometheus), 9093 (Alertmanager), 9115 (Blackbox Exporter), and 9100 (Node Exporter).

---

# Prometheus and Alertmanager Configuration

## Prometheus Configuration (`prometheus.yml`)

### Global Configuration
```yaml
global:
  scrape_interval: 15s      # Scrape targets every 15 seconds (default: 1 minute)
  evaluation_interval: 15s  # Evaluate rules every 15 seconds (default: 1 minute)
  # scrape_timeout is left at the global default (10s).
```

### Alertmanager Configuration
```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"  # Alertmanager endpoint
```

### Rule Files
```yaml
rule_files:
  - "alert_rules.yml"    # Path to the alert rules file
  # - "second_rules.yml" # Additional rule files can be listed here
```

### Scrape Configuration
#### Prometheus Itself
```yaml
scrape_configs:
  - job_name: "prometheus"  # Scrape Prometheus's own metrics

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]  # Target to scrape (Prometheus itself)
```

#### Node Exporter
```yaml
  - job_name: "node_exporter"  # Job for Node Exporter

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["3.110.195.114:9100"]  # Node Exporter endpoint
```

#### Blackbox Exporter
```yaml
  - job_name: "blackbox"   # Job for Blackbox Exporter probes
    metrics_path: /probe   # Blackbox probe path
    params:
      module: [http_2xx]   # Probe module: expect an HTTP 2xx response
    static_configs:
      - targets:
          - http://prometheus.io        # HTTP target
          - https://prometheus.io       # HTTPS target
          - http://3.110.195.114:8080/  # HTTP target on port 8080
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target    # Pass the target URL as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance          # Keep the probed URL as the instance label
      - target_label: __address__
        replacement: 13.235.248.225:9115  # Actually scrape the Blackbox Exporter itself
```

## Alert Rules Configuration (`alert_rules.yml`)

### Alert Rules Group
```yaml
groups:
  - name: alert_rules  # Name of the alert rules group
    rules:
      - alert: InstanceDown
        expr: up == 0  # Fires when a scrape target is down
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: WebsiteDown
        expr: probe_success == 0  # Fires when a blackbox probe fails
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "The website at {{ $labels.instance }} is down."
          summary: "Website down"

      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 25  # Less than 25% of memory available
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 25% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 50  # Less than 50% disk space left on /
        for: 1s
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 50% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      - alert: HostHighCpuLoad
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node_exporter",mode="idle"}[5m])) * 100) > 80  # CPU usage above 80% (100 minus idle percentage)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host high CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      - alert: ServiceUnavailable
        expr: up{job="node_exporter"} == 0  # Fires when the node_exporter job is down
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service Unavailable (instance {{ $labels.instance }})"
          description: "The service {{ $labels.job }} is not available\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      - alert: HighMemoryUsage
        expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) * 100 > 90  # Active memory above 90% of total
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High Memory Usage (instance {{ $labels.instance }})"
          description: "Memory usage is > 90%\n VALUE = {{ $value }}\n LABELS: {{
            $labels }}"

      - alert: FileSystemFull
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10  # Less than 10% free space on any filesystem
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "File System Almost Full (instance {{ $labels.instance }})"
          description: "File system has < 10% free space\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

## Alertmanager Configuration (`alertmanager.yml`)

### Routing Configuration
```yaml
route:
  group_by: ['alertname']  # Group alerts by alert name
  group_wait: 30s          # Wait before sending the first notification for a group
  group_interval: 5m       # Wait before notifying about new alerts added to a group
  repeat_interval: 1h      # Wait before re-sending a notification that was already sent
  receiver: 'email-notifications'  # Default receiver

receivers:
  - name: 'email-notifications'  # Receiver name
    email_configs:
      - to: jaiswaladi246@gmail.com        # Email recipient
        from: test@gmail.com               # Email sender
        smarthost: smtp.gmail.com:587      # SMTP server
        auth_username: your_email          # SMTP auth username
        auth_identity: your_email          # SMTP auth identity
        auth_password: "your_app_password" # SMTP auth password (for Gmail, an app password; never commit real credentials)
        send_resolved: true                # Also notify when alerts resolve
```

### Inhibition Rules
```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'  # Severity of the inhibiting (source) alert
    target_match:
      severity: 'warning'   # Severity of the inhibited (target) alert
    equal: ['alertname', 'dev', 'instance']  # Labels that must match between source and target
```

---

This documentation explains the Prometheus and Alertmanager configuration files above: global settings, scrape jobs, alert rules, email notifications, and inhibition rules.
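The blackbox scrape job selects the `http_2xx` module, which is defined not in `prometheus.yml` but in the Blackbox Exporter's own `blackbox.yml`; the release tarball ships a default file that already includes this module. A minimal sketch of such a module definition for reference (the `timeout` and `preferred_ip_protocol` values here are illustrative assumptions, not taken from the shipped default):

```yaml
# blackbox.yml — module definitions for the Blackbox Exporter (sketch).
modules:
  http_2xx:
    prober: http        # Probe over HTTP; any 2xx response counts as success by default
    timeout: 5s         # Illustrative value; adjust to your probe targets
    http:
      preferred_ip_protocol: "ip4"  # Illustrative; prefer IPv4 when resolving targets
```

After editing any of these files, they can be sanity-checked before restarting the services with `promtool check config prometheus.yml`, `promtool check rules alert_rules.yml`, and `amtool check-config alertmanager.yml`.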