├── LICENSE
├── README.md
├── bgp-analyzer
│   ├── Dockerfile
│   ├── bgp_analyzer.py
│   └── requirements.txt
├── canary
│   ├── dns
│   │   ├── Dockerfile
│   │   ├── dns_canary.py
│   │   └── requirements.txt
│   ├── http
│   │   ├── Dockerfile
│   │   ├── http_canary.py
│   │   └── requirements.txt
│   └── ping
│       ├── Dockerfile
│       ├── ping_canary.py
│       └── requirements.txt
├── infra
│   ├── docker-compose.yml
│   └── prometheus.yml
└── kafka-consumer
    ├── Dockerfile
    ├── kafka_consumer.py
    └── requirements.txt

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2025 Shankar K.
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Network Observability Platform (Open Source ThousandEyes Alternative)
2 | 
3 | ## Introduction
4 | 
5 | ### What is this?
6 | This project provides a distributed system for synthetic network monitoring and BGP analysis, inspired by platforms like ThousandEyes. It allows you to:
7 | * Run automated tests (canaries) from various Points of Presence (POPs) to measure the reachability and performance of internet services using common network protocols:
8 |     * **Ping (ICMP):** Measures round-trip time (RTT) and packet loss (`canary/ping/`).
9 |     * **DNS:** Checks resolution time and correctness for various record types (`canary/dns/`).
10 |     * **HTTP(S):** Verifies availability, response time, status codes, and optional content matching (`canary/http/`).
11 |     * **Traceroute:** (Future) Maps the network path between the canary and the target, measuring hop-by-hop latency.
12 | * Passively monitor BGP updates for specified prefixes using public data streams (`bgp-analyzer/`).
13 | 
14 | Test results from canaries are collected centrally via Kafka, processed into Prometheus metrics, and visualized on dashboards. BGP events are monitored separately and also exposed as Prometheus metrics.
15 | 
16 | ### Why build this?
17 | Understanding how your services perform and how they are reachable from different network vantage points is crucial. This platform helps you:
18 | * **Gain User-Perspective Visibility:** See performance as your users or customers might experience it from different geographic locations or networks.
19 | * **Monitor Network Routing:** Observe BGP announcements and withdrawals affecting your important prefixes.
20 | * **Proactive Issue Detection:** Identify connectivity problems, latency spikes, service outages, or potential routing issues (like hijacks) before they impact a large number of users.
21 | * **Troubleshoot Connectivity:** Understand network paths and pinpoint potential bottlenecks or routing issues.
22 | * **Verify SLAs:** Monitor uptime and performance against service level agreements.
23 | * **Performance Benchmarking:** Track performance trends over time.
24 | 
25 | ### Core Concepts
26 | * **Synthetic Monitoring:** Instead of waiting for real user traffic to reveal problems, we *synthetically* generate test traffic (like pings or HTTP requests) at regular intervals to proactively check service health.
27 | * **Canary / Probe:** An automated script or agent responsible for running a specific type of test (e.g., a Ping Canary) from a particular location.
28 | * **Point of Presence (POP):** The physical or logical location where a canary agent runs. This could be a data center, a cloud region edge, an office network, or even a user's machine.
29 | * **BGP Monitoring:** Passively listening to data streams from BGP collectors (like RouteViews) to observe routing table updates (announcements, withdrawals) for specific network prefixes.
30 | 
31 | ## Architecture Deep Dive
32 | 
33 | ### Data Flow Diagram
34 | 
35 | ```mermaid
36 | graph TD
37 |     subgraph POP Locations
38 |         direction LR
39 |         subgraph POP 1
40 |             PingCanary[Ping Canary]
41 |         end
42 |         subgraph POP 2
43 |             DNSCanary[DNS Canary]
44 |         end
45 |         subgraph POP 3
46 |             HTTPCanary[HTTP Canary]
47 |         end
48 |         subgraph POP N
49 |             OtherCanaries[...]
50 |         end
51 |     end
52 | 
53 |     subgraph Central Services
54 |         direction LR
55 |         Kafka(Kafka Topic\n'canary-results')
56 |         Consumer(Kafka Consumer\nService)
57 |         BGPAnalyzer(BGP Analyzer\nService)
58 |         Prometheus(Prometheus)
59 |         Grafana(Grafana)
60 |         Alertmanager(Alertmanager)
61 |     end
62 | 
63 |     PingCanary -- JSON --> Kafka
64 |     DNSCanary -- JSON --> Kafka
65 |     HTTPCanary -- JSON --> Kafka
66 |     OtherCanaries -- JSON --> Kafka
67 | 
68 |     Consumer -- Reads --> Kafka
69 |     %% The consumer and the BGP analyzer each expose a /metrics endpoint,
70 |     %% which Prometheus scrapes via the edges below
71 | 
72 |     Prometheus -- Scrapes --> Consumer
73 |     Prometheus -- Scrapes --> BGPAnalyzer
74 |     Prometheus -- Sends Alerts --> Alertmanager
75 |     Grafana -- Queries --> Prometheus
76 | 
77 |     style Kafka fill:#f9f,stroke:#333,stroke-width:2px
78 |     style Prometheus fill:#e6522c,stroke:#333,stroke-width:2px,color:#fff
79 |     style Grafana fill:#F9A825,stroke:#333,stroke-width:2px,color:#000
80 |     style Alertmanager fill:#ff7f0e,stroke:#333,stroke-width:2px,color:#fff
81 |     style BGPAnalyzer fill:#add8e6,stroke:#333,stroke-width:2px,color:#000
82 | ```
83 | 
84 | ### Component Roles & Rationale
85 | 
86 | * **Canaries (Python Scripts + Docker)**
87 |     * *Role:* Execute specific network tests (Ping, DNS, HTTP) at scheduled intervals against defined targets. Package results as structured JSON. Reside in the `canary/` subdirectories (e.g., `canary/ping/`, `canary/dns/`, `canary/http/`).
88 |     * *Why:* Running tests from multiple POPs gives a realistic view of global user experience. Docker ensures consistent runtime environments across diverse POPs. Python is chosen for its excellent networking libraries, ease of development, and widespread availability.
89 | * **Kafka (Message Broker)**
90 |     * *Role:* Acts as a central, highly available, and durable buffer receiving test results (JSON messages) from all canaries via a specific topic (e.g., `canary-results`).
91 | * *Why:* Decouples canaries from the backend processing. Canaries just need to send data to Kafka and don't need to know about the consumer(s). This handles bursts of data, prevents data loss if the consumer is temporarily down, and allows for future expansion with multiple consumers. 92 | * **Kafka Consumer (Python Service)** 93 | * *Role:* Reads the JSON results from the Kafka topic (`canary-results`). Parses the data and translates it into metrics (e.g., latency gauges, status counters) exposed via an HTTP endpoint (`/metrics`) in a format Prometheus can understand. Located in `kafka-consumer/`. 94 | * *Why:* Acts as the crucial bridge between the event-driven Kafka stream from canaries and Prometheus's pull-based metric scraping model. It centralizes the logic for interpreting canary results and defining the corresponding Prometheus metrics. 95 | * **BGP Analyzer (Python Service)** 96 | * *Role:* Connects to public BGP data streams (e.g., RouteViews, RIPE RIS) using `pybgpstream`. Monitors announcements and withdrawals for configured network prefixes. Exposes BGP event counts and timestamps as Prometheus metrics via an HTTP endpoint (`/metrics`). Located in `bgp-analyzer/`. 97 | * *Why:* Provides visibility into internet routing changes affecting key prefixes. Runs as a separate central service because BGP data is typically consumed from collectors, not generated *at* individual POPs like canary tests. Exposing metrics allows correlation with canary data and alerting on routing anomalies. 98 | * **Prometheus (Metrics Database & Alerting Engine)** 99 | * *Role:* Periodically "scrapes" (fetches) the metrics from the Kafka Consumer's `/metrics` endpoint and the BGP Analyzer's `/metrics` endpoint. Stores these metrics efficiently as time-series data. Evaluates predefined alerting rules based on the collected metrics. 100 | * *Why:* An industry standard for time-series metrics and alerting. Its pull model simplifies the consumer/analyzer services. PromQL provides a powerful language for querying data and defining complex alert conditions across both canary and BGP data. 101 | * **Alertmanager (Alert Routing & Management)** 102 | * *Role:* Receives alert notifications triggered by Prometheus. Handles deduplicating, grouping, and silencing alerts. Routes alerts to configured notification channels (e.g., Slack, PagerDuty, email - *configuration needed*). 103 | * *Why:* Separates the complex logic of alert notification from Prometheus. Provides a central point for managing alert state. 104 | * **Grafana (Visualization)** 105 | * *Role:* Queries Prometheus (using PromQL) and displays the time-series metrics on interactive dashboards featuring graphs, tables, heatmaps, etc. 106 | * *Why:* The de facto standard for visualizing observability data. Highly flexible, supports numerous panel types, and makes it easy to explore trends, correlate data from different sources (canaries, BGP), and identify anomalies visually. 107 | 108 | ## Getting Started: Running Locally with Docker Compose 109 | 110 | This setup allows you to run the entire backend infrastructure (Kafka, Prometheus, Grafana, etc.), the Kafka Consumer service, and the BGP Analyzer service on your local machine for development and testing. 111 | 112 | ### Prerequisites 113 | * **Docker Engine:** Install Docker for your OS. [Official Docker Installation Guide](https://docs.docker.com/engine/install/) 114 | * **Docker Compose:** Usually included with Docker Desktop. Verify installation. 
[Official Docker Compose Installation Guide](https://docs.docker.com/compose/install/)
115 | * **Git:** Needed to clone the repository. [Official Git Installation Guide](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
116 | * **(Optional but Recommended) Host Tools:** Install `fping`, `dig`, and `curl` on your host machine if you want to run canaries directly using Python (Method 2 below). On macOS: `brew install fping` (`dig` and `curl` already ship with macOS). On Debian/Ubuntu: `sudo apt-get update && sudo apt-get install fping dnsutils curl`.
117 | 
118 | ### Setup Steps
119 | 1. **Clone the Repository:**
120 |     ```bash
121 |     git clone https://github.com/shankar0123/network-observability-platform.git
122 |     cd network-observability-platform
123 |     ```
124 | 2. **Navigate to Infrastructure Directory:**
125 |     ```bash
126 |     cd infra
127 |     ```
128 | 3. **Launch the Stack:**
129 |     ```bash
130 |     docker-compose up --build -d
131 |     ```
132 |     * `--build`: Ensures the `kafka-consumer` and `bgp-analyzer` Docker images are built using their Dockerfiles. Needed the first time or when their source code/requirements change.
133 |     * `up`: Creates and starts all the service containers defined in `docker-compose.yml`.
134 |     * `-d`: Runs the containers in "detached" mode.
135 | 
136 | ### Verify Services
137 | * **Check Container Status:** Wait a minute or two for services to initialize, then run:
138 |     ```bash
139 |     docker-compose ps
140 |     ```
141 |     All services listed (`zookeeper`, `kafka`, `kafka-consumer`, `prometheus`, `alertmanager`, `grafana`, `bgp-analyzer`) should have a `State` of `Up` or `running`.
142 | * **Access Web UIs:** Open these URLs in your browser:
143 |     * **Prometheus:** `http://localhost:9090`
144 |     * **Grafana:** `http://localhost:3001` (Default Login: `admin` / `admin`)
145 |     * **Alertmanager:** `http://localhost:9094`
146 | * **Check Prometheus Targets:**
147 |     1. Go to the Prometheus UI (`http://localhost:9090`).
148 |     2. Navigate to `Status` -> `Targets`.
149 |     3. Look for the `kafka-consumer` and `bgp-analyzer` jobs. Their `State` should be `UP`. If `DOWN`, check the logs of the respective container (e.g., `docker-compose logs kafka-consumer`, `docker-compose logs bgp-analyzer`). The `bgp-analyzer` might take a short while to connect initially.
150 | 
151 | ## Running Canaries
152 | 
153 | Once the backend stack is running, run one or more canary agents to send data into the system.
154 | 
155 | ### Understanding Canary Configuration
156 | Canaries are configured primarily through environment variables:
157 | * `CANARY_ID`: Unique identifier (e.g., `ping-pop-lhr-01`, `dns-pop-nyc-02`).
158 | * `KAFKA_BROKER`: Kafka broker address(es).
159 | * `KAFKA_TOPIC`: Kafka topic (e.g., `canary-results`).
160 | * `TARGET_HOSTS` / `TARGET_DOMAINS` / `TARGET_URLS`: Comma-separated list of targets.
161 | * `*_INTERVAL_SECONDS`: How often to run tests.
162 | * Other type-specific options (e.g., `QUERY_TYPE`, `EXPECTED_STRING`).
163 | 
164 | ### Method 1: Run Canary via Docker (Recommended)
165 | Build and run the desired canary container, connecting it to the Docker network.
166 | 
167 | 1. **Build Image (Example: DNS Canary):**
168 |     ```bash
169 |     cd ../canary/dns # Navigate to the canary directory
170 |     docker build -t dns-canary:latest .
171 |     ```
172 | 2. 
**Run Container (Example: DNS Canary):** 173 | ```bash 174 | # Find network name: docker network ls | grep monitoring (likely 'infra_monitoring') 175 | NETWORK_NAME="infra_monitoring" # Replace if different 176 | 177 | docker run --rm --network=$NETWORK_NAME \ 178 | -e KAFKA_BROKER="kafka:9092" \ 179 | -e KAFKA_TOPIC="canary-results" \ 180 | -e CANARY_ID="dns-docker-test-01" \ 181 | -e TARGET_DOMAINS="example.com,wikipedia.org" \ 182 | -e QUERY_TYPE="AAAA" \ 183 | -e DNS_INTERVAL_SECONDS="45" \ 184 | dns-canary:latest 185 | ``` 186 | * Repeat for other canary types (e.g., `ping-canary`, `http-canary`), adjusting image name and environment variables. 187 | 188 | ### Method 2: Run Canary Locally using Python (Development/Debugging) 189 | Run the script directly on your host. Remember to set `KAFKA_BROKER` to the host-exposed port (`localhost:9093`). 190 | 191 | 1. **Navigate & Setup (Example: HTTP Canary):** 192 | ```bash 193 | cd ../canary/http 194 | python -m venv venv 195 | source venv/bin/activate # Or .\venv\Scripts\activate on Windows 196 | pip install -r requirements.txt 197 | ``` 198 | 2. **Configure & Run:** 199 | ```bash 200 | export KAFKA_BROKER="localhost:9093" # Host port! 201 | export CANARY_ID="http-local-dev-01" 202 | export TARGET_URLS="https://httpbin.org/status/200,https://httpbin.org/status/404" 203 | python http_canary.py 204 | ``` 205 | 206 | ## Exploring the Data 207 | 208 | 1. **Check Logs:** 209 | * Canary data: `docker-compose logs -f kafka-consumer` 210 | * BGP events: `docker-compose logs -f bgp-analyzer` 211 | 2. **Query Metrics in Prometheus (`http://localhost:9090/graph`):** 212 | * **Canary Latency:** `canary_latency_ms{type="ping", target="8.8.8.8"}` 213 | * **DNS Latency:** `dns_resolve_time_ms{target="example.com", query_type="AAAA"}` 214 | * **HTTP Status:** `http_status_code{target="https://httpbin.org/status/404"}` 215 | * **Content Match:** `http_content_match_status{canary_id="http-local-dev-01"}` 216 | * **Ping Failures:** `rate(canary_status_total{type="ping", status="FAILURE"}[5m])` 217 | * **BGP Announcements:** `rate(bgp_announcements_total{prefix="1.1.1.0/24"}[10m])` 218 | * **BGP Withdrawals:** `increase(bgp_withdrawals_total{prefix="8.8.8.0/24"}[1h])` 219 | * **BGP Analyzer Freshness:** `time() - bgp_last_event_timestamp_seconds` (Shows seconds since last event per project) 220 | 3. **Visualize Data in Grafana (`http://localhost:3001`):** 221 | * Add Prometheus Data Source (URL: `http://prometheus:9090`, Access: `Server`). 222 | * Create dashboards combining canary metrics (latency, status) and BGP metrics (announcements, withdrawals) using the queries above. 223 | 224 | ## Next Steps & Future Development 225 | * **Implement Traceroute Canary:** Develop the `canary/traceroute/` component using `mtr` or `scamper`. Update the consumer to handle its complex output (hop-by-hop data). 226 | * **Build Grafana Dashboards:** Create comprehensive dashboards showing global maps, latency/status summaries per test type, BGP event timelines, etc. 227 | * **Configure Alerting:** Define meaningful alert rules in Prometheus (e.g., high latency, low success rate, specific HTTP errors, unexpected BGP withdrawals, BGP analyzer staleness). Configure Alertmanager to route alerts. 228 | * **Deployment to POPs:** Document deploying canary containers to actual POP locations. 229 | * **Security:** Implement Kafka security, add authentication to UIs, secure Grafana login. 
230 | * **Persistence:** Configure persistent volumes for Kafka/Zookeeper in `docker-compose.yml` if needed.
231 | * **Refine BGP Analyzer:** Add more sophisticated analysis (path changes, RPKI validation), consider internal BGP feeds.
232 | 
233 | ## Troubleshooting
234 | * **`docker-compose up` fails:** Check for port conflicts, ensure the Docker daemon is running, check network connectivity.
235 | * **Prometheus Target `kafka-consumer` or `bgp-analyzer` is DOWN:** Check container logs (`docker-compose logs <service-name>`). Verify Docker networking. Ensure the Prometheus scrape config matches the service name and port.
236 | * **No data in Grafana/Prometheus:** Is a canary running? Is the consumer processing messages (check logs)? Is the BGP analyzer running and receiving data (check logs)? Is Prometheus scraping targets successfully? Are PromQL queries correct? Is the Grafana time range correct?
237 | * **BGP Analyzer Issues:** `pybgpstream` can be sensitive to network conditions or collector availability. Check its logs for connection errors. Ensure monitored prefixes are valid.
238 | 
239 | ## 🪪 License
240 | 
241 | This project is licensed under the [MIT License](LICENSE).
242 | 
--------------------------------------------------------------------------------
/bgp-analyzer/Dockerfile:
--------------------------------------------------------------------------------
1 | # Use an official Python runtime as a parent image
2 | FROM python:3.10-slim
3 | 
4 | # pybgpstream often requires C build tools and potentially libbz2
5 | RUN apt-get update && apt-get install -y --no-install-recommends \
6 |     build-essential \
7 |     libbz2-dev \
8 |     && rm -rf /var/lib/apt/lists/*
9 | 
10 | # Set the working directory in the container
11 | WORKDIR /app
12 | 
13 | # Copy the requirements file into the container at /app
14 | COPY requirements.txt .
15 | 
16 | # Install any needed packages specified in requirements.txt
17 | RUN pip install --no-cache-dir -r requirements.txt
18 | 
19 | # Copy the current directory contents into the container at /app
20 | COPY bgp_analyzer.py .
21 | 
22 | # Make port 8001 available for Prometheus scraping
23 | EXPOSE 8001
24 | 
25 | # Define environment variables (can be overridden at runtime)
26 | ENV PREFIXES_TO_MONITOR="1.1.1.0/24,8.8.8.0/24"
27 | ENV BGPSTREAM_PROJECTS="routeviews,ris"
28 | # Default to live stream if BGPSTREAM_START_TIME is empty
29 | ENV BGPSTREAM_START_TIME=""
30 | ENV PROMETHEUS_PORT="8001"
31 | ENV LOG_LEVEL="INFO"
32 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr
33 | ENV PYTHONUNBUFFERED=1
34 | 
35 | # Run bgp_analyzer.py when the container launches
36 | CMD ["python", "bgp_analyzer.py"]
37 | 
--------------------------------------------------------------------------------
/bgp-analyzer/bgp_analyzer.py:
--------------------------------------------------------------------------------
1 | import pybgpstream
2 | import time
3 | import os
4 | import logging
5 | import threading
6 | from prometheus_client import start_http_server, Counter, Gauge
7 | from dotenv import load_dotenv
8 | from datetime import datetime, timezone
9 | 
10 | # --- Configuration ---
11 | load_dotenv()
12 | 
13 | # Comma-separated list of prefixes to monitor
14 | PREFIXES_TO_MONITOR = os.getenv('PREFIXES_TO_MONITOR', '1.1.1.0/24,8.8.8.0/24').split(',')
15 | # Comma-separated list of BGPStream projects (e.g., routeviews, ris)
16 | BGPSTREAM_PROJECTS = os.getenv('BGPSTREAM_PROJECTS', 'routeviews,ris').split(',')
17 | # Optional: Start time for BGPStream (e.g., 'YYYY-MM-DD HH:MM:SS'). If empty, uses live data.
18 | BGPSTREAM_START_TIME = os.getenv('BGPSTREAM_START_TIME', '')
19 | # Port for exposing Prometheus metrics
20 | PROMETHEUS_PORT = int(os.getenv('PROMETHEUS_PORT', '8001'))
21 | # Log level
22 | LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO').upper()
23 | 
24 | logging.basicConfig(level=LOG_LEVEL, format='%(asctime)s - %(levelname)s - %(message)s')
25 | 
26 | # --- Prometheus Metrics Definition ---
27 | BGP_ANNOUNCEMENTS_TOTAL = Counter(
28 |     'bgp_announcements_total',
29 |     'Total number of BGP announcements observed for monitored prefixes',
30 |     ['prefix', 'peer_asn', 'origin_as', 'project']
31 | )
32 | 
33 | BGP_WITHDRAWALS_TOTAL = Counter(
34 |     'bgp_withdrawals_total',
35 |     'Total number of BGP withdrawals observed for monitored prefixes',
36 |     ['prefix', 'peer_asn', 'project']
37 | )
38 | 
39 | # Gauge to track the timestamp of the last processed BGP event (useful for monitoring staleness)
40 | BGP_LAST_EVENT_TIMESTAMP = Gauge(
41 |     'bgp_last_event_timestamp_seconds',
42 |     'Unix timestamp of the last processed BGP event',
43 |     ['project']
44 | )
45 | 
46 | # --- BGPStream Processing Function ---
47 | def process_bgp_stream():
48 |     """Connects to BGPStream and processes elements."""
49 |     logging.info("Starting BGPStream processing...")
50 |     logging.info(f"Monitoring Prefixes: {PREFIXES_TO_MONITOR}")
51 |     logging.info(f"Using Projects: {BGPSTREAM_PROJECTS}")
52 | 
53 |     while True: # Keep trying to connect/reconnect
54 |         try:
55 |             # Create a new BGPStream instance
56 |             stream = pybgpstream.BGPStream(
57 |                 # Use live data when no start time is given, or replay from a start time
58 |                 from_time=BGPSTREAM_START_TIME if BGPSTREAM_START_TIME else None,
59 |                 until_time=None, # None means live/until now
60 |                 projects=BGPSTREAM_PROJECTS, # BGPStream projects (e.g., routeviews, ris)
61 |                 record_type="updates", # Process announcements and withdrawals
62 |                 filter=" or ".join([f"prefix exact {p}" for p in PREFIXES_TO_MONITOR]) # OR together exact-prefix filters
63 |             )
64 | 
65 |             logging.info("BGPStream instance created and filters applied.")
66 | 
67 |             for elem in stream:
68 |                 timestamp = time.time() # Record processing time
69 |                 project = elem.collector # Identify the source project/collector
70 | 
71 |                 BGP_LAST_EVENT_TIMESTAMP.labels(project=project).set(timestamp)
72 | 
73 |                 prefix = str(elem.fields.get('prefix'))
74 |                 peer_asn = str(elem.peer_asn)
75 | 
76 |                 # Check if the element's prefix is one we are monitoring
77 |                 # Note: the BGPStream filter should handle this, but double-check
78 |                 if prefix not in PREFIXES_TO_MONITOR:
79 |                     # Defensive: skip anything the filter let through that we don't track
80 |                     # logging.debug(f"Skipping element for non-monitored prefix {prefix}")
81 |                     continue
82 | 
83 |                 if elem.type == "A" or elem.type == "announce":
84 |                     origin_as = 'N/A'
85 |                     as_path = elem.fields.get("as-path", "")
86 |                     if as_path:
87 |                         # Origin AS is the last one in the path
88 |                         origin_as = as_path.split(" ")[-1]
89 | 
90 |                     logging.info(f"ANNOUNCE: Prefix={prefix}, PeerAS={peer_asn}, OriginAS={origin_as}, Path={as_path}, Collector={project}")
91 |                     BGP_ANNOUNCEMENTS_TOTAL.labels(
92 |                         prefix=prefix,
93 |                         peer_asn=peer_asn,
94 |                         origin_as=origin_as,
95 |                         project=project
96 |                     ).inc()
97 | 
98 |                 elif elem.type == "W" or elem.type == "withdraw":
99 |                     logging.info(f"WITHDRAW: Prefix={prefix}, PeerAS={peer_asn}, Collector={project}")
100 |                     BGP_WITHDRAWALS_TOTAL.labels(
101 |                         prefix=prefix,
102 |                         peer_asn=peer_asn,
103 |                         project=project
104 |                     ).inc()
105 |                 else:
106 |                     logging.debug(f"Ignoring BGP element of type: {elem.type}")
107 | 
108 | except Exception as e: 109 | logging.exception("Error during BGPStream processing. Restarting stream in 30 seconds...") 110 | time.sleep(30) # Wait before attempting to restart 111 | 112 | # --- Main --- 113 | def main(): 114 | logging.info(f"Starting BGP Analyzer Service") 115 | logging.info(f"Exposing Prometheus metrics on port {PROMETHEUS_PORT}") 116 | 117 | # Start Prometheus HTTP server in a background thread 118 | try: 119 | start_http_server(PROMETHEUS_PORT) 120 | logging.info("Prometheus metrics server started.") 121 | except Exception as e: 122 | logging.exception(f"Failed to start Prometheus server on port {PROMETHEUS_PORT}") 123 | return # Exit if we can't expose metrics 124 | 125 | # Start BGP processing in the main thread (or optionally another thread) 126 | process_bgp_stream() 127 | 128 | 129 | if __name__ == "__main__": 130 | try: 131 | main() 132 | except KeyboardInterrupt: 133 | logging.info("BGP Analyzer stopped by user.") 134 | except Exception as e: 135 | logging.exception("Unhandled exception in BGP Analyzer main.") 136 | finally: 137 | logging.info("BGP Analyzer shutting down.") 138 | -------------------------------------------------------------------------------- /bgp-analyzer/requirements.txt: -------------------------------------------------------------------------------- 1 | pybgpstream 2 | prometheus_client 3 | python-dotenv 4 | -------------------------------------------------------------------------------- /canary/dns/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use an official Python runtime as a parent image 2 | FROM python:3.10-slim 3 | 4 | # Set the working directory in the container 5 | WORKDIR /app 6 | 7 | # Copy the requirements file into the container at /app 8 | COPY requirements.txt . 9 | 10 | # Install any needed packages specified in requirements.txt 11 | # Using --no-cache-dir can make the image smaller 12 | RUN pip install --no-cache-dir -r requirements.txt 13 | 14 | # Copy the current directory contents into the container at /app 15 | COPY dns_canary.py . 
16 | 17 | # Define environment variables (can be overridden at runtime) 18 | ENV KAFKA_BROKER="kafka:9092" 19 | ENV KAFKA_TOPIC="canary-results" 20 | ENV CANARY_ID="dns-canary-default-01" 21 | ENV TARGET_DOMAINS="google.com,github.com" 22 | ENV DNS_INTERVAL_SECONDS="60" 23 | ENV QUERY_TYPE="A" 24 | # Empty TARGET_RESOLVER means use system default inside container 25 | ENV TARGET_RESOLVER="" 26 | ENV QUERY_TIMEOUT_SECONDS="5" 27 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr 28 | ENV PYTHONUNBUFFERED=1 29 | 30 | # Run dns_canary.py when the container launches 31 | CMD ["python", "dns_canary.py"] 32 | -------------------------------------------------------------------------------- /canary/dns/dns_canary.py: -------------------------------------------------------------------------------- 1 | import dns.resolver 2 | import json 3 | import os 4 | import time 5 | import logging 6 | from datetime import datetime, timezone 7 | from kafka import KafkaProducer 8 | from kafka.errors import NoBrokersAvailable 9 | from dotenv import load_dotenv 10 | 11 | # --- Configuration --- 12 | load_dotenv() 13 | 14 | KAFKA_BROKER = os.getenv('KAFKA_BROKER', 'localhost:9092') 15 | KAFKA_TOPIC = os.getenv('KAFKA_TOPIC', 'canary-results') 16 | CANARY_ID = os.getenv('CANARY_ID', 'dns-canary-local-01') 17 | # Comma-separated list of target domains 18 | TARGET_DOMAINS = os.getenv('TARGET_DOMAINS', 'google.com,github.com,cloudflare.com').split(',') 19 | DNS_INTERVAL_SECONDS = int(os.getenv('DNS_INTERVAL_SECONDS', '60')) 20 | QUERY_TYPE = os.getenv('QUERY_TYPE', 'A') # e.g., A, AAAA, CNAME, MX, NS, TXT 21 | # Optional: Specify a target resolver IP. If empty, use system default. 22 | TARGET_RESOLVER = os.getenv('TARGET_RESOLVER', '') # e.g., '8.8.8.8' 23 | QUERY_TIMEOUT_SECONDS = int(os.getenv('QUERY_TIMEOUT_SECONDS', '5')) 24 | 25 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 26 | 27 | # --- Kafka Producer Setup --- 28 | producer = None 29 | while producer is None: 30 | try: 31 | producer = KafkaProducer( 32 | bootstrap_servers=[KAFKA_BROKER], 33 | value_serializer=lambda v: json.dumps(v).encode('utf-8'), 34 | retries=5, 35 | request_timeout_ms=30000 36 | ) 37 | logging.info(f"Successfully connected to Kafka broker at {KAFKA_BROKER}") 38 | except NoBrokersAvailable: 39 | logging.error(f"Kafka broker at {KAFKA_BROKER} not available. Retrying in 10 seconds...") 40 | time.sleep(10) 41 | except Exception as e: 42 | logging.error(f"Error connecting to Kafka: {e}. Retrying in 10 seconds...") 43 | time.sleep(10) 44 | 45 | # --- DNS Query Function --- 46 | def perform_dns_query(domain, qtype, resolver_ip=None, timeout=5): 47 | """ 48 | Performs a DNS query using dnspython and returns results. 
49 | """ 50 | logging.info(f"Querying {domain} for {qtype} record (Resolver: {resolver_ip or 'System Default'})") 51 | result = { 52 | "status": "ERROR", 53 | "latency_ms": None, 54 | "answer": None, # List of strings from the answer 55 | "error_message": None, 56 | "resolver": resolver_ip or 'System Default' 57 | } 58 | resolver = dns.resolver.Resolver() 59 | resolver.timeout = timeout 60 | resolver.lifetime = timeout # Total time including retries 61 | 62 | if resolver_ip: 63 | resolver.nameservers = [resolver_ip] 64 | 65 | start_time = time.monotonic() 66 | try: 67 | answer = resolver.resolve(domain, qtype) 68 | end_time = time.monotonic() 69 | 70 | result["status"] = "SUCCESS" 71 | result["latency_ms"] = round((end_time - start_time) * 1000, 2) 72 | # Extract relevant parts of the answer, convert to string list 73 | result["answer"] = sorted([str(rdata) for rdata in answer]) 74 | logging.debug(f"DNS query for {domain} {qtype} successful: {result['answer']}") 75 | 76 | except dns.resolver.NXDOMAIN: 77 | end_time = time.monotonic() 78 | result["status"] = "FAILURE" 79 | result["latency_ms"] = round((end_time - start_time) * 1000, 2) 80 | result["error_message"] = "NXDOMAIN (Domain does not exist)" 81 | logging.warning(f"NXDOMAIN for {domain} {qtype}") 82 | except dns.resolver.Timeout: 83 | result["status"] = "TIMEOUT" 84 | result["error_message"] = f"Query timed out after {timeout} seconds" 85 | logging.error(f"Timeout querying {domain} {qtype}") 86 | except dns.resolver.NoNameservers as e: 87 | result["status"] = "ERROR" 88 | result["error_message"] = f"No nameservers available: {e}" 89 | logging.error(f"No nameservers for {domain} {qtype}: {e}") 90 | except dns.resolver.NoAnswer: 91 | end_time = time.monotonic() 92 | result["status"] = "FAILURE" # Technically successful query, but no answer of the requested type 93 | result["latency_ms"] = round((end_time - start_time) * 1000, 2) 94 | result["error_message"] = f"No {qtype} record found (NoAnswer)" 95 | logging.warning(f"No {qtype} record found for {domain}") 96 | except Exception as e: 97 | result["status"] = "ERROR" 98 | result["error_message"] = f"An unexpected error occurred: {e}" 99 | logging.exception(f"Unexpected error querying {domain} {qtype}") 100 | 101 | return result 102 | 103 | # --- Main Loop --- 104 | def main(): 105 | logging.info(f"Starting DNS Canary: {CANARY_ID}") 106 | logging.info(f"Targets: {TARGET_DOMAINS}") 107 | logging.info(f"Query Type: {QUERY_TYPE}") 108 | logging.info(f"Target Resolver: {TARGET_RESOLVER or 'System Default'}") 109 | logging.info(f"Interval: {DNS_INTERVAL_SECONDS}s") 110 | logging.info(f"Kafka Topic: {KAFKA_TOPIC}") 111 | 112 | while True: 113 | for domain in TARGET_DOMAINS: 114 | domain = domain.strip() 115 | if not domain: 116 | continue 117 | 118 | query_result = perform_dns_query(domain, QUERY_TYPE, TARGET_RESOLVER, QUERY_TIMEOUT_SECONDS) 119 | timestamp = datetime.now(timezone.utc).isoformat() 120 | 121 | message = { 122 | "type": "dns", 123 | "canary_id": CANARY_ID, 124 | "target": domain, # Use 'target' for consistency across canaries 125 | "timestamp": timestamp, 126 | "status": query_result["status"], 127 | "latency_ms": query_result["latency_ms"], 128 | "query_type": QUERY_TYPE, 129 | "resolver": query_result["resolver"], 130 | "answer": query_result["answer"], 131 | "error_message": query_result["error_message"] 132 | } 133 | 134 | try: 135 | future = producer.send(KAFKA_TOPIC, value=message) 136 | logging.info(f"Sent DNS result for {domain} to Kafka topic {KAFKA_TOPIC}") 137 | 
except Exception as e: 138 | logging.error(f"Failed to send message to Kafka: {e}") 139 | 140 | logging.info(f"Completed DNS query cycle. Sleeping for {DNS_INTERVAL_SECONDS} seconds...") 141 | time.sleep(DNS_INTERVAL_SECONDS) 142 | 143 | if __name__ == "__main__": 144 | try: 145 | main() 146 | except KeyboardInterrupt: 147 | logging.info("DNS Canary stopped by user.") 148 | finally: 149 | if producer: 150 | producer.flush() 151 | producer.close() 152 | logging.info("Kafka producer closed.") 153 | -------------------------------------------------------------------------------- /canary/dns/requirements.txt: -------------------------------------------------------------------------------- 1 | dnspython 2 | kafka-python 3 | python-dotenv 4 | -------------------------------------------------------------------------------- /canary/http/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use an official Python runtime as a parent image 2 | FROM python:3.10-slim 3 | 4 | # Set the working directory in the container 5 | WORKDIR /app 6 | 7 | # Copy the requirements file into the container at /app 8 | COPY requirements.txt . 9 | 10 | # Install any needed packages specified in requirements.txt 11 | RUN pip install --no-cache-dir -r requirements.txt 12 | 13 | # Copy the current directory contents into the container at /app 14 | COPY http_canary.py . 15 | 16 | # Define environment variables (can be overridden at runtime) 17 | ENV KAFKA_BROKER="kafka:9092" 18 | ENV KAFKA_TOPIC="canary-results" 19 | ENV CANARY_ID="http-canary-default-01" 20 | ENV TARGET_URLS="https://google.com,https://github.com" 21 | ENV HTTP_INTERVAL_SECONDS="60" 22 | ENV REQUEST_TIMEOUT_SECONDS="10" 23 | # Optional: Check for this string in response body 24 | ENV EXPECTED_STRING="" 25 | # Optional: Set custom User-Agent 26 | ENV USER_AGENT="" 27 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr 28 | ENV PYTHONUNBUFFERED=1 29 | 30 | # Run http_canary.py when the container launches 31 | CMD ["python", "http_canary.py"] 32 | -------------------------------------------------------------------------------- /canary/http/http_canary.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import os 4 | import time 5 | import logging 6 | from datetime import datetime, timezone 7 | from kafka import KafkaProducer 8 | from kafka.errors import NoBrokersAvailable 9 | from dotenv import load_dotenv 10 | 11 | # --- Configuration --- 12 | load_dotenv() 13 | 14 | KAFKA_BROKER = os.getenv('KAFKA_BROKER', 'localhost:9092') 15 | KAFKA_TOPIC = os.getenv('KAFKA_TOPIC', 'canary-results') 16 | CANARY_ID = os.getenv('CANARY_ID', 'http-canary-local-01') 17 | # Comma-separated list of target URLs 18 | TARGET_URLS = os.getenv('TARGET_URLS', 'https://google.com,https://github.com,https://httpbin.org/delay/1').split(',') 19 | HTTP_INTERVAL_SECONDS = int(os.getenv('HTTP_INTERVAL_SECONDS', '60')) 20 | REQUEST_TIMEOUT_SECONDS = int(os.getenv('REQUEST_TIMEOUT_SECONDS', '10')) 21 | # Optional: String to check for in the response body 22 | EXPECTED_STRING = os.getenv('EXPECTED_STRING', '') 23 | # Optional: User-Agent header 24 | USER_AGENT = os.getenv('USER_AGENT', f'NetworkObservabilityCanary/{CANARY_ID}') 25 | 26 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 27 | 28 | # --- Kafka Producer Setup --- 29 | producer = None 30 | while producer is None: 31 | try: 32 | producer = 
KafkaProducer( 33 | bootstrap_servers=[KAFKA_BROKER], 34 | value_serializer=lambda v: json.dumps(v).encode('utf-8'), 35 | retries=5, 36 | request_timeout_ms=30000 37 | ) 38 | logging.info(f"Successfully connected to Kafka broker at {KAFKA_BROKER}") 39 | except NoBrokersAvailable: 40 | logging.error(f"Kafka broker at {KAFKA_BROKER} not available. Retrying in 10 seconds...") 41 | time.sleep(10) 42 | except Exception as e: 43 | logging.error(f"Error connecting to Kafka: {e}. Retrying in 10 seconds...") 44 | time.sleep(10) 45 | 46 | # --- HTTP Request Function --- 47 | def perform_http_request(url, timeout=10, expected_string="", user_agent=""): 48 | """ 49 | Performs an HTTP GET request and returns results. 50 | """ 51 | logging.info(f"Requesting URL: {url}") 52 | result = { 53 | "status": "ERROR", 54 | "latency_ms": None, 55 | "status_code": None, 56 | "response_size_bytes": None, 57 | "content_match": None, # True, False, or None if not checked 58 | "error_message": None, 59 | } 60 | headers = {'User-Agent': user_agent} if user_agent else {} 61 | 62 | start_time = time.monotonic() 63 | try: 64 | response = requests.get(url, timeout=timeout, headers=headers, allow_redirects=True) 65 | end_time = time.monotonic() 66 | 67 | result["latency_ms"] = round((end_time - start_time) * 1000, 2) 68 | result["status_code"] = response.status_code 69 | result["response_size_bytes"] = len(response.content) 70 | 71 | if response.ok: # Status code < 400 72 | result["status"] = "SUCCESS" 73 | if expected_string: 74 | if expected_string in response.text: 75 | result["content_match"] = True 76 | logging.debug(f"Expected string '{expected_string}' found in response from {url}") 77 | else: 78 | result["content_match"] = False 79 | result["status"] = "FAILURE" # Mark as failure if content doesn't match 80 | result["error_message"] = f"Expected string '{expected_string}' not found in response" 81 | logging.warning(f"Expected string not found in response from {url}") 82 | else: 83 | result["status"] = "FAILURE" 84 | result["error_message"] = f"HTTP Error: {response.status_code} {response.reason}" 85 | logging.warning(f"HTTP error for {url}: {result['error_message']}") 86 | 87 | except requests.exceptions.Timeout: 88 | result["status"] = "TIMEOUT" 89 | result["error_message"] = f"Request timed out after {timeout} seconds" 90 | logging.error(f"Timeout requesting {url}") 91 | except requests.exceptions.RequestException as e: 92 | result["status"] = "ERROR" 93 | # Extract a more specific error if possible 94 | result["error_message"] = f"Request failed: {type(e).__name__} - {e}" 95 | logging.error(f"Request failed for {url}: {result['error_message']}") 96 | except Exception as e: 97 | result["status"] = "ERROR" 98 | result["error_message"] = f"An unexpected error occurred: {e}" 99 | logging.exception(f"Unexpected error requesting {url}") 100 | 101 | return result 102 | 103 | # --- Main Loop --- 104 | def main(): 105 | logging.info(f"Starting HTTP Canary: {CANARY_ID}") 106 | logging.info(f"Target URLs: {TARGET_URLS}") 107 | logging.info(f"Interval: {HTTP_INTERVAL_SECONDS}s") 108 | logging.info(f"Kafka Topic: {KAFKA_TOPIC}") 109 | if EXPECTED_STRING: 110 | logging.info(f"Checking for string: '{EXPECTED_STRING}'") 111 | 112 | while True: 113 | for url in TARGET_URLS: 114 | url = url.strip() 115 | if not url: 116 | continue 117 | 118 | request_result = perform_http_request(url, REQUEST_TIMEOUT_SECONDS, EXPECTED_STRING, USER_AGENT) 119 | timestamp = datetime.now(timezone.utc).isoformat() 120 | 121 | message = { 122 | "type": 
"http", 123 | "canary_id": CANARY_ID, 124 | "target": url, # Use 'target' for consistency 125 | "timestamp": timestamp, 126 | "status": request_result["status"], 127 | "latency_ms": request_result["latency_ms"], 128 | "status_code": request_result["status_code"], 129 | "response_size_bytes": request_result["response_size_bytes"], 130 | "content_match": request_result["content_match"], 131 | "error_message": request_result["error_message"] 132 | } 133 | 134 | try: 135 | future = producer.send(KAFKA_TOPIC, value=message) 136 | logging.info(f"Sent HTTP result for {url} to Kafka topic {KAFKA_TOPIC}") 137 | except Exception as e: 138 | logging.error(f"Failed to send message to Kafka: {e}") 139 | 140 | logging.info(f"Completed HTTP request cycle. Sleeping for {HTTP_INTERVAL_SECONDS} seconds...") 141 | time.sleep(HTTP_INTERVAL_SECONDS) 142 | 143 | if __name__ == "__main__": 144 | try: 145 | main() 146 | except KeyboardInterrupt: 147 | logging.info("HTTP Canary stopped by user.") 148 | finally: 149 | if producer: 150 | producer.flush() 151 | producer.close() 152 | logging.info("Kafka producer closed.") 153 | -------------------------------------------------------------------------------- /canary/http/requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | kafka-python 3 | python-dotenv 4 | -------------------------------------------------------------------------------- /canary/ping/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use an official Python runtime as a parent image 2 | FROM python:3.10-slim 3 | 4 | # Install fping 5 | RUN apt-get update && apt-get install -y fping --no-install-recommends && \ 6 | rm -rf /var/lib/apt/lists/* 7 | 8 | # Set the working directory in the container 9 | WORKDIR /app 10 | 11 | # Copy the requirements file into the container at /app 12 | COPY requirements.txt . 13 | 14 | # Install any needed packages specified in requirements.txt 15 | RUN pip install --no-cache-dir -r requirements.txt 16 | 17 | # Copy the current directory contents into the container at /app 18 | COPY ping_canary.py . 
19 | 20 | # Make port 9092 available to the world outside this container (optional, Kafka client doesn't need inbound) 21 | # EXPOSE 9092 22 | 23 | # Define environment variables (can be overridden at runtime) 24 | ENV KAFKA_BROKER="kafka:9092" 25 | ENV KAFKA_TOPIC="canary-results" 26 | ENV CANARY_ID="ping-canary-default-01" 27 | ENV TARGET_HOSTS="8.8.8.8,1.1.1.1" 28 | ENV PING_INTERVAL_SECONDS="60" 29 | ENV PING_COUNT="5" 30 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr 31 | ENV PYTHONUNBUFFERED=1 32 | 33 | # Run ping_canary.py when the container launches 34 | CMD ["python", "ping_canary.py"] 35 | -------------------------------------------------------------------------------- /canary/ping/ping_canary.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import json 3 | import os 4 | import time 5 | import logging 6 | import re 7 | from datetime import datetime, timezone 8 | from kafka import KafkaProducer 9 | from kafka.errors import NoBrokersAvailable 10 | from dotenv import load_dotenv 11 | 12 | # --- Configuration --- 13 | load_dotenv() # Load .env file if present (for local development) 14 | 15 | KAFKA_BROKER = os.getenv('KAFKA_BROKER', 'localhost:9092') 16 | KAFKA_TOPIC = os.getenv('KAFKA_TOPIC', 'canary-results') 17 | CANARY_ID = os.getenv('CANARY_ID', 'ping-canary-local-01') 18 | # Comma-separated list of targets 19 | TARGET_HOSTS = os.getenv('TARGET_HOSTS', '8.8.8.8,1.1.1.1').split(',') 20 | PING_INTERVAL_SECONDS = int(os.getenv('PING_INTERVAL_SECONDS', '60')) 21 | PING_COUNT = int(os.getenv('PING_COUNT', '5')) # Number of pings per target per interval 22 | 23 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 24 | 25 | # --- Kafka Producer Setup --- 26 | producer = None 27 | while producer is None: 28 | try: 29 | producer = KafkaProducer( 30 | bootstrap_servers=[KAFKA_BROKER], 31 | value_serializer=lambda v: json.dumps(v).encode('utf-8'), 32 | retries=5, # Retry sending messages 33 | request_timeout_ms=30000 # Increase timeout 34 | ) 35 | logging.info(f"Successfully connected to Kafka broker at {KAFKA_BROKER}") 36 | except NoBrokersAvailable: 37 | logging.error(f"Kafka broker at {KAFKA_BROKER} not available. Retrying in 10 seconds...") 38 | time.sleep(10) 39 | except Exception as e: 40 | logging.error(f"Error connecting to Kafka: {e}. Retrying in 10 seconds...") 41 | time.sleep(10) 42 | 43 | 44 | # --- Ping Function --- 45 | def perform_ping(target): 46 | """ 47 | Performs ping measurement using fping and returns results. 
48 |     fping output (sent to stderr with -q):
49 |         Target Name : xmt/rcv/%loss = 5/5/0%, min/avg/max = 1.23/4.56/7.89
50 |     or if 100% loss:
51 |         Target Name : xmt/rcv/%loss = 5/0/100%
52 |     """
53 |     logging.info(f"Pinging target: {target} ({PING_COUNT} times)")
54 |     result = {
55 |         "status": "ERROR",
56 |         "rtt_avg_ms": None,
57 |         "packet_loss_percent": None,
58 |         "error_message": None
59 |     }
60 |     # fping command: quiet, count mode (-c, per-target summary), interval 20ms, timeout 500ms per ping
61 |     command = ["fping", "-q", "-c", str(PING_COUNT), "-p", "20", "-t", "500", target]
62 | 
63 |     try:
64 |         # fping sends summary results to stderr when using -q
65 |         process = subprocess.run(command, capture_output=True, text=True, check=False, timeout=10) # Added timeout
66 | 
67 |         if process.returncode > 1: # 0 = all reachable, 1 = some unreachable, >1 = error
68 |             result["error_message"] = f"fping command error (return code {process.returncode}): {process.stderr.strip()}"
69 |             logging.error(f"fping error for {target}: {result['error_message']}")
70 |             return result
71 | 
72 |         output = process.stderr.strip()
73 |         logging.debug(f"fping output for {target}: {output}")
74 | 
75 |         # Regex to parse fping output
76 |         # Example: google.com : xmt/rcv/%loss = 5/5/0%, min/avg/max = 10.7/11.0/11.6
77 |         # Example: 10.0.0.1 : xmt/rcv/%loss = 5/0/100%
78 |         match = re.search(
79 |             r": xmt/rcv/%loss = (\d+)/(\d+)/(\d+)%(?:, min/avg/max = ([0-9.]+)/([0-9.]+)/([0-9.]+))?",
80 |             output
81 |         )
82 | 
83 |         if match:
84 |             sent, recv, loss_percent, min_rtt, avg_rtt, max_rtt = match.groups()
85 |             result["packet_loss_percent"] = float(loss_percent)
86 | 
87 |             if int(recv) > 0 and avg_rtt is not None: # Check if avg_rtt was captured (not 100% loss)
88 |                 result["status"] = "SUCCESS"
89 |                 result["rtt_avg_ms"] = float(avg_rtt)
90 |             elif int(loss_percent) == 100:
91 |                 result["status"] = "FAILURE" # Target likely down
92 |                 result["error_message"] = "100% packet loss"
93 |                 logging.warning(f"100% packet loss for {target}")
94 |             else:
95 |                 # Should not happen with current fping flags, but handle defensively
96 |                 result["status"] = "ERROR"
97 |                 result["error_message"] = "Partial loss but couldn't parse RTT"
98 |                 logging.error(f"Partial loss but couldn't parse RTT for {target}: {output}")
99 | 
100 |         else:
101 |             result["error_message"] = f"Could not parse fping output: {output}"
102 |             logging.error(f"Could not parse fping output for {target}: {output}")
103 | 
104 |     except FileNotFoundError:
105 |         result["error_message"] = "fping command not found. Please install fping."
106 |         logging.critical("fping command not found. 
Please install fping.") 107 | # Consider exiting or disabling ping if fping is missing 108 | except subprocess.TimeoutExpired: 109 | result["error_message"] = f"fping command timed out after 10 seconds for target {target}" 110 | result["status"] = "TIMEOUT" 111 | logging.error(f"fping command timed out for {target}") 112 | except Exception as e: 113 | result["error_message"] = f"An unexpected error occurred during ping: {e}" 114 | logging.exception(f"Unexpected error pinging {target}") # Log full traceback 115 | 116 | return result 117 | 118 | # --- Main Loop --- 119 | def main(): 120 | logging.info(f"Starting Ping Canary: {CANARY_ID}") 121 | logging.info(f"Targets: {TARGET_HOSTS}") 122 | logging.info(f"Interval: {PING_INTERVAL_SECONDS}s") 123 | logging.info(f"Kafka Topic: {KAFKA_TOPIC}") 124 | 125 | while True: 126 | for target in TARGET_HOSTS: 127 | target = target.strip() # Remove leading/trailing whitespace 128 | if not target: 129 | continue 130 | 131 | ping_result = perform_ping(target) 132 | timestamp = datetime.now(timezone.utc).isoformat() 133 | 134 | message = { 135 | "type": "ping", 136 | "canary_id": CANARY_ID, 137 | "target": target, 138 | "timestamp": timestamp, 139 | "status": ping_result["status"], 140 | "rtt_avg_ms": ping_result["rtt_avg_ms"], 141 | "packet_loss_percent": ping_result["packet_loss_percent"], 142 | "error_message": ping_result["error_message"] 143 | } 144 | 145 | try: 146 | future = producer.send(KAFKA_TOPIC, value=message) 147 | # Optional: Wait for send confirmation (can slow down if Kafka is slow) 148 | # record_metadata = future.get(timeout=10) 149 | # logging.debug(f"Message sent to {record_metadata.topic} partition {record_metadata.partition}") 150 | logging.info(f"Sent ping result for {target} to Kafka topic {KAFKA_TOPIC}") 151 | except Exception as e: 152 | logging.error(f"Failed to send message to Kafka: {e}") 153 | # Consider buffering or other error handling here 154 | 155 | logging.info(f"Completed ping cycle. 
Sleeping for {PING_INTERVAL_SECONDS} seconds...") 156 | time.sleep(PING_INTERVAL_SECONDS) 157 | 158 | if __name__ == "__main__": 159 | try: 160 | main() 161 | except KeyboardInterrupt: 162 | logging.info("Ping Canary stopped by user.") 163 | finally: 164 | if producer: 165 | producer.flush() 166 | producer.close() 167 | logging.info("Kafka producer closed.") 168 | -------------------------------------------------------------------------------- /canary/ping/requirements.txt: -------------------------------------------------------------------------------- 1 | kafka-python 2 | python-dotenv 3 | -------------------------------------------------------------------------------- /infra/docker-compose.yml: -------------------------------------------------------------------------------- 1 | networks: 2 | monitoring: 3 | driver: bridge 4 | 5 | volumes: 6 | prometheus_data: {} 7 | grafana_data: {} 8 | # Kafka data volumes (optional, for persistence across restarts) 9 | # kafka_data: {} 10 | # zookeeper_data: {} 11 | # zookeeper_log: {} 12 | 13 | services: 14 | zookeeper: 15 | image: confluentinc/cp-zookeeper:7.3.2 # Using Confluent platform images 16 | container_name: zookeeper 17 | networks: 18 | - monitoring 19 | environment: 20 | ZOOKEEPER_CLIENT_PORT: 2181 21 | ZOOKEEPER_TICK_TIME: 2000 22 | # volumes: # Optional persistence 23 | # - zookeeper_data:/var/lib/zookeeper/data 24 | # - zookeeper_log:/var/lib/zookeeper/log 25 | 26 | kafka: 27 | image: confluentinc/cp-kafka:7.3.2 28 | container_name: kafka 29 | networks: 30 | - monitoring 31 | depends_on: 32 | - zookeeper 33 | ports: 34 | # Expose Kafka broker port to the host for potential external access if needed 35 | # Use 9092 for internal communication within docker network 36 | - "9093:9093" # Port for host access if needed (different from internal) 37 | environment: 38 | KAFKA_BROKER_ID: 1 39 | KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181' 40 | # Use kafka:9092 for internal communication within the Docker network 41 | KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT 42 | KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:9093 43 | KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1 44 | KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0 45 | KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1 # Required for newer Kafka versions 46 | KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1 # Required for newer Kafka versions 47 | # Auto-create topics (useful for development) 48 | KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true' 49 | # volumes: # Optional persistence 50 | # - kafka_data:/var/lib/kafka/data 51 | 52 | kafka-consumer: 53 | build: 54 | context: ../kafka-consumer # Path relative to docker-compose.yml 55 | dockerfile: Dockerfile 56 | container_name: kafka-consumer 57 | networks: 58 | - monitoring 59 | depends_on: 60 | - kafka 61 | environment: 62 | KAFKA_BROKER: 'kafka:9092' # Use internal service name 63 | KAFKA_TOPIC: 'canary-results' 64 | PROMETHEUS_PORT: '8000' 65 | PYTHONUNBUFFERED: 1 # Ensure logs appear immediately 66 | ports: 67 | - "8000:8000" # Expose Prometheus metrics port to host 68 | restart: unless-stopped 69 | 70 | prometheus: 71 | image: prom/prometheus:v2.45.0 72 | container_name: prometheus 73 | networks: 74 | - monitoring 75 | volumes: 76 | - ./prometheus.yml:/etc/prometheus/prometheus.yml 77 | - prometheus_data:/prometheus # Persistent storage for metrics 78 | command: 79 | - '--config.file=/etc/prometheus/prometheus.yml' 80 | - '--storage.tsdb.path=/prometheus' 81 | - 
'--web.console.libraries=/usr/share/prometheus/console_libraries' 82 | - '--web.console.templates=/usr/share/prometheus/consoles' 83 | - '--web.enable-lifecycle' # Allows reloading config via API 84 | ports: 85 | - "9090:9090" # Expose Prometheus UI 86 | restart: unless-stopped 87 | 88 | alertmanager: 89 | image: prom/alertmanager:v0.25.0 90 | container_name: alertmanager 91 | networks: 92 | - monitoring 93 | ports: 94 | - "9094:9093" # Expose Alertmanager UI (use different host port) 95 | # volumes: # Optional: Mount config file if needed later 96 | # - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml 97 | command: 98 | - '--config.file=/etc/alertmanager/alertmanager.yml' # Default path inside container 99 | - '--storage.path=/alertmanager' 100 | restart: unless-stopped 101 | # depends_on: # Not strictly needed for startup order, but good practice 102 | # - prometheus 103 | 104 | grafana: 105 | image: grafana/grafana-oss:9.5.3 106 | container_name: grafana 107 | networks: 108 | - monitoring 109 | ports: 110 | - "3001:3000" # Expose Grafana UI on host port 3001 111 | volumes: 112 | - grafana_data:/var/lib/grafana # Persistent storage for dashboards, etc. 113 | # Optional: Mount provisioning files for datasources/dashboards 114 | # - ./grafana/provisioning:/etc/grafana/provisioning 115 | environment: 116 | # Default login: admin/admin (change GF_SECURITY_ADMIN_PASSWORD for production) 117 | # GF_SECURITY_ADMIN_PASSWORD: 'your_secure_password' 118 | GF_AUTH_ANONYMOUS_ENABLED: "false" # Disable anonymous access 119 | # Optional: Configure Prometheus datasource automatically 120 | # GF_DATASOURCES_0_NAME: Prometheus 121 | # GF_DATASOURCES_0_TYPE: prometheus 122 | # GF_DATASOURCES_0_URL: http://prometheus:9090 123 | # GF_DATASOURCES_0_ACCESS: proxy 124 | # GF_DATASOURCES_0_IS_DEFAULT: true 125 | restart: unless-stopped 126 | depends_on: 127 | - prometheus 128 | 129 | bgp-analyzer: 130 | build: 131 | context: ../bgp-analyzer 132 | dockerfile: Dockerfile 133 | container_name: bgp-analyzer 134 | networks: 135 | - monitoring 136 | ports: 137 | - "8001:8001" # Expose Prometheus metrics port 138 | environment: 139 | # Optional: Override defaults here if needed 140 | # PREFIXES_TO_MONITOR: "192.0.2.0/24,203.0.113.0/24" 141 | # BGPSTREAM_PROJECTS: "routeviews" 142 | PROMETHEUS_PORT: "8001" 143 | LOG_LEVEL: "INFO" 144 | PYTHONUNBUFFERED: 1 145 | restart: unless-stopped 146 | -------------------------------------------------------------------------------- /infra/prometheus.yml: -------------------------------------------------------------------------------- 1 | global: 2 | scrape_interval: 15s # How frequently to scrape targets by default. 3 | evaluation_interval: 15s # How frequently to evaluate rules. 4 | 5 | # Alerting specifies runtime configuration for Alertmanager. 6 | alerting: 7 | alertmanagers: 8 | - static_configs: 9 | - targets: 10 | - alertmanager:9093 # Target Alertmanager instance 11 | 12 | # Load rules once and periodically evaluate them according to the global 'evaluation_interval'. 13 | # rule_files: 14 | # - "alert.rules.yml" # We'll add this later 15 | 16 | # A scrape configuration containing exactly one endpoint to scrape: 17 | scrape_configs: 18 | # The job name is added as a label `job=` to any timeseries scraped from this config. 19 | - job_name: 'kafka-consumer' 20 | # metrics_path defaults to '/metrics' 21 | # scheme defaults to 'http'. 
22 | static_configs: 23 | - targets: ['kafka-consumer:8000'] # Service name and port defined in docker-compose 24 | 25 | - job_name: 'bgp-analyzer' 26 | static_configs: 27 | - targets: ['bgp-analyzer:8001'] # Service name and port defined in docker-compose 28 | -------------------------------------------------------------------------------- /kafka-consumer/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use an official Python runtime as a parent image 2 | FROM python:3.10-slim 3 | 4 | # Set the working directory in the container 5 | WORKDIR /app 6 | 7 | # Copy the requirements file into the container at /app 8 | COPY requirements.txt . 9 | 10 | # Install any needed packages specified in requirements.txt 11 | RUN pip install --no-cache-dir -r requirements.txt 12 | 13 | # Copy the current directory contents into the container at /app 14 | COPY kafka_consumer.py . 15 | 16 | # Make port 8000 available to the world outside this container for Prometheus scraping 17 | EXPOSE 8000 18 | 19 | # Define environment variables (can be overridden at runtime) 20 | ENV KAFKA_BROKER="kafka:9092" 21 | ENV KAFKA_TOPIC="canary-results" 22 | ENV CONSUMER_GROUP_ID="canary-consumer-group-1" 23 | ENV PROMETHEUS_PORT="8000" 24 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr 25 | ENV PYTHONUNBUFFERED=1 26 | 27 | # Run kafka_consumer.py when the container launches 28 | CMD ["python", "kafka_consumer.py"] 29 | -------------------------------------------------------------------------------- /kafka-consumer/kafka_consumer.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import time 4 | import logging 5 | import threading 6 | from kafka import KafkaConsumer 7 | from kafka.errors import NoBrokersAvailable 8 | from prometheus_client import start_http_server, Gauge, Counter, Histogram 9 | from dotenv import load_dotenv 10 | 11 | # --- Configuration --- 12 | load_dotenv() 13 | 14 | KAFKA_BROKER = os.getenv('KAFKA_BROKER', 'localhost:9092') 15 | KAFKA_TOPIC = os.getenv('KAFKA_TOPIC', 'canary-results') 16 | CONSUMER_GROUP_ID = os.getenv('CONSUMER_GROUP_ID', 'canary-consumer-group-1') 17 | PROMETHEUS_PORT = int(os.getenv('PROMETHEUS_PORT', '8000')) 18 | 19 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 20 | 21 | # --- Prometheus Metrics Definition --- 22 | # Define labels that will be common across metrics 23 | LABELS = ['canary_id', 'target', 'type'] 24 | 25 | # Gauge for latency (e.g., RTT, HTTP response time) 26 | CANARY_LATENCY_MS = Gauge( 27 | 'canary_latency_ms', 28 | 'Latency of canary test in milliseconds', 29 | LABELS 30 | ) 31 | 32 | # Counter for test status (SUCCESS, FAILURE, ERROR, TIMEOUT) 33 | CANARY_STATUS_TOTAL = Counter( 34 | 'canary_status_total', 35 | 'Total count of canary test results by status', 36 | LABELS + ['status'] # Add status label here 37 | ) 38 | 39 | # Gauge for packet loss percentage (specific to ping) 40 | PING_PACKET_LOSS_PERCENT = Gauge( 41 | 'ping_packet_loss_percent', 42 | 'Packet loss percentage for ping canary tests', 43 | LABELS # Uses the standard labels 44 | ) 45 | 46 | # Gauge specifically for DNS resolve time (can be redundant with CANARY_LATENCY_MS but allows specific queries) 47 | DNS_RESOLVE_TIME_MS = Gauge( 48 | 'dns_resolve_time_ms', 49 | 'DNS resolve time in milliseconds', 50 | LABELS + ['query_type', 'resolver'] # Add DNS specific labels 51 | ) 52 | 53 | # --- HTTP Specific Metrics 
# --- HTTP Specific Metrics ---
HTTP_STATUS_CODE = Gauge(
    'http_status_code',
    'HTTP status code received from target',
    LABELS
)

HTTP_RESPONSE_SIZE_BYTES = Gauge(
    'http_response_size_bytes',
    'Size of the HTTP response body in bytes',
    LABELS
)

# Using a gauge for content match: 1=match, 0=mismatch, -1=not checked
HTTP_CONTENT_MATCH_STATUS = Gauge(
    'http_content_match_status',
    'Status of expected string match in HTTP response body (1=match, 0=mismatch, -1=not checked)',
    LABELS
)


# Optional: Histogram for latency distribution (more complex but powerful)
# CANARY_LATENCY_HISTOGRAM = Histogram(
#     'canary_latency_histogram_ms',
#     'Histogram of canary test latency in milliseconds',
#     LABELS,
#     buckets=[10, 50, 100, 250, 500, 1000, 2500, 5000, 10000]  # Example buckets
# )

# --- Kafka Consumer Setup ---
# Retry until the broker is reachable; under docker-compose the broker and
# this consumer may start in any order.
consumer = None
while consumer is None:
    try:
        consumer = KafkaConsumer(
            KAFKA_TOPIC,
            bootstrap_servers=[KAFKA_BROKER],
            group_id=CONSUMER_GROUP_ID,
            auto_offset_reset='latest',  # Start consuming from the latest message
            value_deserializer=lambda m: json.loads(m.decode('utf-8')),
            consumer_timeout_ms=1000  # Iteration stops after 1s without messages, letting the outer loop cycle
        )
        logging.info(f"Successfully connected to Kafka broker at {KAFKA_BROKER} and subscribed to topic {KAFKA_TOPIC}")
    except NoBrokersAvailable:
        logging.error(f"Kafka broker at {KAFKA_BROKER} not available. Retrying in 10 seconds...")
        time.sleep(10)
    except Exception as e:
        logging.error(f"Error connecting to Kafka: {e}. Retrying in 10 seconds...")
        time.sleep(10)

# --- Metrics Processing Function ---
def process_message(message):
    """Parses a single canary result message and updates Prometheus metrics."""
    try:
        data = message.value
        logging.debug(f"Received message: {data}")

        # Basic validation
        if not all(k in data for k in ['canary_id', 'target', 'type', 'status']):
            logging.warning(f"Skipping malformed message (missing required keys): {data}")
            return

        canary_id = data.get('canary_id', 'unknown')
        target = data.get('target', 'unknown')
        canary_type = data.get('type', 'unknown')
        status = data.get('status', 'ERROR').upper()  # Normalize status

        # Update status counter
        CANARY_STATUS_TOTAL.labels(
            canary_id=canary_id,
            target=target,
            type=canary_type,
            status=status
        ).inc()

        # Update latency if available; ERROR/TIMEOUT results carry no meaningful latency
        latency = data.get('rtt_avg_ms')  # Ping's key, checked first
        if latency is None:
            latency = data.get('latency_ms')  # Generic latency key used by other canaries

        if latency is not None and status in ['SUCCESS', 'FAILURE']:  # Only record latency for completed attempts
            try:
                latency_float = float(latency)
                CANARY_LATENCY_MS.labels(
                    canary_id=canary_id,
                    target=target,
                    type=canary_type
                ).set(latency_float)
                # Optional: update histogram
                # CANARY_LATENCY_HISTOGRAM.labels(...).observe(latency_float)
            except (ValueError, TypeError):
                logging.warning(f"Invalid latency value '{latency}' in message: {data}")
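        # A ping result is expected to look roughly like this (values are
        # illustrative; only the keys matter to this consumer):
        #   {"canary_id": "pop1-ping", "target": "8.8.8.8", "type": "ping",
        #    "status": "SUCCESS", "rtt_avg_ms": 23.4, "packet_loss_percent": 0.0}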
        # Update ping-specific metrics
        if canary_type == 'ping':
            loss = data.get('packet_loss_percent')
            if loss is not None:
                try:
                    loss_float = float(loss)
                    PING_PACKET_LOSS_PERCENT.labels(
                        canary_id=canary_id,
                        target=target,
                        type=canary_type  # Redundant here but keeps labels consistent
                    ).set(loss_float)
                except (ValueError, TypeError):
                    logging.warning(f"Invalid packet_loss_percent value '{loss}' in message: {data}")

        # --- DNS Specific Processing ---
        elif canary_type == 'dns':
            query_type = data.get('query_type', 'unknown')
            resolver = data.get('resolver', 'unknown')
            latency = data.get('latency_ms')  # DNS canaries use the 'latency_ms' key

            # Update the DNS-specific gauge if latency exists and the status is meaningful.
            # FAILURE is included because a query can resolve quickly yet return NXDOMAIN/NoAnswer.
            if latency is not None and status in ['SUCCESS', 'FAILURE']:
                try:
                    latency_float = float(latency)
                    DNS_RESOLVE_TIME_MS.labels(
                        canary_id=canary_id,
                        target=target,
                        type=canary_type,
                        query_type=query_type,
                        resolver=resolver
                    ).set(latency_float)
                except (ValueError, TypeError):
                    logging.warning(f"Invalid DNS latency value '{latency}' in message: {data}")
            # Note: the general CANARY_LATENCY_MS gauge was already updated above if latency was present

        # --- HTTP Specific Processing ---
        elif canary_type == 'http':
            # An HTTP result additionally carries, e.g. (illustrative values):
            #   {"status_code": 200, "response_size_bytes": 5123, "content_match": true}
            status_code = data.get('status_code')
            response_size = data.get('response_size_bytes')
            content_match = data.get('content_match')  # True, False, or None

            if status_code is not None:
                try:
                    HTTP_STATUS_CODE.labels(
                        canary_id=canary_id, target=target, type=canary_type
                    ).set(int(status_code))
                except (ValueError, TypeError):
                    logging.warning(f"Invalid status_code value '{status_code}' in message: {data}")

            if response_size is not None:
                try:
                    HTTP_RESPONSE_SIZE_BYTES.labels(
                        canary_id=canary_id, target=target, type=canary_type
                    ).set(int(response_size))
                except (ValueError, TypeError):
                    logging.warning(f"Invalid response_size_bytes value '{response_size}' in message: {data}")

            # Set the content match status gauge
            match_value = -1  # Default: not checked
            if content_match is True:
                match_value = 1
            elif content_match is False:
                match_value = 0
            HTTP_CONTENT_MATCH_STATUS.labels(
                canary_id=canary_id, target=target, type=canary_type
            ).set(match_value)

        # TODO: Add processing for traceroute canary type here
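        # A traceroute branch might slot in roughly like the commented sketch
        # below once that canary exists (the metric and the 'hops' key are
        # hypothetical, nothing emits them yet):
        # elif canary_type == 'traceroute':
        #     hops = data.get('hops')
        #     if hops is not None:
        #         TRACEROUTE_HOP_COUNT.labels(
        #             canary_id=canary_id, target=target, type=canary_type
        #         ).set(int(hops))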
    # Note: JSON decoding happens in the consumer's value_deserializer, so a
    # malformed payload raises in the consumer loop, not here; this handler
    # catches everything else so one bad message cannot kill the consumer.
    except Exception:
        logging.exception(f"Error processing message: {message.value}")  # Log full traceback

# --- Main Loop ---
def main():
    logging.info(f"Starting Kafka Consumer for topic {KAFKA_TOPIC}")
    logging.info(f"Exposing Prometheus metrics on port {PROMETHEUS_PORT}")

    # start_http_server is non-blocking: it serves /metrics from its own
    # daemon thread, so no extra threading is needed here.
    start_http_server(PROMETHEUS_PORT)
    logging.info("Prometheus metrics server started.")

    while True:
        try:
            for message in consumer:
                process_message(message)
            # With consumer_timeout_ms set, iteration ends after an idle
            # period and the while-loop simply starts polling again.
        except Exception as e:
            logging.error(f"Error in consumer loop: {e}. Attempting to reconnect...")
            # Basic backoff (KafkaConsumer also retries some failures internally)
            time.sleep(5)
            # More robust reconnect/re-initialization might be needed here

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        logging.info("Consumer stopped by user.")
    finally:
        if consumer:
            consumer.close()
            logging.info("Kafka consumer closed.")
        # The Prometheus server thread is a daemon and exits automatically

--------------------------------------------------------------------------------
/kafka-consumer/requirements.txt:
--------------------------------------------------------------------------------
kafka-python
prometheus_client
python-dotenv
--------------------------------------------------------------------------------