├── LICENSE
├── README.md
├── bgp-analyzer
│   ├── Dockerfile
│   ├── bgp_analyzer.py
│   └── requirements.txt
├── canary
│   ├── dns
│   │   ├── Dockerfile
│   │   ├── dns_canary.py
│   │   └── requirements.txt
│   ├── http
│   │   ├── Dockerfile
│   │   ├── http_canary.py
│   │   └── requirements.txt
│   └── ping
│       ├── Dockerfile
│       ├── ping_canary.py
│       └── requirements.txt
├── infra
│   ├── docker-compose.yml
│   └── prometheus.yml
└── kafka-consumer
    ├── Dockerfile
    ├── kafka_consumer.py
    └── requirements.txt

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2025 Shankar K.
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Network Observability Platform (Open Source ThousandEyes Alternative)
2 | 
3 | ## Introduction
4 | 
5 | ### What is this?
6 | This project provides a distributed system for synthetic network monitoring and BGP analysis, inspired by platforms like ThousandEyes. It allows you to:
7 | * Run automated tests (canaries) from various Points of Presence (POPs) to measure the reachability and performance of internet services using common network protocols:
8 |     * **Ping (ICMP):** Measures round-trip time (RTT) and packet loss (`canary/ping/`).
9 |     * **DNS:** Checks resolution time and correctness for various record types (`canary/dns/`).
10 |     * **HTTP(S):** Verifies availability, response time, status codes, and optional content matching (`canary/http/`).
11 |     * **Traceroute:** (Future) Maps the network path between the canary and the target, measuring hop-by-hop latency.
12 | * Passively monitor BGP updates for specified prefixes using public data streams (`bgp-analyzer/`).
13 | 
14 | Test results from canaries are collected centrally via Kafka, processed into Prometheus metrics, and visualized on dashboards. BGP events are monitored separately and also exposed as Prometheus metrics.
15 | 
16 | ### Why build this?
17 | Understanding how your services perform and how they are reachable from different network vantage points is crucial. This platform helps you:
18 | * **Gain User-Perspective Visibility:** See performance as your users or customers might experience it from different geographic locations or networks.
19 | * **Monitor Network Routing:** Observe BGP announcements and withdrawals affecting your important prefixes.
20 | * **Proactive Issue Detection:** Identify connectivity problems, latency spikes, service outages, or potential routing issues (like hijacks) before they impact a large number of users.
21 | * **Troubleshoot Connectivity:** Understand network paths and pinpoint potential bottlenecks or routing issues.
22 | * **Verify SLAs:** Monitor uptime and performance against service level agreements.
23 | * **Performance Benchmarking:** Track performance trends over time.
24 | 
25 | ### Core Concepts
26 | * **Synthetic Monitoring:** Instead of waiting for real user traffic to reveal problems, we *synthetically* generate test traffic (like pings or HTTP requests) at regular intervals to proactively check service health.
27 | * **Canary / Probe:** An automated script or agent responsible for running a specific type of test (e.g., a Ping Canary) from a particular location.
28 | * **Point of Presence (POP):** The physical or logical location where a canary agent runs. This could be a data center, a cloud region edge, an office network, or even a user's machine.
29 | * **BGP Monitoring:** Passively listening to data streams from BGP collectors (like RouteViews) to observe routing table updates (announcements, withdrawals) for specific network prefixes.
30 | 
31 | ## Architecture Deep Dive
32 | 
33 | ### Data Flow Diagram
34 | 
35 | ```mermaid
36 | graph TD
37 |     subgraph POP Locations
38 |         direction LR
39 |         subgraph POP 1
40 |             PingCanary[Ping Canary]
41 |         end
42 |         subgraph POP 2
43 |             DNSCanary[DNS Canary]
44 |         end
45 |         subgraph POP 3
46 |             HTTPCanary[HTTP Canary]
47 |         end
48 |         subgraph POP N
49 |             OtherCanaries[...]
50 |         end
51 |     end
52 | 
53 |     subgraph Central Services
54 |         direction LR
55 |         Kafka(Kafka Topic\n'canary-results')
56 |         Consumer(Kafka Consumer\nService)
57 |         BGPAnalyzer(BGP Analyzer\nService)
58 |         Prometheus(Prometheus)
59 |         Grafana(Grafana)
60 |         Alertmanager(Alertmanager)
61 |     end
62 | 
63 |     PingCanary -- JSON --> Kafka
64 |     DNSCanary -- JSON --> Kafka
65 |     HTTPCanary -- JSON --> Kafka
66 |     OtherCanaries -- JSON --> Kafka
67 | 
68 |     Consumer -- Reads --> Kafka
69 |     %% The consumer and the BGP analyzer each expose a /metrics endpoint,
70 |     %% which Prometheus scrapes via the edges below
71 | 
72 |     Prometheus -- Scrapes --> Consumer
73 |     Prometheus -- Scrapes --> BGPAnalyzer
74 |     Prometheus -- Sends Alerts --> Alertmanager
75 |     Grafana -- Queries --> Prometheus
76 | 
77 |     style Kafka fill:#f9f,stroke:#333,stroke-width:2px
78 |     style Prometheus fill:#e6522c,stroke:#333,stroke-width:2px,color:#fff
79 |     style Grafana fill:#F9A825,stroke:#333,stroke-width:2px,color:#000
80 |     style Alertmanager fill:#ff7f0e,stroke:#333,stroke-width:2px,color:#fff
81 |     style BGPAnalyzer fill:#add8e6,stroke:#333,stroke-width:2px,color:#000
82 | ```
83 | 
84 | ### Component Roles & Rationale
85 | 
86 | * **Canaries (Python Scripts + Docker)**
87 |     * *Role:* Execute specific network tests (Ping, DNS, HTTP) at scheduled intervals against defined targets. Package results as structured JSON. Reside in the `canary/` subdirectories (e.g., `canary/ping/`, `canary/dns/`, `canary/http/`).
88 |     * *Why:* Running tests from multiple POPs gives a realistic view of global user experience. Docker ensures consistent runtime environments across diverse POPs. Python is chosen for its excellent networking libraries, ease of development, and widespread availability.
89 | * **Kafka (Message Broker)**
90 |     * *Role:* Acts as a central, highly available, and durable buffer receiving test results (JSON messages) from all canaries via a specific topic (e.g., `canary-results`).
91 | * *Why:* Decouples canaries from the backend processing. Canaries just need to send data to Kafka and don't need to know about the consumer(s). This handles bursts of data, prevents data loss if the consumer is temporarily down, and allows for future expansion with multiple consumers. 92 | * **Kafka Consumer (Python Service)** 93 | * *Role:* Reads the JSON results from the Kafka topic (`canary-results`). Parses the data and translates it into metrics (e.g., latency gauges, status counters) exposed via an HTTP endpoint (`/metrics`) in a format Prometheus can understand. Located in `kafka-consumer/`. 94 | * *Why:* Acts as the crucial bridge between the event-driven Kafka stream from canaries and Prometheus's pull-based metric scraping model. It centralizes the logic for interpreting canary results and defining the corresponding Prometheus metrics. 95 | * **BGP Analyzer (Python Service)** 96 | * *Role:* Connects to public BGP data streams (e.g., RouteViews, RIPE RIS) using `pybgpstream`. Monitors announcements and withdrawals for configured network prefixes. Exposes BGP event counts and timestamps as Prometheus metrics via an HTTP endpoint (`/metrics`). Located in `bgp-analyzer/`. 97 | * *Why:* Provides visibility into internet routing changes affecting key prefixes. Runs as a separate central service because BGP data is typically consumed from collectors, not generated *at* individual POPs like canary tests. Exposing metrics allows correlation with canary data and alerting on routing anomalies. 98 | * **Prometheus (Metrics Database & Alerting Engine)** 99 | * *Role:* Periodically "scrapes" (fetches) the metrics from the Kafka Consumer's `/metrics` endpoint and the BGP Analyzer's `/metrics` endpoint. Stores these metrics efficiently as time-series data. Evaluates predefined alerting rules based on the collected metrics. 100 | * *Why:* An industry standard for time-series metrics and alerting. Its pull model simplifies the consumer/analyzer services. PromQL provides a powerful language for querying data and defining complex alert conditions across both canary and BGP data. 101 | * **Alertmanager (Alert Routing & Management)** 102 | * *Role:* Receives alert notifications triggered by Prometheus. Handles deduplicating, grouping, and silencing alerts. Routes alerts to configured notification channels (e.g., Slack, PagerDuty, email - *configuration needed*). 103 | * *Why:* Separates the complex logic of alert notification from Prometheus. Provides a central point for managing alert state. 104 | * **Grafana (Visualization)** 105 | * *Role:* Queries Prometheus (using PromQL) and displays the time-series metrics on interactive dashboards featuring graphs, tables, heatmaps, etc. 106 | * *Why:* The de facto standard for visualizing observability data. Highly flexible, supports numerous panel types, and makes it easy to explore trends, correlate data from different sources (canaries, BGP), and identify anomalies visually. 107 | 108 | ## Getting Started: Running Locally with Docker Compose 109 | 110 | This setup allows you to run the entire backend infrastructure (Kafka, Prometheus, Grafana, etc.), the Kafka Consumer service, and the BGP Analyzer service on your local machine for development and testing. 111 | 112 | ### Prerequisites 113 | * **Docker Engine:** Install Docker for your OS. [Official Docker Installation Guide](https://docs.docker.com/engine/install/) 114 | * **Docker Compose:** Usually included with Docker Desktop. Verify installation. 
[Official Docker Compose Installation Guide](https://docs.docker.com/compose/install/)
115 | * **Git:** Needed to clone the repository. [Official Git Installation Guide](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
116 | * **(Optional but Recommended) Host Tools:** Install `fping`, `dig`, and `curl` on your host machine if you want to run canaries directly using Python (Method 2 below). On macOS: `brew install fping` (`dig` and `curl` already ship with macOS). On Debian/Ubuntu: `sudo apt-get update && sudo apt-get install fping dnsutils curl`.
117 | 
118 | ### Setup Steps
119 | 1. **Clone the Repository:**
120 |     ```bash
121 |     git clone https://github.com/shankar0123/network-observability-platform.git
122 |     cd network-observability-platform
123 |     ```
124 | 2. **Navigate to Infrastructure Directory:**
125 |     ```bash
126 |     cd infra
127 |     ```
128 | 3. **Launch the Stack:**
129 |     ```bash
130 |     docker-compose up --build -d
131 |     ```
132 |     * `--build`: Ensures the `kafka-consumer` and `bgp-analyzer` Docker images are built using their Dockerfiles. Needed the first time or when their source code/requirements change.
133 |     * `up`: Creates and starts all the service containers defined in `docker-compose.yml`.
134 |     * `-d`: Runs the containers in "detached" mode.
135 | 
136 | ### Verify Services
137 | * **Check Container Status:** Wait a minute or two for services to initialize, then run:
138 |     ```bash
139 |     docker-compose ps
140 |     ```
141 |     All services listed (`zookeeper`, `kafka`, `kafka-consumer`, `prometheus`, `alertmanager`, `grafana`, `bgp-analyzer`) should have a `State` of `Up` or `running`.
142 | * **Access Web UIs:** Open these URLs in your browser:
143 |     * **Prometheus:** `http://localhost:9090`
144 |     * **Grafana:** `http://localhost:3001` (Default Login: `admin` / `admin`)
145 |     * **Alertmanager:** `http://localhost:9094`
146 | * **Check Prometheus Targets:**
147 |     1. Go to the Prometheus UI (`http://localhost:9090`).
148 |     2. Navigate to `Status` -> `Targets`.
149 |     3. Look for the `kafka-consumer` and `bgp-analyzer` jobs. Their `State` should be `UP`. If `DOWN`, check the logs of the respective container (e.g., `docker-compose logs kafka-consumer`, `docker-compose logs bgp-analyzer`). The `bgp-analyzer` might take a short while to connect initially.
150 | 
151 | ## Running Canaries
152 | 
153 | Once the backend stack is running, run one or more canary agents to send data into the system.
154 | 
155 | ### Understanding Canary Configuration
156 | Canaries are configured primarily through environment variables:
157 | * `CANARY_ID`: Unique identifier (e.g., `ping-pop-lhr-01`, `dns-pop-nyc-02`).
158 | * `KAFKA_BROKER`: Kafka broker address(es).
159 | * `KAFKA_TOPIC`: Kafka topic (e.g., `canary-results`).
160 | * `TARGET_HOSTS` / `TARGET_DOMAINS` / `TARGET_URLS`: Comma-separated list of targets.
161 | * `*_INTERVAL_SECONDS`: How often to run tests.
162 | * Other type-specific options (e.g., `QUERY_TYPE`, `EXPECTED_STRING`).
163 | 
164 | ### Method 1: Run Canary via Docker (Recommended)
165 | Build and run the desired canary container, connecting it to the Docker network.
166 | 
167 | 1. **Build Image (Example: DNS Canary):**
168 |     ```bash
169 |     cd ../canary/dns # Navigate to the canary directory
170 |     docker build -t dns-canary:latest .
171 |     ```
172 | 2. 
**Run Container (Example: DNS Canary):** 173 | ```bash 174 | # Find network name: docker network ls | grep monitoring (likely 'infra_monitoring') 175 | NETWORK_NAME="infra_monitoring" # Replace if different 176 | 177 | docker run --rm --network=$NETWORK_NAME \ 178 | -e KAFKA_BROKER="kafka:9092" \ 179 | -e KAFKA_TOPIC="canary-results" \ 180 | -e CANARY_ID="dns-docker-test-01" \ 181 | -e TARGET_DOMAINS="example.com,wikipedia.org" \ 182 | -e QUERY_TYPE="AAAA" \ 183 | -e DNS_INTERVAL_SECONDS="45" \ 184 | dns-canary:latest 185 | ``` 186 | * Repeat for other canary types (e.g., `ping-canary`, `http-canary`), adjusting image name and environment variables. 187 | 188 | ### Method 2: Run Canary Locally using Python (Development/Debugging) 189 | Run the script directly on your host. Remember to set `KAFKA_BROKER` to the host-exposed port (`localhost:9093`). 190 | 191 | 1. **Navigate & Setup (Example: HTTP Canary):** 192 | ```bash 193 | cd ../canary/http 194 | python -m venv venv 195 | source venv/bin/activate # Or .\venv\Scripts\activate on Windows 196 | pip install -r requirements.txt 197 | ``` 198 | 2. **Configure & Run:** 199 | ```bash 200 | export KAFKA_BROKER="localhost:9093" # Host port! 201 | export CANARY_ID="http-local-dev-01" 202 | export TARGET_URLS="https://httpbin.org/status/200,https://httpbin.org/status/404" 203 | python http_canary.py 204 | ``` 205 | 206 | ## Exploring the Data 207 | 208 | 1. **Check Logs:** 209 | * Canary data: `docker-compose logs -f kafka-consumer` 210 | * BGP events: `docker-compose logs -f bgp-analyzer` 211 | 2. **Query Metrics in Prometheus (`http://localhost:9090/graph`):** 212 | * **Canary Latency:** `canary_latency_ms{type="ping", target="8.8.8.8"}` 213 | * **DNS Latency:** `dns_resolve_time_ms{target="example.com", query_type="AAAA"}` 214 | * **HTTP Status:** `http_status_code{target="https://httpbin.org/status/404"}` 215 | * **Content Match:** `http_content_match_status{canary_id="http-local-dev-01"}` 216 | * **Ping Failures:** `rate(canary_status_total{type="ping", status="FAILURE"}[5m])` 217 | * **BGP Announcements:** `rate(bgp_announcements_total{prefix="1.1.1.0/24"}[10m])` 218 | * **BGP Withdrawals:** `increase(bgp_withdrawals_total{prefix="8.8.8.0/24"}[1h])` 219 | * **BGP Analyzer Freshness:** `time() - bgp_last_event_timestamp_seconds` (Shows seconds since last event per project) 220 | 3. **Visualize Data in Grafana (`http://localhost:3001`):** 221 | * Add Prometheus Data Source (URL: `http://prometheus:9090`, Access: `Server`). 222 | * Create dashboards combining canary metrics (latency, status) and BGP metrics (announcements, withdrawals) using the queries above. 223 | 224 | ## Next Steps & Future Development 225 | * **Implement Traceroute Canary:** Develop the `canary/traceroute/` component using `mtr` or `scamper`. Update the consumer to handle its complex output (hop-by-hop data). 226 | * **Build Grafana Dashboards:** Create comprehensive dashboards showing global maps, latency/status summaries per test type, BGP event timelines, etc. 227 | * **Configure Alerting:** Define meaningful alert rules in Prometheus (e.g., high latency, low success rate, specific HTTP errors, unexpected BGP withdrawals, BGP analyzer staleness). Configure Alertmanager to route alerts. 228 | * **Deployment to POPs:** Document deploying canary containers to actual POP locations. 229 | * **Security:** Implement Kafka security, add authentication to UIs, secure Grafana login. 
230 | * **Persistence:** Configure persistent volumes for Kafka/Zookeeper in `docker-compose.yml` if needed.
231 | * **Refine BGP Analyzer:** Add more sophisticated analysis (path changes, RPKI validation), consider internal BGP feeds.
232 | 
233 | ## Troubleshooting
234 | * **`docker-compose up` fails:** Check for port conflicts, ensure the Docker daemon is running, check network connectivity.
235 | * **Prometheus Target `kafka-consumer` or `bgp-analyzer` is DOWN:** Check container logs (`docker-compose logs <service-name>`). Verify Docker networking. Ensure the Prometheus scrape config matches the service name and port.
236 | * **No data in Grafana/Prometheus:** Is a canary running? Is the consumer processing messages (check logs)? Is the BGP analyzer running and receiving data (check logs)? Is Prometheus scraping targets successfully? Are PromQL queries correct? Is the Grafana time range correct?
237 | * **BGP Analyzer Issues:** `pybgpstream` can be sensitive to network conditions or collector availability. Check its logs for connection errors. Ensure monitored prefixes are valid.
238 | 
239 | ## 🪪 License
240 | 
241 | This project is licensed under the [MIT License](LICENSE).
242 | 
--------------------------------------------------------------------------------
/bgp-analyzer/Dockerfile:
--------------------------------------------------------------------------------
1 | # Use an official Python runtime as a parent image
2 | FROM python:3.10-slim
3 | 
4 | # pybgpstream often requires C build tools and potentially libbz2
5 | RUN apt-get update && apt-get install -y --no-install-recommends \
6 |     build-essential \
7 |     libbz2-dev \
8 |     && rm -rf /var/lib/apt/lists/*
9 | 
10 | # Set the working directory in the container
11 | WORKDIR /app
12 | 
13 | # Copy the requirements file into the container at /app
14 | COPY requirements.txt .
15 | 
16 | # Install any needed packages specified in requirements.txt
17 | RUN pip install --no-cache-dir -r requirements.txt
18 | 
19 | # Copy the current directory contents into the container at /app
20 | COPY bgp_analyzer.py .
21 | 
22 | # Make port 8001 available for Prometheus scraping
23 | EXPOSE 8001
24 | 
25 | # Define environment variables (can be overridden at runtime)
26 | ENV PREFIXES_TO_MONITOR="1.1.1.0/24,8.8.8.0/24"
27 | ENV BGPSTREAM_PROJECTS="routeviews,ris"
28 | # Default to live stream if BGPSTREAM_START_TIME is empty
29 | ENV BGPSTREAM_START_TIME=""
30 | ENV PROMETHEUS_PORT="8001"
31 | ENV LOG_LEVEL="INFO"
32 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr
33 | ENV PYTHONUNBUFFERED=1
34 | 
35 | # Run bgp_analyzer.py when the container launches
36 | CMD ["python", "bgp_analyzer.py"]
37 | 
--------------------------------------------------------------------------------
/bgp-analyzer/bgp_analyzer.py:
--------------------------------------------------------------------------------
1 | import pybgpstream
2 | import time
3 | import os
4 | import logging
5 | import threading
6 | from prometheus_client import start_http_server, Counter, Gauge
7 | from dotenv import load_dotenv
8 | from datetime import datetime, timezone
9 | 
10 | # --- Configuration ---
11 | load_dotenv()
12 | 
13 | # Comma-separated list of prefixes to monitor
14 | PREFIXES_TO_MONITOR = os.getenv('PREFIXES_TO_MONITOR', '1.1.1.0/24,8.8.8.0/24').split(',')
15 | # Comma-separated list of BGPStream projects (e.g., routeviews, ris)
16 | BGPSTREAM_PROJECTS = os.getenv('BGPSTREAM_PROJECTS', 'routeviews,ris').split(',')
17 | # Optional: Start time for BGPStream (e.g., 'YYYY-MM-DD HH:MM:SS'). If empty, uses live data.
18 | BGPSTREAM_START_TIME = os.getenv('BGPSTREAM_START_TIME', '')
19 | # Port for exposing Prometheus metrics
20 | PROMETHEUS_PORT = int(os.getenv('PROMETHEUS_PORT', '8001'))
21 | # Log level
22 | LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO').upper()
23 | 
24 | logging.basicConfig(level=LOG_LEVEL, format='%(asctime)s - %(levelname)s - %(message)s')
25 | 
26 | # --- Prometheus Metrics Definition ---
27 | BGP_ANNOUNCEMENTS_TOTAL = Counter(
28 |     'bgp_announcements_total',
29 |     'Total number of BGP announcements observed for monitored prefixes',
30 |     ['prefix', 'peer_asn', 'origin_as', 'project']
31 | )
32 | 
33 | BGP_WITHDRAWALS_TOTAL = Counter(
34 |     'bgp_withdrawals_total',
35 |     'Total number of BGP withdrawals observed for monitored prefixes',
36 |     ['prefix', 'peer_asn', 'project']
37 | )
38 | 
39 | # Gauge to track the timestamp of the last processed BGP event (useful for monitoring staleness)
40 | BGP_LAST_EVENT_TIMESTAMP = Gauge(
41 |     'bgp_last_event_timestamp_seconds',
42 |     'Unix timestamp of the last processed BGP event',
43 |     ['project']
44 | )
45 | 
46 | # --- BGPStream Processing Function ---
47 | def process_bgp_stream():
48 |     """Connects to BGPStream and processes elements."""
49 |     logging.info("Starting BGPStream processing...")
50 |     logging.info(f"Monitoring Prefixes: {PREFIXES_TO_MONITOR}")
51 |     logging.info(f"Using Projects: {BGPSTREAM_PROJECTS}")
52 | 
53 |     while True: # Keep trying to connect/reconnect
54 |         try:
55 |             # Create a new BGPStream instance
56 |             stream = pybgpstream.BGPStream(
57 |                 # Use live data when no start time is given, or replay from a start time
58 |                 from_time=BGPSTREAM_START_TIME if BGPSTREAM_START_TIME else None,
59 |                 until_time=None, # None means live/until now
60 |                 projects=BGPSTREAM_PROJECTS, # BGPStream projects (e.g., routeviews, ris)
61 |                 record_type="updates", # Process announcements and withdrawals
62 |                 filter=" or ".join([f"prefix exact {p}" for p in PREFIXES_TO_MONITOR]) # OR together exact-prefix filters
63 |             )
64 | 
65 |             logging.info("BGPStream instance created and filters applied.")
66 | 
67 |             for elem in stream:
68 |                 timestamp = time.time() # Record processing time
69 |                 project = elem.collector # Identify the source project/collector
70 | 
71 |                 BGP_LAST_EVENT_TIMESTAMP.labels(project=project).set(timestamp)
72 | 
73 |                 prefix = str(elem.fields.get('prefix'))
74 |                 peer_asn = str(elem.peer_asn)
75 | 
76 |                 # Check if the element's prefix is one we are monitoring
77 |                 # Note: the BGPStream filter should handle this, but double-check
78 |                 if prefix not in PREFIXES_TO_MONITOR:
79 |                     # Defensive: skip anything the filter let through that we don't track
80 |                     # logging.debug(f"Skipping element for non-monitored prefix {prefix}")
81 |                     continue
82 | 
83 |                 if elem.type == "A" or elem.type == "announce":
84 |                     origin_as = 'N/A'
85 |                     as_path = elem.fields.get("as-path", "")
86 |                     if as_path:
87 |                         # Origin AS is the last one in the path
88 |                         origin_as = as_path.split(" ")[-1]
89 | 
90 |                     logging.info(f"ANNOUNCE: Prefix={prefix}, PeerAS={peer_asn}, OriginAS={origin_as}, Path={as_path}, Collector={project}")
91 |                     BGP_ANNOUNCEMENTS_TOTAL.labels(
92 |                         prefix=prefix,
93 |                         peer_asn=peer_asn,
94 |                         origin_as=origin_as,
95 |                         project=project
96 |                     ).inc()
97 | 
98 |                 elif elem.type == "W" or elem.type == "withdraw":
99 |                     logging.info(f"WITHDRAW: Prefix={prefix}, PeerAS={peer_asn}, Collector={project}")
100 |                     BGP_WITHDRAWALS_TOTAL.labels(
101 |                         prefix=prefix,
102 |                         peer_asn=peer_asn,
103 |                         project=project
104 |                     ).inc()
105 |                 else:
106 |                     logging.debug(f"Ignoring BGP element of type: {elem.type}")
107 | 
108 | except Exception as e: 109 | logging.exception("Error during BGPStream processing. Restarting stream in 30 seconds...") 110 | time.sleep(30) # Wait before attempting to restart 111 | 112 | # --- Main --- 113 | def main(): 114 | logging.info(f"Starting BGP Analyzer Service") 115 | logging.info(f"Exposing Prometheus metrics on port {PROMETHEUS_PORT}") 116 | 117 | # Start Prometheus HTTP server in a background thread 118 | try: 119 | start_http_server(PROMETHEUS_PORT) 120 | logging.info("Prometheus metrics server started.") 121 | except Exception as e: 122 | logging.exception(f"Failed to start Prometheus server on port {PROMETHEUS_PORT}") 123 | return # Exit if we can't expose metrics 124 | 125 | # Start BGP processing in the main thread (or optionally another thread) 126 | process_bgp_stream() 127 | 128 | 129 | if __name__ == "__main__": 130 | try: 131 | main() 132 | except KeyboardInterrupt: 133 | logging.info("BGP Analyzer stopped by user.") 134 | except Exception as e: 135 | logging.exception("Unhandled exception in BGP Analyzer main.") 136 | finally: 137 | logging.info("BGP Analyzer shutting down.") 138 | -------------------------------------------------------------------------------- /bgp-analyzer/requirements.txt: -------------------------------------------------------------------------------- 1 | pybgpstream 2 | prometheus_client 3 | python-dotenv 4 | -------------------------------------------------------------------------------- /canary/dns/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use an official Python runtime as a parent image 2 | FROM python:3.10-slim 3 | 4 | # Set the working directory in the container 5 | WORKDIR /app 6 | 7 | # Copy the requirements file into the container at /app 8 | COPY requirements.txt . 9 | 10 | # Install any needed packages specified in requirements.txt 11 | # Using --no-cache-dir can make the image smaller 12 | RUN pip install --no-cache-dir -r requirements.txt 13 | 14 | # Copy the current directory contents into the container at /app 15 | COPY dns_canary.py . 
16 | 17 | # Define environment variables (can be overridden at runtime) 18 | ENV KAFKA_BROKER="kafka:9092" 19 | ENV KAFKA_TOPIC="canary-results" 20 | ENV CANARY_ID="dns-canary-default-01" 21 | ENV TARGET_DOMAINS="google.com,github.com" 22 | ENV DNS_INTERVAL_SECONDS="60" 23 | ENV QUERY_TYPE="A" 24 | # Empty TARGET_RESOLVER means use system default inside container 25 | ENV TARGET_RESOLVER="" 26 | ENV QUERY_TIMEOUT_SECONDS="5" 27 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr 28 | ENV PYTHONUNBUFFERED=1 29 | 30 | # Run dns_canary.py when the container launches 31 | CMD ["python", "dns_canary.py"] 32 | -------------------------------------------------------------------------------- /canary/dns/dns_canary.py: -------------------------------------------------------------------------------- 1 | import dns.resolver 2 | import json 3 | import os 4 | import time 5 | import logging 6 | from datetime import datetime, timezone 7 | from kafka import KafkaProducer 8 | from kafka.errors import NoBrokersAvailable 9 | from dotenv import load_dotenv 10 | 11 | # --- Configuration --- 12 | load_dotenv() 13 | 14 | KAFKA_BROKER = os.getenv('KAFKA_BROKER', 'localhost:9092') 15 | KAFKA_TOPIC = os.getenv('KAFKA_TOPIC', 'canary-results') 16 | CANARY_ID = os.getenv('CANARY_ID', 'dns-canary-local-01') 17 | # Comma-separated list of target domains 18 | TARGET_DOMAINS = os.getenv('TARGET_DOMAINS', 'google.com,github.com,cloudflare.com').split(',') 19 | DNS_INTERVAL_SECONDS = int(os.getenv('DNS_INTERVAL_SECONDS', '60')) 20 | QUERY_TYPE = os.getenv('QUERY_TYPE', 'A') # e.g., A, AAAA, CNAME, MX, NS, TXT 21 | # Optional: Specify a target resolver IP. If empty, use system default. 22 | TARGET_RESOLVER = os.getenv('TARGET_RESOLVER', '') # e.g., '8.8.8.8' 23 | QUERY_TIMEOUT_SECONDS = int(os.getenv('QUERY_TIMEOUT_SECONDS', '5')) 24 | 25 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 26 | 27 | # --- Kafka Producer Setup --- 28 | producer = None 29 | while producer is None: 30 | try: 31 | producer = KafkaProducer( 32 | bootstrap_servers=[KAFKA_BROKER], 33 | value_serializer=lambda v: json.dumps(v).encode('utf-8'), 34 | retries=5, 35 | request_timeout_ms=30000 36 | ) 37 | logging.info(f"Successfully connected to Kafka broker at {KAFKA_BROKER}") 38 | except NoBrokersAvailable: 39 | logging.error(f"Kafka broker at {KAFKA_BROKER} not available. Retrying in 10 seconds...") 40 | time.sleep(10) 41 | except Exception as e: 42 | logging.error(f"Error connecting to Kafka: {e}. Retrying in 10 seconds...") 43 | time.sleep(10) 44 | 45 | # --- DNS Query Function --- 46 | def perform_dns_query(domain, qtype, resolver_ip=None, timeout=5): 47 | """ 48 | Performs a DNS query using dnspython and returns results. 
49 | """ 50 | logging.info(f"Querying {domain} for {qtype} record (Resolver: {resolver_ip or 'System Default'})") 51 | result = { 52 | "status": "ERROR", 53 | "latency_ms": None, 54 | "answer": None, # List of strings from the answer 55 | "error_message": None, 56 | "resolver": resolver_ip or 'System Default' 57 | } 58 | resolver = dns.resolver.Resolver() 59 | resolver.timeout = timeout 60 | resolver.lifetime = timeout # Total time including retries 61 | 62 | if resolver_ip: 63 | resolver.nameservers = [resolver_ip] 64 | 65 | start_time = time.monotonic() 66 | try: 67 | answer = resolver.resolve(domain, qtype) 68 | end_time = time.monotonic() 69 | 70 | result["status"] = "SUCCESS" 71 | result["latency_ms"] = round((end_time - start_time) * 1000, 2) 72 | # Extract relevant parts of the answer, convert to string list 73 | result["answer"] = sorted([str(rdata) for rdata in answer]) 74 | logging.debug(f"DNS query for {domain} {qtype} successful: {result['answer']}") 75 | 76 | except dns.resolver.NXDOMAIN: 77 | end_time = time.monotonic() 78 | result["status"] = "FAILURE" 79 | result["latency_ms"] = round((end_time - start_time) * 1000, 2) 80 | result["error_message"] = "NXDOMAIN (Domain does not exist)" 81 | logging.warning(f"NXDOMAIN for {domain} {qtype}") 82 | except dns.resolver.Timeout: 83 | result["status"] = "TIMEOUT" 84 | result["error_message"] = f"Query timed out after {timeout} seconds" 85 | logging.error(f"Timeout querying {domain} {qtype}") 86 | except dns.resolver.NoNameservers as e: 87 | result["status"] = "ERROR" 88 | result["error_message"] = f"No nameservers available: {e}" 89 | logging.error(f"No nameservers for {domain} {qtype}: {e}") 90 | except dns.resolver.NoAnswer: 91 | end_time = time.monotonic() 92 | result["status"] = "FAILURE" # Technically successful query, but no answer of the requested type 93 | result["latency_ms"] = round((end_time - start_time) * 1000, 2) 94 | result["error_message"] = f"No {qtype} record found (NoAnswer)" 95 | logging.warning(f"No {qtype} record found for {domain}") 96 | except Exception as e: 97 | result["status"] = "ERROR" 98 | result["error_message"] = f"An unexpected error occurred: {e}" 99 | logging.exception(f"Unexpected error querying {domain} {qtype}") 100 | 101 | return result 102 | 103 | # --- Main Loop --- 104 | def main(): 105 | logging.info(f"Starting DNS Canary: {CANARY_ID}") 106 | logging.info(f"Targets: {TARGET_DOMAINS}") 107 | logging.info(f"Query Type: {QUERY_TYPE}") 108 | logging.info(f"Target Resolver: {TARGET_RESOLVER or 'System Default'}") 109 | logging.info(f"Interval: {DNS_INTERVAL_SECONDS}s") 110 | logging.info(f"Kafka Topic: {KAFKA_TOPIC}") 111 | 112 | while True: 113 | for domain in TARGET_DOMAINS: 114 | domain = domain.strip() 115 | if not domain: 116 | continue 117 | 118 | query_result = perform_dns_query(domain, QUERY_TYPE, TARGET_RESOLVER, QUERY_TIMEOUT_SECONDS) 119 | timestamp = datetime.now(timezone.utc).isoformat() 120 | 121 | message = { 122 | "type": "dns", 123 | "canary_id": CANARY_ID, 124 | "target": domain, # Use 'target' for consistency across canaries 125 | "timestamp": timestamp, 126 | "status": query_result["status"], 127 | "latency_ms": query_result["latency_ms"], 128 | "query_type": QUERY_TYPE, 129 | "resolver": query_result["resolver"], 130 | "answer": query_result["answer"], 131 | "error_message": query_result["error_message"] 132 | } 133 | 134 | try: 135 | future = producer.send(KAFKA_TOPIC, value=message) 136 | logging.info(f"Sent DNS result for {domain} to Kafka topic {KAFKA_TOPIC}") 137 | 
except Exception as e: 138 | logging.error(f"Failed to send message to Kafka: {e}") 139 | 140 | logging.info(f"Completed DNS query cycle. Sleeping for {DNS_INTERVAL_SECONDS} seconds...") 141 | time.sleep(DNS_INTERVAL_SECONDS) 142 | 143 | if __name__ == "__main__": 144 | try: 145 | main() 146 | except KeyboardInterrupt: 147 | logging.info("DNS Canary stopped by user.") 148 | finally: 149 | if producer: 150 | producer.flush() 151 | producer.close() 152 | logging.info("Kafka producer closed.") 153 | -------------------------------------------------------------------------------- /canary/dns/requirements.txt: -------------------------------------------------------------------------------- 1 | dnspython 2 | kafka-python 3 | python-dotenv 4 | -------------------------------------------------------------------------------- /canary/http/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use an official Python runtime as a parent image 2 | FROM python:3.10-slim 3 | 4 | # Set the working directory in the container 5 | WORKDIR /app 6 | 7 | # Copy the requirements file into the container at /app 8 | COPY requirements.txt . 9 | 10 | # Install any needed packages specified in requirements.txt 11 | RUN pip install --no-cache-dir -r requirements.txt 12 | 13 | # Copy the current directory contents into the container at /app 14 | COPY http_canary.py . 15 | 16 | # Define environment variables (can be overridden at runtime) 17 | ENV KAFKA_BROKER="kafka:9092" 18 | ENV KAFKA_TOPIC="canary-results" 19 | ENV CANARY_ID="http-canary-default-01" 20 | ENV TARGET_URLS="https://google.com,https://github.com" 21 | ENV HTTP_INTERVAL_SECONDS="60" 22 | ENV REQUEST_TIMEOUT_SECONDS="10" 23 | # Optional: Check for this string in response body 24 | ENV EXPECTED_STRING="" 25 | # Optional: Set custom User-Agent 26 | ENV USER_AGENT="" 27 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr 28 | ENV PYTHONUNBUFFERED=1 29 | 30 | # Run http_canary.py when the container launches 31 | CMD ["python", "http_canary.py"] 32 | -------------------------------------------------------------------------------- /canary/http/http_canary.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import os 4 | import time 5 | import logging 6 | from datetime import datetime, timezone 7 | from kafka import KafkaProducer 8 | from kafka.errors import NoBrokersAvailable 9 | from dotenv import load_dotenv 10 | 11 | # --- Configuration --- 12 | load_dotenv() 13 | 14 | KAFKA_BROKER = os.getenv('KAFKA_BROKER', 'localhost:9092') 15 | KAFKA_TOPIC = os.getenv('KAFKA_TOPIC', 'canary-results') 16 | CANARY_ID = os.getenv('CANARY_ID', 'http-canary-local-01') 17 | # Comma-separated list of target URLs 18 | TARGET_URLS = os.getenv('TARGET_URLS', 'https://google.com,https://github.com,https://httpbin.org/delay/1').split(',') 19 | HTTP_INTERVAL_SECONDS = int(os.getenv('HTTP_INTERVAL_SECONDS', '60')) 20 | REQUEST_TIMEOUT_SECONDS = int(os.getenv('REQUEST_TIMEOUT_SECONDS', '10')) 21 | # Optional: String to check for in the response body 22 | EXPECTED_STRING = os.getenv('EXPECTED_STRING', '') 23 | # Optional: User-Agent header 24 | USER_AGENT = os.getenv('USER_AGENT', f'NetworkObservabilityCanary/{CANARY_ID}') 25 | 26 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 27 | 28 | # --- Kafka Producer Setup --- 29 | producer = None 30 | while producer is None: 31 | try: 32 | producer = 
KafkaProducer( 33 | bootstrap_servers=[KAFKA_BROKER], 34 | value_serializer=lambda v: json.dumps(v).encode('utf-8'), 35 | retries=5, 36 | request_timeout_ms=30000 37 | ) 38 | logging.info(f"Successfully connected to Kafka broker at {KAFKA_BROKER}") 39 | except NoBrokersAvailable: 40 | logging.error(f"Kafka broker at {KAFKA_BROKER} not available. Retrying in 10 seconds...") 41 | time.sleep(10) 42 | except Exception as e: 43 | logging.error(f"Error connecting to Kafka: {e}. Retrying in 10 seconds...") 44 | time.sleep(10) 45 | 46 | # --- HTTP Request Function --- 47 | def perform_http_request(url, timeout=10, expected_string="", user_agent=""): 48 | """ 49 | Performs an HTTP GET request and returns results. 50 | """ 51 | logging.info(f"Requesting URL: {url}") 52 | result = { 53 | "status": "ERROR", 54 | "latency_ms": None, 55 | "status_code": None, 56 | "response_size_bytes": None, 57 | "content_match": None, # True, False, or None if not checked 58 | "error_message": None, 59 | } 60 | headers = {'User-Agent': user_agent} if user_agent else {} 61 | 62 | start_time = time.monotonic() 63 | try: 64 | response = requests.get(url, timeout=timeout, headers=headers, allow_redirects=True) 65 | end_time = time.monotonic() 66 | 67 | result["latency_ms"] = round((end_time - start_time) * 1000, 2) 68 | result["status_code"] = response.status_code 69 | result["response_size_bytes"] = len(response.content) 70 | 71 | if response.ok: # Status code < 400 72 | result["status"] = "SUCCESS" 73 | if expected_string: 74 | if expected_string in response.text: 75 | result["content_match"] = True 76 | logging.debug(f"Expected string '{expected_string}' found in response from {url}") 77 | else: 78 | result["content_match"] = False 79 | result["status"] = "FAILURE" # Mark as failure if content doesn't match 80 | result["error_message"] = f"Expected string '{expected_string}' not found in response" 81 | logging.warning(f"Expected string not found in response from {url}") 82 | else: 83 | result["status"] = "FAILURE" 84 | result["error_message"] = f"HTTP Error: {response.status_code} {response.reason}" 85 | logging.warning(f"HTTP error for {url}: {result['error_message']}") 86 | 87 | except requests.exceptions.Timeout: 88 | result["status"] = "TIMEOUT" 89 | result["error_message"] = f"Request timed out after {timeout} seconds" 90 | logging.error(f"Timeout requesting {url}") 91 | except requests.exceptions.RequestException as e: 92 | result["status"] = "ERROR" 93 | # Extract a more specific error if possible 94 | result["error_message"] = f"Request failed: {type(e).__name__} - {e}" 95 | logging.error(f"Request failed for {url}: {result['error_message']}") 96 | except Exception as e: 97 | result["status"] = "ERROR" 98 | result["error_message"] = f"An unexpected error occurred: {e}" 99 | logging.exception(f"Unexpected error requesting {url}") 100 | 101 | return result 102 | 103 | # --- Main Loop --- 104 | def main(): 105 | logging.info(f"Starting HTTP Canary: {CANARY_ID}") 106 | logging.info(f"Target URLs: {TARGET_URLS}") 107 | logging.info(f"Interval: {HTTP_INTERVAL_SECONDS}s") 108 | logging.info(f"Kafka Topic: {KAFKA_TOPIC}") 109 | if EXPECTED_STRING: 110 | logging.info(f"Checking for string: '{EXPECTED_STRING}'") 111 | 112 | while True: 113 | for url in TARGET_URLS: 114 | url = url.strip() 115 | if not url: 116 | continue 117 | 118 | request_result = perform_http_request(url, REQUEST_TIMEOUT_SECONDS, EXPECTED_STRING, USER_AGENT) 119 | timestamp = datetime.now(timezone.utc).isoformat() 120 | 121 | message = { 122 | "type": 
"http", 123 | "canary_id": CANARY_ID, 124 | "target": url, # Use 'target' for consistency 125 | "timestamp": timestamp, 126 | "status": request_result["status"], 127 | "latency_ms": request_result["latency_ms"], 128 | "status_code": request_result["status_code"], 129 | "response_size_bytes": request_result["response_size_bytes"], 130 | "content_match": request_result["content_match"], 131 | "error_message": request_result["error_message"] 132 | } 133 | 134 | try: 135 | future = producer.send(KAFKA_TOPIC, value=message) 136 | logging.info(f"Sent HTTP result for {url} to Kafka topic {KAFKA_TOPIC}") 137 | except Exception as e: 138 | logging.error(f"Failed to send message to Kafka: {e}") 139 | 140 | logging.info(f"Completed HTTP request cycle. Sleeping for {HTTP_INTERVAL_SECONDS} seconds...") 141 | time.sleep(HTTP_INTERVAL_SECONDS) 142 | 143 | if __name__ == "__main__": 144 | try: 145 | main() 146 | except KeyboardInterrupt: 147 | logging.info("HTTP Canary stopped by user.") 148 | finally: 149 | if producer: 150 | producer.flush() 151 | producer.close() 152 | logging.info("Kafka producer closed.") 153 | -------------------------------------------------------------------------------- /canary/http/requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | kafka-python 3 | python-dotenv 4 | -------------------------------------------------------------------------------- /canary/ping/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use an official Python runtime as a parent image 2 | FROM python:3.10-slim 3 | 4 | # Install fping 5 | RUN apt-get update && apt-get install -y fping --no-install-recommends && \ 6 | rm -rf /var/lib/apt/lists/* 7 | 8 | # Set the working directory in the container 9 | WORKDIR /app 10 | 11 | # Copy the requirements file into the container at /app 12 | COPY requirements.txt . 13 | 14 | # Install any needed packages specified in requirements.txt 15 | RUN pip install --no-cache-dir -r requirements.txt 16 | 17 | # Copy the current directory contents into the container at /app 18 | COPY ping_canary.py . 
19 | 20 | # Make port 9092 available to the world outside this container (optional, Kafka client doesn't need inbound) 21 | # EXPOSE 9092 22 | 23 | # Define environment variables (can be overridden at runtime) 24 | ENV KAFKA_BROKER="kafka:9092" 25 | ENV KAFKA_TOPIC="canary-results" 26 | ENV CANARY_ID="ping-canary-default-01" 27 | ENV TARGET_HOSTS="8.8.8.8,1.1.1.1" 28 | ENV PING_INTERVAL_SECONDS="60" 29 | ENV PING_COUNT="5" 30 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr 31 | ENV PYTHONUNBUFFERED=1 32 | 33 | # Run ping_canary.py when the container launches 34 | CMD ["python", "ping_canary.py"] 35 | -------------------------------------------------------------------------------- /canary/ping/ping_canary.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import json 3 | import os 4 | import time 5 | import logging 6 | import re 7 | from datetime import datetime, timezone 8 | from kafka import KafkaProducer 9 | from kafka.errors import NoBrokersAvailable 10 | from dotenv import load_dotenv 11 | 12 | # --- Configuration --- 13 | load_dotenv() # Load .env file if present (for local development) 14 | 15 | KAFKA_BROKER = os.getenv('KAFKA_BROKER', 'localhost:9092') 16 | KAFKA_TOPIC = os.getenv('KAFKA_TOPIC', 'canary-results') 17 | CANARY_ID = os.getenv('CANARY_ID', 'ping-canary-local-01') 18 | # Comma-separated list of targets 19 | TARGET_HOSTS = os.getenv('TARGET_HOSTS', '8.8.8.8,1.1.1.1').split(',') 20 | PING_INTERVAL_SECONDS = int(os.getenv('PING_INTERVAL_SECONDS', '60')) 21 | PING_COUNT = int(os.getenv('PING_COUNT', '5')) # Number of pings per target per interval 22 | 23 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 24 | 25 | # --- Kafka Producer Setup --- 26 | producer = None 27 | while producer is None: 28 | try: 29 | producer = KafkaProducer( 30 | bootstrap_servers=[KAFKA_BROKER], 31 | value_serializer=lambda v: json.dumps(v).encode('utf-8'), 32 | retries=5, # Retry sending messages 33 | request_timeout_ms=30000 # Increase timeout 34 | ) 35 | logging.info(f"Successfully connected to Kafka broker at {KAFKA_BROKER}") 36 | except NoBrokersAvailable: 37 | logging.error(f"Kafka broker at {KAFKA_BROKER} not available. Retrying in 10 seconds...") 38 | time.sleep(10) 39 | except Exception as e: 40 | logging.error(f"Error connecting to Kafka: {e}. Retrying in 10 seconds...") 41 | time.sleep(10) 42 | 43 | 44 | # --- Ping Function --- 45 | def perform_ping(target): 46 | """ 47 | Performs ping measurement using fping and returns results. 
48 |     fping output (sent to stderr with -q):
49 |         Target Name : xmt/rcv/%loss = 5/5/0%, min/avg/max = 1.23/4.56/7.89
50 |     or if 100% loss:
51 |         Target Name : xmt/rcv/%loss = 5/0/100%
52 |     """
53 |     logging.info(f"Pinging target: {target} ({PING_COUNT} times)")
54 |     result = {
55 |         "status": "ERROR",
56 |         "rtt_avg_ms": None,
57 |         "packet_loss_percent": None,
58 |         "error_message": None
59 |     }
60 |     # fping command: quiet, count mode (-c, per-target summary), interval 20ms, timeout 500ms per ping
61 |     command = ["fping", "-q", "-c", str(PING_COUNT), "-p", "20", "-t", "500", target]
62 | 
63 |     try:
64 |         # fping sends summary results to stderr when using -q
65 |         process = subprocess.run(command, capture_output=True, text=True, check=False, timeout=10) # Added timeout
66 | 
67 |         if process.returncode > 1: # 0 = all reachable, 1 = some unreachable, >1 = error
68 |             result["error_message"] = f"fping command error (return code {process.returncode}): {process.stderr.strip()}"
69 |             logging.error(f"fping error for {target}: {result['error_message']}")
70 |             return result
71 | 
72 |         output = process.stderr.strip()
73 |         logging.debug(f"fping output for {target}: {output}")
74 | 
75 |         # Regex to parse fping output
76 |         # Example: google.com : xmt/rcv/%loss = 5/5/0%, min/avg/max = 10.7/11.0/11.6
77 |         # Example: 10.0.0.1 : xmt/rcv/%loss = 5/0/100%
78 |         match = re.search(
79 |             r": xmt/rcv/%loss = (\d+)/(\d+)/(\d+)%(?:, min/avg/max = ([0-9.]+)/([0-9.]+)/([0-9.]+))?",
80 |             output
81 |         )
82 | 
83 |         if match:
84 |             sent, recv, loss_percent, min_rtt, avg_rtt, max_rtt = match.groups()
85 |             result["packet_loss_percent"] = float(loss_percent)
86 | 
87 |             if int(recv) > 0 and avg_rtt is not None: # Check if avg_rtt was captured (not 100% loss)
88 |                 result["status"] = "SUCCESS"
89 |                 result["rtt_avg_ms"] = float(avg_rtt)
90 |             elif int(loss_percent) == 100:
91 |                 result["status"] = "FAILURE" # Target likely down
92 |                 result["error_message"] = "100% packet loss"
93 |                 logging.warning(f"100% packet loss for {target}")
94 |             else:
95 |                 # Should not happen with current fping flags, but handle defensively
96 |                 result["status"] = "ERROR"
97 |                 result["error_message"] = "Partial loss but couldn't parse RTT"
98 |                 logging.error(f"Partial loss but couldn't parse RTT for {target}: {output}")
99 | 
100 |         else:
101 |             result["error_message"] = f"Could not parse fping output: {output}"
102 |             logging.error(f"Could not parse fping output for {target}: {output}")
103 | 
104 |     except FileNotFoundError:
105 |         result["error_message"] = "fping command not found. Please install fping."
106 |         logging.critical("fping command not found. 
Please install fping.") 107 | # Consider exiting or disabling ping if fping is missing 108 | except subprocess.TimeoutExpired: 109 | result["error_message"] = f"fping command timed out after 10 seconds for target {target}" 110 | result["status"] = "TIMEOUT" 111 | logging.error(f"fping command timed out for {target}") 112 | except Exception as e: 113 | result["error_message"] = f"An unexpected error occurred during ping: {e}" 114 | logging.exception(f"Unexpected error pinging {target}") # Log full traceback 115 | 116 | return result 117 | 118 | # --- Main Loop --- 119 | def main(): 120 | logging.info(f"Starting Ping Canary: {CANARY_ID}") 121 | logging.info(f"Targets: {TARGET_HOSTS}") 122 | logging.info(f"Interval: {PING_INTERVAL_SECONDS}s") 123 | logging.info(f"Kafka Topic: {KAFKA_TOPIC}") 124 | 125 | while True: 126 | for target in TARGET_HOSTS: 127 | target = target.strip() # Remove leading/trailing whitespace 128 | if not target: 129 | continue 130 | 131 | ping_result = perform_ping(target) 132 | timestamp = datetime.now(timezone.utc).isoformat() 133 | 134 | message = { 135 | "type": "ping", 136 | "canary_id": CANARY_ID, 137 | "target": target, 138 | "timestamp": timestamp, 139 | "status": ping_result["status"], 140 | "rtt_avg_ms": ping_result["rtt_avg_ms"], 141 | "packet_loss_percent": ping_result["packet_loss_percent"], 142 | "error_message": ping_result["error_message"] 143 | } 144 | 145 | try: 146 | future = producer.send(KAFKA_TOPIC, value=message) 147 | # Optional: Wait for send confirmation (can slow down if Kafka is slow) 148 | # record_metadata = future.get(timeout=10) 149 | # logging.debug(f"Message sent to {record_metadata.topic} partition {record_metadata.partition}") 150 | logging.info(f"Sent ping result for {target} to Kafka topic {KAFKA_TOPIC}") 151 | except Exception as e: 152 | logging.error(f"Failed to send message to Kafka: {e}") 153 | # Consider buffering or other error handling here 154 | 155 | logging.info(f"Completed ping cycle. 
Sleeping for {PING_INTERVAL_SECONDS} seconds...") 156 | time.sleep(PING_INTERVAL_SECONDS) 157 | 158 | if __name__ == "__main__": 159 | try: 160 | main() 161 | except KeyboardInterrupt: 162 | logging.info("Ping Canary stopped by user.") 163 | finally: 164 | if producer: 165 | producer.flush() 166 | producer.close() 167 | logging.info("Kafka producer closed.") 168 | -------------------------------------------------------------------------------- /canary/ping/requirements.txt: -------------------------------------------------------------------------------- 1 | kafka-python 2 | python-dotenv 3 | -------------------------------------------------------------------------------- /infra/docker-compose.yml: -------------------------------------------------------------------------------- 1 | networks: 2 | monitoring: 3 | driver: bridge 4 | 5 | volumes: 6 | prometheus_data: {} 7 | grafana_data: {} 8 | # Kafka data volumes (optional, for persistence across restarts) 9 | # kafka_data: {} 10 | # zookeeper_data: {} 11 | # zookeeper_log: {} 12 | 13 | services: 14 | zookeeper: 15 | image: confluentinc/cp-zookeeper:7.3.2 # Using Confluent platform images 16 | container_name: zookeeper 17 | networks: 18 | - monitoring 19 | environment: 20 | ZOOKEEPER_CLIENT_PORT: 2181 21 | ZOOKEEPER_TICK_TIME: 2000 22 | # volumes: # Optional persistence 23 | # - zookeeper_data:/var/lib/zookeeper/data 24 | # - zookeeper_log:/var/lib/zookeeper/log 25 | 26 | kafka: 27 | image: confluentinc/cp-kafka:7.3.2 28 | container_name: kafka 29 | networks: 30 | - monitoring 31 | depends_on: 32 | - zookeeper 33 | ports: 34 | # Expose Kafka broker port to the host for potential external access if needed 35 | # Use 9092 for internal communication within docker network 36 | - "9093:9093" # Port for host access if needed (different from internal) 37 | environment: 38 | KAFKA_BROKER_ID: 1 39 | KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181' 40 | # Use kafka:9092 for internal communication within the Docker network 41 | KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT 42 | KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:9093 43 | KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1 44 | KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0 45 | KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1 # Required for newer Kafka versions 46 | KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1 # Required for newer Kafka versions 47 | # Auto-create topics (useful for development) 48 | KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true' 49 | # volumes: # Optional persistence 50 | # - kafka_data:/var/lib/kafka/data 51 | 52 | kafka-consumer: 53 | build: 54 | context: ../kafka-consumer # Path relative to docker-compose.yml 55 | dockerfile: Dockerfile 56 | container_name: kafka-consumer 57 | networks: 58 | - monitoring 59 | depends_on: 60 | - kafka 61 | environment: 62 | KAFKA_BROKER: 'kafka:9092' # Use internal service name 63 | KAFKA_TOPIC: 'canary-results' 64 | PROMETHEUS_PORT: '8000' 65 | PYTHONUNBUFFERED: 1 # Ensure logs appear immediately 66 | ports: 67 | - "8000:8000" # Expose Prometheus metrics port to host 68 | restart: unless-stopped 69 | 70 | prometheus: 71 | image: prom/prometheus:v2.45.0 72 | container_name: prometheus 73 | networks: 74 | - monitoring 75 | volumes: 76 | - ./prometheus.yml:/etc/prometheus/prometheus.yml 77 | - prometheus_data:/prometheus # Persistent storage for metrics 78 | command: 79 | - '--config.file=/etc/prometheus/prometheus.yml' 80 | - '--storage.tsdb.path=/prometheus' 81 | - 
'--web.console.libraries=/usr/share/prometheus/console_libraries' 82 | - '--web.console.templates=/usr/share/prometheus/consoles' 83 | - '--web.enable-lifecycle' # Allows reloading config via API 84 | ports: 85 | - "9090:9090" # Expose Prometheus UI 86 | restart: unless-stopped 87 | 88 | alertmanager: 89 | image: prom/alertmanager:v0.25.0 90 | container_name: alertmanager 91 | networks: 92 | - monitoring 93 | ports: 94 | - "9094:9093" # Expose Alertmanager UI (use different host port) 95 | # volumes: # Optional: Mount config file if needed later 96 | # - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml 97 | command: 98 | - '--config.file=/etc/alertmanager/alertmanager.yml' # Default path inside container 99 | - '--storage.path=/alertmanager' 100 | restart: unless-stopped 101 | # depends_on: # Not strictly needed for startup order, but good practice 102 | # - prometheus 103 | 104 | grafana: 105 | image: grafana/grafana-oss:9.5.3 106 | container_name: grafana 107 | networks: 108 | - monitoring 109 | ports: 110 | - "3001:3000" # Expose Grafana UI on host port 3001 111 | volumes: 112 | - grafana_data:/var/lib/grafana # Persistent storage for dashboards, etc. 113 | # Optional: Mount provisioning files for datasources/dashboards 114 | # - ./grafana/provisioning:/etc/grafana/provisioning 115 | environment: 116 | # Default login: admin/admin (change GF_SECURITY_ADMIN_PASSWORD for production) 117 | # GF_SECURITY_ADMIN_PASSWORD: 'your_secure_password' 118 | GF_AUTH_ANONYMOUS_ENABLED: "false" # Disable anonymous access 119 | # Optional: Configure Prometheus datasource automatically 120 | # GF_DATASOURCES_0_NAME: Prometheus 121 | # GF_DATASOURCES_0_TYPE: prometheus 122 | # GF_DATASOURCES_0_URL: http://prometheus:9090 123 | # GF_DATASOURCES_0_ACCESS: proxy 124 | # GF_DATASOURCES_0_IS_DEFAULT: true 125 | restart: unless-stopped 126 | depends_on: 127 | - prometheus 128 | 129 | bgp-analyzer: 130 | build: 131 | context: ../bgp-analyzer 132 | dockerfile: Dockerfile 133 | container_name: bgp-analyzer 134 | networks: 135 | - monitoring 136 | ports: 137 | - "8001:8001" # Expose Prometheus metrics port 138 | environment: 139 | # Optional: Override defaults here if needed 140 | # PREFIXES_TO_MONITOR: "192.0.2.0/24,203.0.113.0/24" 141 | # BGPSTREAM_PROJECTS: "routeviews" 142 | PROMETHEUS_PORT: "8001" 143 | LOG_LEVEL: "INFO" 144 | PYTHONUNBUFFERED: 1 145 | restart: unless-stopped 146 | -------------------------------------------------------------------------------- /infra/prometheus.yml: -------------------------------------------------------------------------------- 1 | global: 2 | scrape_interval: 15s # How frequently to scrape targets by default. 3 | evaluation_interval: 15s # How frequently to evaluate rules. 4 | 5 | # Alerting specifies runtime configuration for Alertmanager. 6 | alerting: 7 | alertmanagers: 8 | - static_configs: 9 | - targets: 10 | - alertmanager:9093 # Target Alertmanager instance 11 | 12 | # Load rules once and periodically evaluate them according to the global 'evaluation_interval'. 13 | # rule_files: 14 | # - "alert.rules.yml" # We'll add this later 15 | 16 | # A scrape configuration containing exactly one endpoint to scrape: 17 | scrape_configs: 18 | # The job name is added as a label `job=` to any timeseries scraped from this config. 19 | - job_name: 'kafka-consumer' 20 | # metrics_path defaults to '/metrics' 21 | # scheme defaults to 'http'. 
22 | static_configs: 23 | - targets: ['kafka-consumer:8000'] # Service name and port defined in docker-compose 24 | 25 | - job_name: 'bgp-analyzer' 26 | static_configs: 27 | - targets: ['bgp-analyzer:8001'] # Service name and port defined in docker-compose 28 | -------------------------------------------------------------------------------- /kafka-consumer/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use an official Python runtime as a parent image 2 | FROM python:3.10-slim 3 | 4 | # Set the working directory in the container 5 | WORKDIR /app 6 | 7 | # Copy the requirements file into the container at /app 8 | COPY requirements.txt . 9 | 10 | # Install any needed packages specified in requirements.txt 11 | RUN pip install --no-cache-dir -r requirements.txt 12 | 13 | # Copy the current directory contents into the container at /app 14 | COPY kafka_consumer.py . 15 | 16 | # Make port 8000 available to the world outside this container for Prometheus scraping 17 | EXPOSE 8000 18 | 19 | # Define environment variables (can be overridden at runtime) 20 | ENV KAFKA_BROKER="kafka:9092" 21 | ENV KAFKA_TOPIC="canary-results" 22 | ENV CONSUMER_GROUP_ID="canary-consumer-group-1" 23 | ENV PROMETHEUS_PORT="8000" 24 | # Add PYTHONUNBUFFERED to ensure logs are sent straight to stdout/stderr 25 | ENV PYTHONUNBUFFERED=1 26 | 27 | # Run kafka_consumer.py when the container launches 28 | CMD ["python", "kafka_consumer.py"] 29 | -------------------------------------------------------------------------------- /kafka-consumer/kafka_consumer.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import time 4 | import logging 5 | import threading 6 | from kafka import KafkaConsumer 7 | from kafka.errors import NoBrokersAvailable 8 | from prometheus_client import start_http_server, Gauge, Counter, Histogram 9 | from dotenv import load_dotenv 10 | 11 | # --- Configuration --- 12 | load_dotenv() 13 | 14 | KAFKA_BROKER = os.getenv('KAFKA_BROKER', 'localhost:9092') 15 | KAFKA_TOPIC = os.getenv('KAFKA_TOPIC', 'canary-results') 16 | CONSUMER_GROUP_ID = os.getenv('CONSUMER_GROUP_ID', 'canary-consumer-group-1') 17 | PROMETHEUS_PORT = int(os.getenv('PROMETHEUS_PORT', '8000')) 18 | 19 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 20 | 21 | # --- Prometheus Metrics Definition --- 22 | # Define labels that will be common across metrics 23 | LABELS = ['canary_id', 'target', 'type'] 24 | 25 | # Gauge for latency (e.g., RTT, HTTP response time) 26 | CANARY_LATENCY_MS = Gauge( 27 | 'canary_latency_ms', 28 | 'Latency of canary test in milliseconds', 29 | LABELS 30 | ) 31 | 32 | # Counter for test status (SUCCESS, FAILURE, ERROR, TIMEOUT) 33 | CANARY_STATUS_TOTAL = Counter( 34 | 'canary_status_total', 35 | 'Total count of canary test results by status', 36 | LABELS + ['status'] # Add status label here 37 | ) 38 | 39 | # Gauge for packet loss percentage (specific to ping) 40 | PING_PACKET_LOSS_PERCENT = Gauge( 41 | 'ping_packet_loss_percent', 42 | 'Packet loss percentage for ping canary tests', 43 | LABELS # Uses the standard labels 44 | ) 45 | 46 | # Gauge specifically for DNS resolve time (can be redundant with CANARY_LATENCY_MS but allows specific queries) 47 | DNS_RESOLVE_TIME_MS = Gauge( 48 | 'dns_resolve_time_ms', 49 | 'DNS resolve time in milliseconds', 50 | LABELS + ['query_type', 'resolver'] # Add DNS specific labels 51 | ) 52 | 53 | # --- HTTP Specific Metrics 
# --- HTTP Specific Metrics ---
HTTP_STATUS_CODE = Gauge(
    'http_status_code',
    'HTTP status code received from target',
    LABELS
)

HTTP_RESPONSE_SIZE_BYTES = Gauge(
    'http_response_size_bytes',
    'Size of the HTTP response body in bytes',
    LABELS
)

# Using a gauge for content match: 1=match, 0=mismatch, -1=not checked
HTTP_CONTENT_MATCH_STATUS = Gauge(
    'http_content_match_status',
    'Status of expected string match in HTTP response body (1=match, 0=mismatch, -1=not checked)',
    LABELS
)


# Optional: Histogram for latency distribution (more complex but powerful)
# CANARY_LATENCY_HISTOGRAM = Histogram(
#     'canary_latency_histogram_ms',
#     'Histogram of canary test latency in milliseconds',
#     LABELS,
#     buckets=[10, 50, 100, 250, 500, 1000, 2500, 5000, 10000]  # Example buckets
# )

# --- Kafka Consumer Setup ---
# Retry until the broker is reachable; under docker-compose the broker and
# this consumer may start in any order.
consumer = None
while consumer is None:
    try:
        consumer = KafkaConsumer(
            KAFKA_TOPIC,
            bootstrap_servers=[KAFKA_BROKER],
            group_id=CONSUMER_GROUP_ID,
            auto_offset_reset='latest',  # Start consuming from the latest message
            value_deserializer=lambda m: json.loads(m.decode('utf-8')),
            consumer_timeout_ms=1000  # Iteration stops after 1s without messages, letting the outer loop cycle
        )
        logging.info(f"Successfully connected to Kafka broker at {KAFKA_BROKER} and subscribed to topic {KAFKA_TOPIC}")
    except NoBrokersAvailable:
        logging.error(f"Kafka broker at {KAFKA_BROKER} not available. Retrying in 10 seconds...")
        time.sleep(10)
    except Exception as e:
        logging.error(f"Error connecting to Kafka: {e}. Retrying in 10 seconds...")
        time.sleep(10)

# --- Metrics Processing Function ---
def process_message(message):
    """Parses a single canary result message and updates Prometheus metrics."""
    try:
        data = message.value
        logging.debug(f"Received message: {data}")

        # Basic validation
        if not all(k in data for k in ['canary_id', 'target', 'type', 'status']):
            logging.warning(f"Skipping malformed message (missing required keys): {data}")
            return

        canary_id = data.get('canary_id', 'unknown')
        target = data.get('target', 'unknown')
        canary_type = data.get('type', 'unknown')
        status = data.get('status', 'ERROR').upper()  # Normalize status

        # Update status counter
        CANARY_STATUS_TOTAL.labels(
            canary_id=canary_id,
            target=target,
            type=canary_type,
            status=status
        ).inc()

        # Update latency if available; ERROR/TIMEOUT results carry no meaningful latency
        latency = data.get('rtt_avg_ms')  # Ping's key, checked first
        if latency is None:
            latency = data.get('latency_ms')  # Generic latency key used by other canaries

        if latency is not None and status in ['SUCCESS', 'FAILURE']:  # Only record latency for completed attempts
            try:
                latency_float = float(latency)
                CANARY_LATENCY_MS.labels(
                    canary_id=canary_id,
                    target=target,
                    type=canary_type
                ).set(latency_float)
                # Optional: update histogram
                # CANARY_LATENCY_HISTOGRAM.labels(...).observe(latency_float)
            except (ValueError, TypeError):
                logging.warning(f"Invalid latency value '{latency}' in message: {data}")
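        # A ping result is expected to look roughly like this (values are
        # illustrative; only the keys matter to this consumer):
        #   {"canary_id": "pop1-ping", "target": "8.8.8.8", "type": "ping",
        #    "status": "SUCCESS", "rtt_avg_ms": 23.4, "packet_loss_percent": 0.0}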
        # Update ping-specific metrics
        if canary_type == 'ping':
            loss = data.get('packet_loss_percent')
            if loss is not None:
                try:
                    loss_float = float(loss)
                    PING_PACKET_LOSS_PERCENT.labels(
                        canary_id=canary_id,
                        target=target,
                        type=canary_type  # Redundant here but keeps labels consistent
                    ).set(loss_float)
                except (ValueError, TypeError):
                    logging.warning(f"Invalid packet_loss_percent value '{loss}' in message: {data}")

        # --- DNS Specific Processing ---
        elif canary_type == 'dns':
            query_type = data.get('query_type', 'unknown')
            resolver = data.get('resolver', 'unknown')
            latency = data.get('latency_ms')  # DNS canaries use the 'latency_ms' key

            # Update the DNS-specific gauge if latency exists and the status is meaningful.
            # FAILURE is included because a query can resolve quickly yet return NXDOMAIN/NoAnswer.
            if latency is not None and status in ['SUCCESS', 'FAILURE']:
                try:
                    latency_float = float(latency)
                    DNS_RESOLVE_TIME_MS.labels(
                        canary_id=canary_id,
                        target=target,
                        type=canary_type,
                        query_type=query_type,
                        resolver=resolver
                    ).set(latency_float)
                except (ValueError, TypeError):
                    logging.warning(f"Invalid DNS latency value '{latency}' in message: {data}")
            # Note: the general CANARY_LATENCY_MS gauge was already updated above if latency was present

        # --- HTTP Specific Processing ---
        elif canary_type == 'http':
            # An HTTP result additionally carries, e.g. (illustrative values):
            #   {"status_code": 200, "response_size_bytes": 5123, "content_match": true}
            status_code = data.get('status_code')
            response_size = data.get('response_size_bytes')
            content_match = data.get('content_match')  # True, False, or None

            if status_code is not None:
                try:
                    HTTP_STATUS_CODE.labels(
                        canary_id=canary_id, target=target, type=canary_type
                    ).set(int(status_code))
                except (ValueError, TypeError):
                    logging.warning(f"Invalid status_code value '{status_code}' in message: {data}")

            if response_size is not None:
                try:
                    HTTP_RESPONSE_SIZE_BYTES.labels(
                        canary_id=canary_id, target=target, type=canary_type
                    ).set(int(response_size))
                except (ValueError, TypeError):
                    logging.warning(f"Invalid response_size_bytes value '{response_size}' in message: {data}")

            # Set the content match status gauge
            match_value = -1  # Default: not checked
            if content_match is True:
                match_value = 1
            elif content_match is False:
                match_value = 0
            HTTP_CONTENT_MATCH_STATUS.labels(
                canary_id=canary_id, target=target, type=canary_type
            ).set(match_value)

        # TODO: Add processing for traceroute canary type here
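        # A traceroute branch might slot in roughly like the commented sketch
        # below once that canary exists (the metric and the 'hops' key are
        # hypothetical, nothing emits them yet):
        # elif canary_type == 'traceroute':
        #     hops = data.get('hops')
        #     if hops is not None:
        #         TRACEROUTE_HOP_COUNT.labels(
        #             canary_id=canary_id, target=target, type=canary_type
        #         ).set(int(hops))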
    # Note: JSON decoding happens in the consumer's value_deserializer, so a
    # malformed payload raises in the consumer loop, not here; this handler
    # catches everything else so one bad message cannot kill the consumer.
    except Exception:
        logging.exception(f"Error processing message: {message.value}")  # Log full traceback

# --- Main Loop ---
def main():
    logging.info(f"Starting Kafka Consumer for topic {KAFKA_TOPIC}")
    logging.info(f"Exposing Prometheus metrics on port {PROMETHEUS_PORT}")

    # start_http_server is non-blocking: it serves /metrics from its own
    # daemon thread, so no extra threading is needed here.
    start_http_server(PROMETHEUS_PORT)
    logging.info("Prometheus metrics server started.")

    while True:
        try:
            for message in consumer:
                process_message(message)
            # With consumer_timeout_ms set, iteration ends after an idle
            # period and the while-loop simply starts polling again.
        except Exception as e:
            logging.error(f"Error in consumer loop: {e}. Attempting to reconnect...")
            # Basic backoff (KafkaConsumer also retries some failures internally)
            time.sleep(5)
            # More robust reconnect/re-initialization might be needed here

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        logging.info("Consumer stopped by user.")
    finally:
        if consumer:
            consumer.close()
            logging.info("Kafka consumer closed.")
        # The Prometheus server thread is a daemon and exits automatically

--------------------------------------------------------------------------------
/kafka-consumer/requirements.txt:
--------------------------------------------------------------------------------
kafka-python
prometheus_client
python-dotenv
--------------------------------------------------------------------------------