├── .dockerignore ├── .gitignore ├── .travis.yml ├── LICENSE ├── MANIFEST.in ├── README.md ├── airflow_prometheus_exporter ├── __init__.py ├── config.yaml ├── prometheus_exporter.py └── xcom_config.py ├── docker ├── Dockerfile └── airflow.sh ├── runtests ├── setup.cfg ├── setup.py └── tox.ini /.dockerignore: -------------------------------------------------------------------------------- 1 | # Distribution / packaging 2 | .Python 3 | env/ 4 | build/ 5 | develop-eggs/ 6 | dist/ 7 | downloads/ 8 | eggs/ 9 | .eggs/ 10 | lib/ 11 | lib64/ 12 | parts/ 13 | sdist/ 14 | var/ 15 | wheels/ 16 | *.egg-info/ 17 | .installed.cfg 18 | *.egg 19 | 20 | # Unit test / coverage reports 21 | htmlcov/ 22 | .tox/ 23 | .coverage 24 | .coverage.* 25 | .cache 26 | nosetests.xml 27 | coverage.xml 28 | *.cover 29 | .hypothesis/ 30 | .pytest_cache/ 31 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *.pyc 3 | *$py.class 4 | *~ 5 | dist/ 6 | *.egg-info 7 | *.egg 8 | *.eggs/ 9 | *.egg/ 10 | *.cache/ 11 | build/ 12 | .build/ 13 | _build/ 14 | pip-log.txt 15 | .directory 16 | erl_crash.dump 17 | *.db 18 | db.sqlite3 19 | Documentation/ 20 | .tox/ 21 | .idea/ 22 | .coverage 23 | .ve* 24 | cover/ 25 | coverage.* 26 | htmlcov/ 27 | .vagrant/ 28 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | # Config file for automatic testing at travis-ci.org 2 | dist: xenial 3 | sudo: false 4 | language: python 5 | python: 6 | - "3.6" 7 | - "3.7" 8 | install: pip install tox-travis 9 | script: tox 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017-2019, Robinhood Markets, Inc. 2 | All rights reserved. 3 | 4 | Airflow Prometheus Exporter is licensed under The BSD License (3 Clause, also known as 5 | the new BSD license). The license is an OSI approved Open Source 6 | license and is GPL-compatible(1). 7 | 8 | The license text can also be found here: 9 | http://www.opensource.org/licenses/BSD-3-Clause 10 | 11 | License 12 | ======= 13 | 14 | Redistribution and use in source and binary forms, with or without 15 | modification, are permitted provided that the following conditions are met: 16 | * Redistributions of source code must retain the above copyright 17 | notice, this list of conditions and the following disclaimer. 18 | * Redistributions in binary form must reproduce the above copyright 19 | notice, this list of conditions and the following disclaimer in the 20 | documentation and/or other materials provided with the distribution. 21 | * Neither the name of Robinhood Markets, Inc. nor the 22 | names of its contributors may be used to endorse or promote products 23 | derived from this software without specific prior written permission. 24 | 25 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 26 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, 27 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 28 | PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL Robinhood Markets, Inc. 
OR CONTRIBUTORS 29 | BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 30 | CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 31 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 32 | INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 33 | CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 34 | ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 35 | POSSIBILITY OF SUCH DAMAGE. 36 | 37 | Documentation License 38 | ===================== 39 | 40 | The documentation portion of Airflow Prometheus Exporter (the rendered contents of the 41 | "docs" directory of a software distribution or checkout) is supplied 42 | under the "Creative Commons Attribution-ShareAlike 4.0 43 | International" (CC BY-SA 4.0) License as described by 44 | http://creativecommons.org/licenses/by-sa/4.0/ 45 | 46 | Footnotes 47 | ========= 48 | (1) A GPL-compatible license makes it possible to 49 | combine Airflow Prometheus Exporter with other software that is released 50 | under the GPL; it does not mean that we're distributing 51 | Airflow Prometheus Exporter under the GPL license. The BSD license, unlike the GPL, 52 | lets you distribute a modified version without making your 53 | changes open source. 54 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.md 2 | include MANIFEST.in 3 | include setup.cfg 4 | include setup.py 5 | recursive-include airflow_prometheus_exporter/ * 6 | 7 | recursive-exclude * __pycache__ 8 | recursive-exclude * *.py[co] 9 | recursive-exclude * .*.sw[a-z] 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Airflow Prometheus Exporter 2 | 3 | [![Build Status](https://travis-ci.org/robinhood/airflow-prometheus-exporter.svg?branch=master)](https://travis-ci.org/robinhood/airflow-prometheus-exporter) 4 | 5 | The Airflow Prometheus Exporter exposes various metrics about the Scheduler, DAGs and Tasks, which help improve the observability of an Airflow cluster. 6 | 7 | The exporter is based on this [prometheus exporter for Airflow](https://github.com/epoch8/airflow-exporter). 8 | 9 | ## Requirements 10 | 11 | The plugin has been tested with: 12 | 13 | - Airflow >= 1.10.4 14 | - Python 3.6+ 15 | 16 | The scheduler metrics assume that there is a DAG named `canary_dag`. In our setup, the `canary_dag` is a DAG with tasks that perform very simple actions, such as establishing database connections. This DAG is used to test the uptime of the Airflow scheduler itself. 17 | 18 | ## Installation 19 | 20 | The exporter can be installed as an Airflow Plugin using: 21 | 22 | ```pip install airflow-prometheus-exporter``` 23 | 24 | This should ideally be installed in your Airflow virtualenv. 25 | 26 | ## Metrics 27 | 28 | Metrics will be available at 29 | 30 | `http://<your_airflow_host_and_port>/admin/metrics/` 31 | 32 | ### Task Specific Metrics 33 | 34 | #### `airflow_task_status` 35 | 36 | Number of tasks with a specific status. 37 | 38 | All the possible states are listed [here](https://github.com/apache/airflow/blob/master/airflow/utils/state.py#L46). 39 | 40 | #### `airflow_task_duration` 41 | 42 | Duration of successful tasks in seconds. 43 | 44 | #### `airflow_task_fail_count` 45 | 46 | Number of times a particular task has failed.
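These series can be consumed like any other Prometheus metrics once the endpoint above is scraped. A minimal `scrape_configs` sketch, assuming the Airflow webserver is reachable at `airflow-webserver:8080` (the job name and target below are illustrative placeholders, not part of this project):

```yaml
scrape_configs:
  - job_name: airflow               # illustrative job name
    metrics_path: /admin/metrics/   # path exposed by this plugin
    static_configs:
      - targets:
          - airflow-webserver:8080  # hypothetical Airflow webserver host:port
```

A query such as `airflow_task_status{status="failed"} > 0` can then drive dashboards or alerts on the task metrics above.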
47 | 48 | #### `airflow_xcom_parameter` 49 | 50 | Value of a configurable parameter stored in the XCom table. 51 | 52 | The XCom value is deserialized as a dictionary and, if the configured key is found for a particular task_id, its value is reported as a gauge. 53 | 54 | Add task/key combinations to `config.yaml`: 55 | 56 | ```yaml 57 | xcom_params: 58 | - 59 | task_id: abc 60 | key: count 61 | - 62 | task_id: def 63 | key: errors 64 | 65 | ``` 66 | 67 | 68 | A `task_id` of `all` will match against all Airflow tasks: 69 | 70 | ```yaml 71 | xcom_params: 72 | - 73 | task_id: all 74 | key: count 75 | ``` 76 | 77 | 78 | 79 | ### DAG Specific Metrics 80 | 81 | #### `airflow_dag_status` 82 | 83 | Number of DAG runs with a specific status. 84 | 85 | All the possible states are listed [here](https://github.com/apache/airflow/blob/master/airflow/utils/state.py#L59). 86 | 87 | #### `airflow_dag_run_duration` 88 | Duration of successful DagRuns in seconds. 89 | 90 | ### Scheduler Metrics 91 | 92 | #### `airflow_dag_scheduler_delay` 93 | 94 | Scheduling delay for a DAG Run in seconds. This metric assumes there is a `canary_dag`. 95 | 96 | The scheduling delay is measured as the delay between when a DAG is marked as `SCHEDULED` and when it actually starts `RUNNING`. 97 | 98 | #### `airflow_task_scheduler_delay` 99 | 100 | Scheduling delay for a Task in seconds. This metric assumes there is a `canary_dag`. 101 | 102 | #### `airflow_num_queued_tasks` 103 | 104 | Number of tasks in the `QUEUED` state at any given time. 105 | -------------------------------------------------------------------------------- /airflow_prometheus_exporter/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | """Top-level package for Airflow Prometheus Exporter.""" 4 | 5 | __author__ = """Robinhood Markets, Inc.""" 6 | __email__ = "open-source@robinhood.com" 7 | __version__ = '1.0.7' 8 | -------------------------------------------------------------------------------- /airflow_prometheus_exporter/config.yaml: -------------------------------------------------------------------------------- 1 | xcom_params: 2 | 3 | - 4 | task_id: all 5 | key: record_count 6 | 7 | 8 | -------------------------------------------------------------------------------- /airflow_prometheus_exporter/prometheus_exporter.py: -------------------------------------------------------------------------------- 1 | """Prometheus exporter for Airflow.""" 2 | import json 3 | import pickle 4 | from contextlib import contextmanager 5 | 6 | from airflow.configuration import conf 7 | from airflow.models import DagModel, DagRun, TaskInstance, TaskFail, XCom 8 | from airflow.plugins_manager import AirflowPlugin 9 | from airflow.settings import RBAC, Session 10 | from airflow.utils.state import State 11 | from airflow.utils.log.logging_mixin import LoggingMixin 12 | from flask import Response 13 | from flask_admin import BaseView, expose 14 | from prometheus_client import generate_latest, REGISTRY 15 | from prometheus_client.core import GaugeMetricFamily 16 | from sqlalchemy import and_, func 17 | 18 | from airflow_prometheus_exporter.xcom_config import load_xcom_config 19 | 20 | CANARY_DAG = "canary_dag" 21 | 22 | 23 | @contextmanager 24 | def session_scope(session): 25 | """Provide a transactional scope around a series of operations.""" 26 | try: 27 | yield session 28 | finally: 29 | session.close() 30 | 31 | 32 | ###################### 33 | # DAG Related Metrics 34 | ###################### 35 | 36 | 37 | def 
get_dag_state_info(): 38 | """Number of DAG Runs with particular state.""" 39 | with session_scope(Session) as session: 40 | dag_status_query = ( 41 | session.query( 42 | DagRun.dag_id, 43 | DagRun.state, 44 | func.count(DagRun.state).label("count"), 45 | ) 46 | .group_by(DagRun.dag_id, DagRun.state) 47 | .subquery() 48 | ) 49 | return ( 50 | session.query( 51 | dag_status_query.c.dag_id, 52 | dag_status_query.c.state, 53 | dag_status_query.c.count, 54 | DagModel.owners, 55 | ) 56 | .join(DagModel, DagModel.dag_id == dag_status_query.c.dag_id) 57 | .filter( 58 | DagModel.is_active == True, # noqa 59 | DagModel.is_paused == False, 60 | ) 61 | .all() 62 | ) 63 | 64 | 65 | def get_dag_duration_info(): 66 | """Duration of successful DAG Runs.""" 67 | with session_scope(Session) as session: 68 | max_execution_dt_query = ( 69 | session.query( 70 | DagRun.dag_id, 71 | func.max(DagRun.execution_date).label("max_execution_dt"), 72 | ) 73 | .join(DagModel, DagModel.dag_id == DagRun.dag_id) 74 | .filter( 75 | DagModel.is_active == True, # noqa 76 | DagModel.is_paused == False, 77 | DagRun.state == State.SUCCESS, 78 | DagRun.end_date.isnot(None), 79 | ) 80 | .group_by(DagRun.dag_id) 81 | .subquery() 82 | ) 83 | 84 | dag_start_dt_query = ( 85 | session.query( 86 | max_execution_dt_query.c.dag_id, 87 | max_execution_dt_query.c.max_execution_dt.label( 88 | "execution_date" 89 | ), 90 | func.min(TaskInstance.start_date).label("start_date"), 91 | ) 92 | .join( 93 | TaskInstance, 94 | and_( 95 | TaskInstance.dag_id == max_execution_dt_query.c.dag_id, 96 | ( 97 | TaskInstance.execution_date 98 | == max_execution_dt_query.c.max_execution_dt 99 | ), 100 | ), 101 | ) 102 | .group_by( 103 | max_execution_dt_query.c.dag_id, 104 | max_execution_dt_query.c.max_execution_dt, 105 | ) 106 | .subquery() 107 | ) 108 | 109 | return ( 110 | session.query( 111 | dag_start_dt_query.c.dag_id, 112 | dag_start_dt_query.c.start_date, 113 | DagRun.end_date, 114 | ) 115 | .join( 116 | DagRun, 117 | and_( 118 | DagRun.dag_id == dag_start_dt_query.c.dag_id, 119 | DagRun.execution_date 120 | == dag_start_dt_query.c.execution_date, 121 | ), 122 | ) 123 | .filter( 124 | TaskInstance.start_date.isnot(None), 125 | TaskInstance.end_date.isnot(None), 126 | ) 127 | .all() 128 | ) 129 | 130 | 131 | ###################### 132 | # Task Related Metrics 133 | ###################### 134 | 135 | 136 | def get_task_state_info(): 137 | """Number of task instances with particular state.""" 138 | with session_scope(Session) as session: 139 | task_status_query = ( 140 | session.query( 141 | TaskInstance.dag_id, 142 | TaskInstance.task_id, 143 | TaskInstance.state, 144 | func.count(TaskInstance.dag_id).label("value"), 145 | ) 146 | .group_by( 147 | TaskInstance.dag_id, TaskInstance.task_id, TaskInstance.state 148 | ) 149 | .subquery() 150 | ) 151 | return ( 152 | session.query( 153 | task_status_query.c.dag_id, 154 | task_status_query.c.task_id, 155 | task_status_query.c.state, 156 | task_status_query.c.value, 157 | DagModel.owners, 158 | ) 159 | .join(DagModel, DagModel.dag_id == task_status_query.c.dag_id) 160 | .filter( 161 | DagModel.is_active == True, # noqa 162 | DagModel.is_paused == False, 163 | ) 164 | .all() 165 | ) 166 | 167 | 168 | def get_task_failure_counts(): 169 | """Compute Task Failure Counts.""" 170 | with session_scope(Session) as session: 171 | return ( 172 | session.query( 173 | TaskFail.dag_id, 174 | TaskFail.task_id, 175 | func.count(TaskFail.dag_id).label("count"), 176 | ) 177 | .join(DagModel, DagModel.dag_id == 
TaskFail.dag_id,) 178 | .filter( 179 | DagModel.is_active == True, # noqa 180 | DagModel.is_paused == False, 181 | ) 182 | .group_by(TaskFail.dag_id, TaskFail.task_id,) 183 | ) 184 | 185 | 186 | def get_xcom_params(task_id): 187 | """XCom parameters for matching task_id's for the latest run of a DAG.""" 188 | with session_scope(Session) as session: 189 | max_execution_dt_query = ( 190 | session.query( 191 | DagRun.dag_id, 192 | func.max(DagRun.execution_date).label("max_execution_dt"), 193 | ) 194 | .group_by(DagRun.dag_id) 195 | .subquery() 196 | ) 197 | 198 | query = session.query(XCom.dag_id, XCom.task_id, XCom.value).join( 199 | max_execution_dt_query, 200 | and_( 201 | (XCom.dag_id == max_execution_dt_query.c.dag_id), 202 | ( 203 | XCom.execution_date 204 | == max_execution_dt_query.c.max_execution_dt 205 | ), 206 | ), 207 | ) 208 | if task_id == "all": 209 | return query.all() 210 | else: 211 | return query.filter(XCom.task_id == task_id).all() 212 | 213 | 214 | def extract_xcom_parameter(value): 215 | """Deserializes value stored in xcom table.""" 216 | enable_pickling = conf.getboolean("core", "enable_xcom_pickling") 217 | if enable_pickling: 218 | value = pickle.loads(value) 219 | try: 220 | value = json.loads(value) 221 | return value 222 | except Exception: 223 | return {} 224 | else: 225 | try: 226 | return json.loads(value.decode("UTF-8")) 227 | except ValueError: 228 | log = LoggingMixin().log 229 | log.error( 230 | "Could not deserialize the XCOM value from JSON. " 231 | "If you are using pickles instead of JSON " 232 | "for XCOM, then you need to enable pickle " 233 | "support for XCOM in your airflow config." 234 | ) 235 | return {} 236 | 237 | 238 | def get_task_duration_info(): 239 | """Duration of successful tasks in seconds.""" 240 | with session_scope(Session) as session: 241 | max_execution_dt_query = ( 242 | session.query( 243 | DagRun.dag_id, 244 | func.max(DagRun.execution_date).label("max_execution_dt"), 245 | ) 246 | .join(DagModel, DagModel.dag_id == DagRun.dag_id,) 247 | .filter( 248 | DagModel.is_active == True, # noqa 249 | DagModel.is_paused == False, 250 | DagRun.state == State.SUCCESS, 251 | DagRun.end_date.isnot(None), 252 | ) 253 | .group_by(DagRun.dag_id) 254 | .subquery() 255 | ) 256 | 257 | return ( 258 | session.query( 259 | TaskInstance.dag_id, 260 | TaskInstance.task_id, 261 | TaskInstance.start_date, 262 | TaskInstance.end_date, 263 | TaskInstance.execution_date, 264 | ) 265 | .join( 266 | max_execution_dt_query, 267 | and_( 268 | (TaskInstance.dag_id == max_execution_dt_query.c.dag_id), 269 | ( 270 | TaskInstance.execution_date 271 | == max_execution_dt_query.c.max_execution_dt 272 | ), 273 | ), 274 | ) 275 | .filter( 276 | TaskInstance.state == State.SUCCESS, 277 | TaskInstance.start_date.isnot(None), 278 | TaskInstance.end_date.isnot(None), 279 | ) 280 | .all() 281 | ) 282 | 283 | 284 | ###################### 285 | # Scheduler Related Metrics 286 | ###################### 287 | 288 | 289 | def get_dag_scheduler_delay(): 290 | """Compute DAG scheduling delay.""" 291 | with session_scope(Session) as session: 292 | return ( 293 | session.query( 294 | DagRun.dag_id, DagRun.execution_date, DagRun.start_date, 295 | ) 296 | .filter(DagRun.dag_id == CANARY_DAG,) 297 | .order_by(DagRun.execution_date.desc()) 298 | .limit(1) 299 | .all() 300 | ) 301 | 302 | 303 | def get_task_scheduler_delay(): 304 | """Compute Task scheduling delay.""" 305 | with session_scope(Session) as session: 306 | task_status_query = ( 307 | session.query( 308 | 
TaskInstance.queue, 309 | func.max(TaskInstance.start_date).label("max_start"), 310 | ) 311 | .filter( 312 | TaskInstance.dag_id == CANARY_DAG, 313 | TaskInstance.queued_dttm.isnot(None), 314 | ) 315 | .group_by(TaskInstance.queue) 316 | .subquery() 317 | ) 318 | return ( 319 | session.query( 320 | task_status_query.c.queue, 321 | TaskInstance.execution_date, 322 | TaskInstance.queued_dttm, 323 | task_status_query.c.max_start.label("start_date"), 324 | ) 325 | .join( 326 | TaskInstance, 327 | and_( 328 | TaskInstance.queue == task_status_query.c.queue, 329 | TaskInstance.start_date == task_status_query.c.max_start, 330 | ), 331 | ) 332 | .filter( 333 | TaskInstance.dag_id 334 | == CANARY_DAG, # Redundant, for performance. 335 | ) 336 | .all() 337 | ) 338 | 339 | 340 | def get_num_queued_tasks(): 341 | """Number of queued tasks currently.""" 342 | with session_scope(Session) as session: 343 | return ( 344 | session.query(TaskInstance) 345 | .filter(TaskInstance.state == State.QUEUED) 346 | .count() 347 | ) 348 | 349 | 350 | class MetricsCollector(object): 351 | """Metrics Collector for prometheus.""" 352 | 353 | def describe(self): 354 | return [] 355 | 356 | def collect(self): 357 | """Collect metrics.""" 358 | # Task metrics 359 | task_info = get_task_state_info() 360 | t_state = GaugeMetricFamily( 361 | "airflow_task_status", 362 | "Shows the number of task instances with particular status", 363 | labels=["dag_id", "task_id", "owner", "status"], 364 | ) 365 | for task in task_info: 366 | t_state.add_metric( 367 | [task.dag_id, task.task_id, task.owners, task.state or "none"], 368 | task.value, 369 | ) 370 | yield t_state 371 | 372 | task_duration = GaugeMetricFamily( 373 | "airflow_task_duration", 374 | "Duration of successful tasks in seconds", 375 | labels=["task_id", "dag_id", "execution_date"], 376 | ) 377 | for task in get_task_duration_info(): 378 | task_duration_value = ( 379 | task.end_date - task.start_date 380 | ).total_seconds() 381 | task_duration.add_metric( 382 | [task.task_id, task.dag_id, str(task.execution_date.date())], 383 | task_duration_value, 384 | ) 385 | yield task_duration 386 | 387 | task_failure_count = GaugeMetricFamily( 388 | "airflow_task_fail_count", 389 | "Count of failed tasks", 390 | labels=["dag_id", "task_id"], 391 | ) 392 | for task in get_task_failure_counts(): 393 | task_failure_count.add_metric( 394 | [task.dag_id, task.task_id], task.count 395 | ) 396 | yield task_failure_count 397 | 398 | # Dag Metrics 399 | dag_info = get_dag_state_info() 400 | d_state = GaugeMetricFamily( 401 | "airflow_dag_status", 402 | "Shows the number of dag starts with this status", 403 | labels=["dag_id", "owner", "status"], 404 | ) 405 | for dag in dag_info: 406 | d_state.add_metric([dag.dag_id, dag.owners, dag.state], dag.count) 407 | yield d_state 408 | 409 | dag_duration = GaugeMetricFamily( 410 | "airflow_dag_run_duration", 411 | "Duration of successful dag_runs in seconds", 412 | labels=["dag_id"], 413 | ) 414 | for dag in get_dag_duration_info(): 415 | dag_duration_value = ( 416 | dag.end_date - dag.start_date 417 | ).total_seconds() 418 | dag_duration.add_metric([dag.dag_id], dag_duration_value) 419 | yield dag_duration 420 | 421 | # Scheduler Metrics 422 | dag_scheduler_delay = GaugeMetricFamily( 423 | "airflow_dag_scheduler_delay", 424 | "Airflow DAG scheduling delay", 425 | labels=["dag_id"], 426 | ) 427 | 428 | for dag in get_dag_scheduler_delay(): 429 | dag_scheduling_delay_value = ( 430 | dag.start_date - dag.execution_date 431 | ).total_seconds() 432 | 
dag_scheduler_delay.add_metric( 433 | [dag.dag_id], dag_scheduling_delay_value 434 | ) 435 | yield dag_scheduler_delay 436 | 437 | # XCOM parameters 438 | 439 | xcom_params = GaugeMetricFamily( 440 | "airflow_xcom_parameter", 441 | "Airflow Xcom Parameter", 442 | labels=["dag_id", "task_id"], 443 | ) 444 | 445 | xcom_config = load_xcom_config() 446 | for tasks in xcom_config.get("xcom_params", []): 447 | for param in get_xcom_params(tasks["task_id"]): 448 | xcom_value = extract_xcom_parameter(param.value) 449 | 450 | if tasks["key"] in xcom_value: 451 | xcom_params.add_metric( 452 | [param.dag_id, param.task_id], xcom_value[tasks["key"]] 453 | ) 454 | 455 | yield xcom_params 456 | 457 | task_scheduler_delay = GaugeMetricFamily( 458 | "airflow_task_scheduler_delay", 459 | "Airflow Task scheduling delay", 460 | labels=["queue"], 461 | ) 462 | 463 | for task in get_task_scheduler_delay(): 464 | task_scheduling_delay_value = ( 465 | task.start_date - task.queued_dttm 466 | ).total_seconds() 467 | task_scheduler_delay.add_metric( 468 | [task.queue], task_scheduling_delay_value 469 | ) 470 | yield task_scheduler_delay 471 | 472 | num_queued_tasks_metric = GaugeMetricFamily( 473 | "airflow_num_queued_tasks", "Airflow Number of Queued Tasks", 474 | ) 475 | 476 | num_queued_tasks = get_num_queued_tasks() 477 | num_queued_tasks_metric.add_metric([], num_queued_tasks) 478 | yield num_queued_tasks_metric 479 | 480 | 481 | REGISTRY.register(MetricsCollector()) 482 | 483 | if RBAC: 484 | from flask_appbuilder import BaseView as FABBaseView, expose as FABexpose 485 | 486 | class RBACMetrics(FABBaseView): 487 | route_base = "/admin/metrics/" 488 | 489 | @FABexpose('/') 490 | def list(self): 491 | return Response(generate_latest(), mimetype='text/plain') 492 | 493 | # Metrics view for Flask App Builder, used in Airflow with RBAC enabled 494 | RBACmetricsView = { 495 | "view": RBACMetrics(), 496 | "name": "Metrics", 497 | "category": "Public" 498 | } 499 | ADMIN_VIEW = [] 500 | RBAC_VIEW = [RBACmetricsView] 501 | else: 502 | class Metrics(BaseView): 503 | @expose("/") 504 | def index(self): 505 | return Response(generate_latest(), mimetype="text/plain") 506 | 507 | ADMIN_VIEW = [Metrics(category="Prometheus exporter", name="Metrics")] 508 | RBAC_VIEW = [] 509 | 510 | 511 | class AirflowPrometheusPlugin(AirflowPlugin): 512 | """Airflow Plugin for collecting metrics.""" 513 | 514 | name = "airflow_prometheus_plugin" 515 | operators = [] 516 | hooks = [] 517 | executors = [] 518 | macros = [] 519 | admin_views = ADMIN_VIEW 520 | flask_blueprints = [] 521 | menu_links = [] 522 | appbuilder_views = RBAC_VIEW 523 | appbuilder_menu_items = [] 524 | -------------------------------------------------------------------------------- /airflow_prometheus_exporter/xcom_config.py: -------------------------------------------------------------------------------- 1 | import yaml 2 | from pathlib import Path 3 | 4 | CONFIG_FILE = Path.cwd() / "config.yaml" 5 | 6 | 7 | def load_xcom_config(): 8 | """Loads the XCom config if present.""" 9 | try: 10 | with open(CONFIG_FILE) as file: 11 | # The FullLoader parameter handles the conversion from YAML 12 | # scalar values to the Python dictionary format 13 | return yaml.load(file, Loader=yaml.FullLoader) 14 | except FileNotFoundError: 15 | return {} 16 | -------------------------------------------------------------------------------- /docker/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.6-slim 2 | 3 | ARG 
AIRFLOW_USER_HOME=/usr/local/airflow 4 | ENV AIRFLOW_HOME=${AIRFLOW_USER_HOME} 5 | 6 | RUN set -ex \ 7 | && apt-get update -yqq \ 8 | && apt-get upgrade -yqq \ 9 | && apt-get install -yqq --no-install-recommends \ 10 | build-essential \ 11 | apt-utils \ 12 | && useradd -ms /bin/bash -d ${AIRFLOW_USER_HOME} airflow \ 13 | && pip install -U pip 14 | 15 | COPY docker/airflow.sh ${AIRFLOW_USER_HOME}/airflow.sh 16 | RUN chmod +x ${AIRFLOW_USER_HOME}/airflow.sh 17 | 18 | COPY . /airflow-prometheus-exporter 19 | WORKDIR /airflow-prometheus-exporter 20 | RUN pip install -e . 21 | 22 | RUN chown -R airflow: ${AIRFLOW_USER_HOME} 23 | 24 | EXPOSE 8080 25 | 26 | USER airflow 27 | WORKDIR ${AIRFLOW_USER_HOME} 28 | RUN airflow initdb 29 | 30 | CMD ["./airflow.sh"] 31 | -------------------------------------------------------------------------------- /docker/airflow.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | airflow webserver 3 | -------------------------------------------------------------------------------- /runtests: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | docker build -t airflow-prometheus-exporter -f docker/Dockerfile . 3 | docker run -d -p 8080:8080 --name airflow-prometheus-exporter airflow-prometheus-exporter 4 | 5 | BASE_URL="http://localhost:8080" 6 | HEALTH_ENDPOINT="${BASE_URL}/health" 7 | METRICS_ENDPOINT="${BASE_URL}/admin/metrics/" 8 | NUM_ATTEMPTS=10 9 | ATTEMPT_COUNTER=0 10 | 11 | echo "Waiting for Airflow to be ready at ${HEALTH_ENDPOINT}..." 12 | 13 | until curl --output /dev/null --silent --show-error --head --fail "${HEALTH_ENDPOINT}"; do 14 | if [ ${ATTEMPT_COUNTER} -eq ${NUM_ATTEMPTS} ]; then 15 | echo "Timeout: Airflow webserver not ready" 16 | exit 1 17 | fi 18 | 19 | printf '.' 20 | ATTEMPT_COUNTER=$(($ATTEMPT_COUNTER+1)) 21 | sleep 10 22 | done 23 | 24 | echo "Retrieving metrics from ${METRICS_ENDPOINT}..." 
25 | curl --silent --show-error --fail "${METRICS_ENDPOINT}" 26 | 27 | STATUSCODE=$(curl --silent --output /dev/stderr --write-out "%{http_code}" "${METRICS_ENDPOINT}") 28 | 29 | docker stop airflow-prometheus-exporter 30 | docker rm airflow-prometheus-exporter 31 | 32 | if test $STATUSCODE -ne 200; then 33 | # error handling 34 | exit 1 35 | fi 36 | 37 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [bumpversion] 2 | current_version = 1.0.7 3 | commit = True 4 | tag = True 5 | 6 | [bumpversion:file:setup.py] 7 | search = version='{current_version}' 8 | replace = version='{new_version}' 9 | 10 | [bumpversion:file:airflow_prometheus_exporter/__init__.py] 11 | search = __version__ = '{current_version}' 12 | replace = __version__ = '{new_version}' 13 | 14 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """The setup script.""" 5 | 6 | from setuptools import setup, find_packages 7 | 8 | with open('README.md', encoding='utf-8') as readme_file: 9 | readme = readme_file.read() 10 | 11 | install_requirements = [ 12 | 'apache-airflow>=1.10.4', 13 | 'prometheus_client>=0.4.2', 14 | ] 15 | 16 | extras_require = { 17 | 'dev': [ 18 | 'bumpversion', 19 | 'tox', 20 | 'twine', 21 | ] 22 | } 23 | 24 | setup( 25 | author='Robinhood Markets, Inc.', 26 | author_email='open-source@robinhood.com', 27 | classifiers=[ 28 | 'Intended Audience :: Developers', 29 | 'License :: OSI Approved :: BSD License', 30 | 'Natural Language :: English', 31 | 'Programming Language :: Python :: 3 :: Only', 32 | 'Programming Language :: Python :: 3.6', 33 | 'Programming Language :: Python :: 3.7', 34 | ], 35 | description='Prometheus Exporter for Airflow Metrics', 36 | install_requires=install_requirements, 37 | extras_require=extras_require, 38 | license='BSD 3-Clause', 39 | long_description=readme, 40 | long_description_content_type='text/markdown', 41 | keywords='airflow_prometheus_exporter', 42 | name='airflow_prometheus_exporter', 43 | packages=find_packages(include=['airflow_prometheus_exporter']), 44 | include_package_data=True, 45 | url='https://github.com/robinhood/airflow_prometheus_exporter', 46 | version='1.0.7', 47 | entry_points={ 48 | 'airflow.plugins': [ 49 | 'AirflowPrometheus = airflow_prometheus_exporter.prometheus_exporter:AirflowPrometheusPlugin' 50 | ] 51 | }, 52 | ) 53 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = 3.7, 3.6, style 3 | 4 | [testenv] 5 | whitelist_externals = 6 | /bin/sh 7 | commands = 8 | sh -c "./runtests" 9 | 10 | [testenv:style] 11 | deps = 12 | flake8 13 | commands = 14 | flake8 {toxinidir}/airflow_prometheus_exporter 15 | --------------------------------------------------------------------------------
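As a closing usage note: once the exporter is deployed and scraped, the metrics described in the README lend themselves to simple alerting rules. A hedged sketch of a Prometheus rule file built on the `canary_dag` scheduling-delay metric (the rule name, threshold, and `for` duration are illustrative assumptions, not values prescribed by this project):

```yaml
groups:
  - name: airflow-exporter-example          # illustrative group name
    rules:
      - alert: AirflowSchedulerDelayHigh     # hypothetical alert name
        expr: airflow_dag_scheduler_delay > 300   # assumed 5-minute threshold, in seconds
        for: 10m
        annotations:
          summary: "Scheduling delay for the canary DAG has exceeded 5 minutes"
```

A similar rule on `airflow_num_queued_tasks` could be used to catch a growing backlog of queued task instances.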