├── backup-configs ├── README.md └── airflow-backup-configs.py ├── delete-broken-dags ├── README.md └── airflow-delete-broken-dags.py ├── kill-halted-tasks ├── README.md └── airflow-kill-halted-tasks.py ├── clear-missing-dags ├── README.md └── airflow-clear-missing-dags.py ├── .gitignore ├── db-cleanup ├── README.md └── airflow-db-cleanup.py ├── README.md ├── log-cleanup ├── README.md ├── airflow-log-cleanup-pwdless-ssh.py └── airflow-log-cleanup.py ├── sla-miss-report ├── README.md └── airflow-sla-miss-report.py └── LICENSE /backup-configs/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Backup Configs 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically take backups of various Airflow configurations and files. 4 | 5 | ## Deploy 6 | 7 | 1. Login to the machine running Airflow 8 | 9 | 2. Navigate to the dags directory 10 | 11 | 3. Copy the airflow-backup-configs.py file to this dags directory 12 | 13 | a. Here's a fast way: 14 | 15 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/backup-configs/airflow-backup-configs.py 16 | 17 | 4. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES, BACKUP_FOLDER_DATE_FORMAT, BACKUP_HOME_DIRECTORY, BACKUPS_ENABLED, and BACKUP_RETENTION_COUNT) in the DAG with the desired values 18 | 19 | 6. Enable the DAG in the Airflow Webserver 20 | 21 | -------------------------------------------------------------------------------- /delete-broken-dags/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Delete Broken DAGs 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically delete DAG files and clean out entries in the 4 | ImportError table for DAGs which Airflow cannot parse or import properly. This ensures that the ImportError table is cleaned every day. 5 | 6 | ## Deploy 7 | 8 | 1. Login to the machine running Airflow 9 | 10 | 2. Navigate to the dags directory 11 | 12 | 3. Copy the airflow-delete-broken-dags.py file to this dags directory 13 | 14 | a. Here's a fast way: 15 | 16 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/delete-broken-dags/airflow-delete-broken-dags.py 17 | 18 | 4. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES and ENABLE_DELETE) in the DAG with the desired values 19 | 20 | 5. Enable the DAG in the Airflow Webserver 21 | -------------------------------------------------------------------------------- /kill-halted-tasks/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Kill Halted Tasks 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically kill off tasks that are running in the background that don't correspond to a running task in the DB. 4 | 5 | This is useful because when you kill off a DAG Run or Task through the Airflow Web Server, the task still runs in the background on one of the executors until the task is complete. 6 | 7 | ## Deploy 8 | 9 | 1. Login to the machine running Airflow 10 | 11 | 2. Navigate to the dags directory 12 | 13 | 3. Copy the airflow-kill-halted-tasks.py file to this dags directory 14 | 15 | a. Here's a fast way: 16 | 17 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/kill-halted-tasks/airflow-kill-halted-tasks.py 18 | 19 | 4. 
Update the global variables in the DAG with the desired values 20 | 21 | 5. Enable the DAG in the Airflow Webserver 22 | 23 | -------------------------------------------------------------------------------- /clear-missing-dags/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Clear Missing DAGs 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically clean out entries in the DAG table of which there is no longer a corresponding Python File for it. This ensures that the DAG table doesn't have needless items in it and that the Airflow Web Server displays only those available DAGs. 4 | 5 | ## Deploy 6 | 7 | 1. Login to the machine running Airflow 8 | 9 | 2. Navigate to the dags directory 10 | 11 | 3. Copy the airflow-clear-missing-dags.py file to this dags directory 12 | 13 | a. Here's a fast way: 14 | 15 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/clear-missing-dags/airflow-clear-missing-dags.py 16 | 17 | 4. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES and ENABLE_DELETE) in the DAG with the desired values 18 | 19 | 5. Enable the DAG in the Airflow Webserver 20 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | # IDEA 92 | .idea 93 | 94 | # DS-STORE REMOVAL 95 | .DS_Store 96 | -------------------------------------------------------------------------------- /db-cleanup/README.md: -------------------------------------------------------------------------------- 1 | # Airflow DB Cleanup 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically clean out the DagRun, TaskInstance, Log, XCom, Job DB and SlaMiss entries to avoid having too much data in your Airflow MetaStore. 4 | 5 | ## Deploy 6 | 7 | 1. Login to the machine running Airflow 8 | 9 | 2. Navigate to the dags directory 10 | 11 | 3. 
Copy the airflow-db-cleanup.py file to this dags directory 12 | 13 | a. Here's a fast way: 14 | 15 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/db-cleanup/airflow-db-cleanup.py 16 | 17 | 4. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES and ENABLE_DELETE) in the DAG with the desired values 18 | 19 | 5. Modify the DATABASE_OBJECTS list to add/remove objects as needed. Each dictionary in the list features the following parameters: 20 | - airflow_db_model: Model imported from airflow.models corresponding to a table in the airflow metadata database 21 | - age_check_column: Column in the model/table to use for calculating max date of data deletion 22 | - keep_last: Boolean to specify whether to preserve last run instance 23 | - keep_last_filters: List of filters to preserve data from deleting during clean-up, such as DAG runs where the external trigger is set to 0. 24 | - keep_last_group_by: Option to specify column by which to group the database entries and perform aggregate functions. 25 | 26 | 6. Create and Set the following Variables in the Airflow Web Server (Admin -> Variables) 27 | 28 | - airflow_db_cleanup__max_db_entry_age_in_days - integer - Length to retain the log files if not already provided in the conf. If this is set to 30, the job will remove those files that are 30 days old or older. 29 | 30 | 7. Enable the DAG in the Airflow Webserver 31 | 32 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # airflow-maintenance-dags 2 | A series of DAGs/Workflows to help maintain the operation of Airflow 3 | 4 | ## DAGs/Workflows 5 | 6 | * [backup-configs](backup-configs) 7 | * A maintenance workflow that you can deploy into Airflow to periodically take backups of various Airflow configurations and files. 8 | * [clear-missing-dags](clear-missing-dags) 9 | * A maintenance workflow that you can deploy into Airflow to periodically clean out entries in the DAG table of which there is no longer a corresponding Python File for it. This ensures that the DAG table doesn't have needless items in it and that the Airflow Web Server displays only those available DAGs. 10 | * [db-cleanup](db-cleanup) 11 | * A maintenance workflow that you can deploy into Airflow to periodically clean out the DagRun, TaskInstance, Log, XCom, Job DB and SlaMiss entries to avoid having too much data in your Airflow MetaStore. 12 | * [kill-halted-tasks](kill-halted-tasks) 13 | * A maintenance workflow that you can deploy into Airflow to periodically kill off tasks that are running in the background that don't correspond to a running task in the DB. 14 | * This is useful because when you kill off a DAG Run or Task through the Airflow Web Server, the task still runs in the background on one of the executors until the task is complete. 15 | * [log-cleanup](log-cleanup) 16 | * A maintenance workflow that you can deploy into Airflow to periodically clean out the task logs to avoid those getting too big. 17 | * [delete-broken-dags](delete-broken-dags) 18 | * A maintenance workflow that you can deploy into Airflow to periodically delete DAG files and clean out entries in the ImportError table for DAGs which Airflow cannot parse or import properly. This ensures that the ImportError table is cleaned every day. 
19 | * [sla-miss-report](sla-miss-report) 20 | * DAG providing an extensive analysis report of SLA misses broken down on a daily, hourly, and task level 21 | -------------------------------------------------------------------------------- /log-cleanup/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Log Cleanup 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically clean out the task logs to avoid those getting too big. 4 | 5 | - **airflow-log-cleanup.py**: Allows to delete logs by specifying the **number** of worker nodes. Does not guarantee log deletion of all nodes. 6 | - **airflow-log-cleanup-pwdless-ssh.py**: Allows to delete logs by specifying the list of worker nodes by their hostname. Requires the `airflow` user to have passwordless ssh to access all nodes. 7 | 8 | ## Deploy 9 | 10 | 1. Login to the machine running Airflow 11 | 2. Navigate to the dags directory 12 | 3. Select the DAG to deploy (with or without SSH access) and follow the instructions 13 | 14 | ### airflow-log-cleanup.py 15 | 16 | 1. Copy the airflow-log-cleanup.py file to this dags directory 17 | 18 | a. Here's a fast way: 19 | 20 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/log-cleanup/airflow-log-cleanup.py 21 | 22 | 2. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES, ENABLE_DELETE and NUMBER_OF_WORKERS) in the DAG with the desired values 23 | 24 | 3. Create and Set the following Variables in the Airflow Web Server (Admin -> Variables) 25 | 26 | - airflow_log_cleanup__max_log_age_in_days - integer - Length to retain the log files if not already provided in the conf. If this is set to 30, the job will remove those files that are 30 days old or older. 27 | - airflow_log_cleanup__enable_delete_child_log - boolean (True/False) - Whether to delete files from the Child Log directory defined under [scheduler] in the airflow.cfg file 28 | 29 | 4. Enable the DAG in the Airflow Webserver 30 | 31 | ### airflow-log-cleanup-pwdless-ssh.py ### 32 | 33 | 1. Copy the airflow-log-cleanup-pwdless-ssh.py file to this dags directory 34 | 35 | a. Here's a fast way: 36 | 37 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/log-cleanup/airflow-log-cleanup-pwdless-ssh.py 38 | 39 | 2. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES, ENABLE_DELETE and AIRFLOW_HOSTS) in the DAG with the desired values 40 | 41 | 3. Create and Set the following Variables in the Airflow Web Server (Admin -> Variables) 42 | 43 | - airflow_log_cleanup__max_log_age_in_days - integer - Length to retain the log files if not already provided in the conf. If this is set to 30, the job will remove those files that are 30 days old or older. 44 | - airflow_log_cleanup__enable_delete_child_log - boolean (True/False) - Whether to delete files from the Child Log directory defined under [scheduler] in the airflow.cfg file 45 | 46 | 4. Ensure the `airflow` user can passwordless SSH on the hosts listed in `AIRFLOW_HOSTS` 47 | 1. Create a public and private key SSH key on all the worker nodes. You can follow these instructions: https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys--2 48 | 2. Add the public key content to the ~/.ssh/authorized_keys file on all the other machines 49 | 50 | 5. 
Enable the DAG in the Airflow Webserver 51 | -------------------------------------------------------------------------------- /delete-broken-dags/airflow-delete-broken-dags.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically delete broken DAG file(s). 3 | 4 | airflow trigger_dag airflow-delete-broken-dags 5 | 6 | """ 7 | from airflow.models import DAG, ImportError 8 | from airflow.operators.python_operator import PythonOperator 9 | from airflow import settings 10 | from datetime import timedelta 11 | import os 12 | import os.path 13 | import socket 14 | import logging 15 | import airflow 16 | 17 | 18 | # airflow-delete-broken-dags 19 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 20 | START_DATE = airflow.utils.dates.days_ago(1) 21 | # How often to Run. @daily - Once a day at Midnight 22 | SCHEDULE_INTERVAL = "@daily" 23 | # Who is listed as the owner of this DAG in the Airflow Web Server 24 | DAG_OWNER_NAME = "operations" 25 | # List of email address to send email alerts to if this job fails 26 | ALERT_EMAIL_ADDRESSES = [] 27 | # Whether the job should delete the logs or not. Included if you want to 28 | # temporarily avoid deleting the logs 29 | ENABLE_DELETE = True 30 | 31 | default_args = { 32 | 'owner': DAG_OWNER_NAME, 33 | 'email': ALERT_EMAIL_ADDRESSES, 34 | 'email_on_failure': True, 35 | 'email_on_retry': False, 36 | 'start_date': START_DATE, 37 | 'retries': 1, 38 | 'retry_delay': timedelta(minutes=1) 39 | } 40 | 41 | dag = DAG( 42 | DAG_ID, 43 | default_args=default_args, 44 | schedule_interval=SCHEDULE_INTERVAL, 45 | start_date=START_DATE, 46 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 47 | ) 48 | if hasattr(dag, 'doc_md'): 49 | dag.doc_md = __doc__ 50 | if hasattr(dag, 'catchup'): 51 | dag.catchup = False 52 | 53 | 54 | def delete_broken_dag_files(**context): 55 | 56 | logging.info("Starting to run Clear Process") 57 | 58 | try: 59 | host_name = socket.gethostname() 60 | host_ip = socket.gethostbyname(host_name) 61 | logging.info("Running on Machine with Host Name: " + host_name) 62 | logging.info("Running on Machine with IP: " + host_ip) 63 | except Exception as e: 64 | print("Unable to get Host Name and IP: " + str(e)) 65 | 66 | session = settings.Session() 67 | 68 | logging.info("Configurations:") 69 | logging.info("enable_delete: " + str(ENABLE_DELETE)) 70 | logging.info("session: " + str(session)) 71 | logging.info("") 72 | 73 | errors = session.query(ImportError).all() 74 | 75 | logging.info( 76 | "Process will be removing broken DAG file(s) from the file system:" 77 | ) 78 | for error in errors: 79 | logging.info("\tFile: " + str(error.filename)) 80 | logging.info( 81 | "Process will be Deleting " + str(len(errors)) + " DAG file(s)" 82 | ) 83 | 84 | if ENABLE_DELETE: 85 | logging.info("Performing Delete...") 86 | for error in errors: 87 | if os.path.exists(error.filename): 88 | os.remove(error.filename) 89 | session.delete(error) 90 | logging.info("Finished Performing Delete") 91 | else: 92 | logging.warn("You're opted to skip Deleting the DAG file(s)!!!") 93 | 94 | logging.info("Finished") 95 | 96 | 97 | delete_broken_dag_files = PythonOperator( 98 | task_id='delete_broken_dag_files', 99 | python_callable=delete_broken_dag_files, 100 | provide_context=True, 101 | dag=dag) 102 | -------------------------------------------------------------------------------- /clear-missing-dags/airflow-clear-missing-dags.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically clean out entries in the DAG table of which there is no longer a corresponding Python File for it. This ensures that the DAG table doesn't have needless items in it and that the Airflow Web Server displays only those available DAGs. 3 | 4 | airflow trigger_dag airflow-clear-missing-dags 5 | 6 | """ 7 | from airflow.models import DAG, DagModel 8 | from airflow.operators.python_operator import PythonOperator 9 | from airflow import settings 10 | from datetime import timedelta 11 | import os 12 | import os.path 13 | import socket 14 | import logging 15 | import airflow 16 | 17 | 18 | # airflow-clear-missing-dags 19 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 20 | START_DATE = airflow.utils.dates.days_ago(1) 21 | # How often to Run. @daily - Once a day at Midnight 22 | SCHEDULE_INTERVAL = "@daily" 23 | # Who is listed as the owner of this DAG in the Airflow Web Server 24 | DAG_OWNER_NAME = "operations" 25 | # List of email address to send email alerts to if this job fails 26 | ALERT_EMAIL_ADDRESSES = [] 27 | # Whether the job should delete the logs or not. Included if you want to 28 | # temporarily avoid deleting the logs 29 | ENABLE_DELETE = True 30 | 31 | default_args = { 32 | 'owner': DAG_OWNER_NAME, 33 | 'depends_on_past': False, 34 | 'email': ALERT_EMAIL_ADDRESSES, 35 | 'email_on_failure': True, 36 | 'email_on_retry': False, 37 | 'start_date': START_DATE, 38 | 'retries': 1, 39 | 'retry_delay': timedelta(minutes=1) 40 | } 41 | 42 | dag = DAG( 43 | DAG_ID, 44 | default_args=default_args, 45 | schedule_interval=SCHEDULE_INTERVAL, 46 | start_date=START_DATE, 47 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 48 | ) 49 | if hasattr(dag, 'doc_md'): 50 | dag.doc_md = __doc__ 51 | if hasattr(dag, 'catchup'): 52 | dag.catchup = False 53 | 54 | 55 | def clear_missing_dags_fn(**context): 56 | 57 | logging.info("Starting to run Clear Process") 58 | 59 | try: 60 | host_name = socket.gethostname() 61 | host_ip = socket.gethostbyname(host_name) 62 | logging.info("Running on Machine with Host Name: " + host_name) 63 | logging.info("Running on Machine with IP: " + host_ip) 64 | except Exception as e: 65 | print("Unable to get Host Name and IP: " + str(e)) 66 | 67 | session = settings.Session() 68 | 69 | logging.info("Configurations:") 70 | logging.info("enable_delete: " + str(ENABLE_DELETE)) 71 | logging.info("session: " + str(session)) 72 | logging.info("") 73 | 74 | dags = session.query(DagModel).all() 75 | entries_to_delete = [] 76 | for dag in dags: 77 | # Check if it is a zip-file 78 | if dag.fileloc is not None and '.zip/' in dag.fileloc: 79 | index = dag.fileloc.rfind('.zip/') + len('.zip') 80 | fileloc = dag.fileloc[0:index] 81 | else: 82 | fileloc = dag.fileloc 83 | 84 | if fileloc is None: 85 | logging.info( 86 | "After checking DAG '" + str(dag) + 87 | "', the fileloc was set to None so assuming the Python " + 88 | "definition file DOES NOT exist" 89 | ) 90 | entries_to_delete.append(dag) 91 | elif not os.path.exists(fileloc): 92 | logging.info( 93 | "After checking DAG '" + str(dag) + 94 | "', the Python definition file DOES NOT exist: " + fileloc 95 | ) 96 | entries_to_delete.append(dag) 97 | else: 98 | logging.info( 99 | "After checking DAG '" + str(dag) + 100 | "', the Python definition file does exist: " + fileloc 101 | ) 102 | 103 | logging.info("Process will be Deleting the DAG(s) from 
the DB:") 104 | for entry in entries_to_delete: 105 | logging.info("\tEntry: " + str(entry)) 106 | logging.info( 107 | "Process will be Deleting " + str(len(entries_to_delete)) + " DAG(s)" 108 | ) 109 | 110 | if ENABLE_DELETE: 111 | logging.info("Performing Delete...") 112 | for entry in entries_to_delete: 113 | session.delete(entry) 114 | session.commit() 115 | logging.info("Finished Performing Delete") 116 | else: 117 | logging.warn("You're opted to skip deleting the DAG entries!!!") 118 | 119 | logging.info("Finished Running Clear Process") 120 | 121 | 122 | clear_missing_dags = PythonOperator( 123 | task_id='clear_missing_dags', 124 | python_callable=clear_missing_dags_fn, 125 | provide_context=True, 126 | dag=dag) 127 | -------------------------------------------------------------------------------- /sla-miss-report/README.md: -------------------------------------------------------------------------------- 1 | # Airflow SLA Miss Report 2 | 3 | - [About](#about) 4 | - [Daily SLA Misses (timeframe: `long`)](#daily-sla-misses-timeframe-long) 5 | - [Hourly SLA Misses (timeframe: `short`)](#hourly-sla-misses-timeframe-short) 6 | - [DAG SLA Misses (timeframe: `short, medium, long`)](#dag-sla-misses-timeframe-short-medium-long) 7 | - [Sample Email](#sample-email) 8 | - [Sample Airflow Task Logs](#sample-airflow-task-logs) 9 | - [Architecture](#architecture) 10 | - [Requirements](#requirements) 11 | - [Deployment](#deployment) 12 | - [References](#references) 13 | 14 | 15 | ### About 16 | Airflow allows users to define [SLAs](https://github.com/teamclairvoyant/airflow-maintenance-dags/blob/teamclairvoyant/sla-miss-report/sla-miss-report/README.md) at DAG & task levels to track instances where processes are running longer than usual. However, making sense of the data is a challenge. 17 | 18 | The `airflow-sla-miss-report` DAG consolidates the data from the metadata tables and provides meaningful insights to ensure SLAs are met when set. 19 | 20 | The DAG utilizes **three (3) timeframes** (default: `short`: 1d, `medium`: 3d, `long`: 7d) to calculate the following KPIs: 21 | 22 | #### Daily SLA Misses (timeframe: `long`) 23 | Following details broken down on a daily basis for the provided long timeframe (e.g. 7 days): 24 | ``` 25 | SLA Miss %: percentage of tasks that missed their SLAs out of total tasks runs 26 | Top Violator (%): task that violated its SLA the most as a percentage of its total runs 27 | Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day 28 | ``` 29 | 30 | #### Hourly SLA Misses (timeframe: `short`) 31 | Following details broken down on an hourly basis for the provided short timeframe (e.g. 
1 day): 32 | ``` 33 | SLA Miss %: percentage of tasks that missed their SLAs out of total tasks runs 34 | Top Violator (%): task that violated its SLA the most as a percentage of its total runs 35 | Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day 36 | Longest Running Task: task that took the longest time to execute within the hour window 37 | Average Task Queue Time (s): avg time taken for tasks in `queued` state; can be used to detect scheduling bottlenecks 38 | ``` 39 | 40 | #### DAG SLA Misses (timeframe: `short, medium, long`) 41 | Following details broken down on a task level for all timeframes: 42 | ``` 43 | Current SLA (s): current defined SLA for the task 44 | Short, Medium, Long Timeframe SLA miss % (avg execution time): % of tasks that missed their SLAs & their avg execution times over the respective timeframes 45 | ``` 46 | 47 | #### **Sample Email** 48 | ![Airflow SLA miss Email Report Output1](https://user-images.githubusercontent.com/32403237/193700720-24b88202-edae-4199-a7f3-0e46e54e0d5d.png) 49 | 50 | #### **Sample Airflow Task Logs** 51 | ![Airflow SLA miss Email Report Output2](https://user-images.githubusercontent.com/32403237/194130208-da532d3a-3ff4-4dbd-9c94-574ef42b2ee8.png) 52 | 53 | 54 | ### Architecture 55 | The process reads data from the Airflow metadata database to calculate SLA misses based on the defined DAG/task level SLAs using information. 56 | The following metadata tables are utilized: 57 | - `SerializedDag`: retrieve defined DAG & task SLAs 58 | - `DagRuns`: details about each DAG run 59 | - `TaskInstances`: details about each task instance in a DAG run 60 | 61 | ![Airflow SLA Process Flow Architecture](https://user-images.githubusercontent.com/8946659/191114560-2368e2df-916a-4f66-b1ac-b6cfe0b35a47.png) 62 | 63 | ### Requirements 64 | - Python: 3.7 and above 65 | - Pip packages: `pandas` 66 | - Airflow: v2.3 and above 67 | - Airflow metadata tables: `DagRuns`, `TaskInstances`, `SerializedDag` 68 | - [SMTP details](https://airflow.apache.org/docs/apache-airflow/stable/howto/email-config.html#using-default-smtp) in `airflow.cfg` for sending emails 69 | 70 | ### Deployment 71 | 1. Login to the machine running Airflow 72 | 2. Navigate to the `dags` directory 73 | 3. Copy the `airflow-sla-miss-report.py` file to the `dags` directory. Here's a fast way: 74 | ``` 75 | wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/sla-miss-report/airflow-sla-miss-report.py 76 | ``` 77 | 4. Update the global variables in the DAG with the desired values: 78 | ``` 79 | EMAIL_ADDRESSES (optional): list of recipient emails to send the SLA report 80 | SHORT_TIMEFRAME_IN_DAYS: duration in days of the short timeframe to calculate SLA metrics (default: 1) 81 | MEDIUM_TIMEFRAME_IN_DAYS: duration in days of the medium timeframe to calculate SLA metrics (default: 3) 82 | LONG_TIMEFRAME_IN_DAYS: duration in days of the long timeframe to calculate SLA metrics (default: 7) 83 | ``` 84 | 5. Enable the DAG in the Airflow Webserver -------------------------------------------------------------------------------- /log-cleanup/airflow-log-cleanup-pwdless-ssh.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically clean 3 | out the task logs to avoid those getting too big. 
4 | airflow trigger_dag --conf '[curly-braces]"maxLogAgeInDays":30[curly-braces]' airflow-log-cleanup 5 | --conf options: 6 | maxLogAgeInDays: - Optional 7 | """ 8 | import logging 9 | import os 10 | import time 11 | from datetime import timedelta 12 | 13 | import airflow 14 | from airflow.configuration import conf 15 | from airflow.models import DAG, Variable 16 | from airflow.operators.bash_operator import BashOperator 17 | from airflow.operators.dummy_operator import DummyOperator 18 | 19 | # airflow-log-cleanup 20 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 21 | START_DATE = airflow.utils.dates.days_ago(1) 22 | try: 23 | BASE_LOG_FOLDER = conf.get("core", "BASE_LOG_FOLDER").rstrip("/") 24 | except Exception as e: 25 | BASE_LOG_FOLDER = conf.get("logging", "BASE_LOG_FOLDER").rstrip("/") 26 | # How often to Run. @daily - Once a day at Midnight 27 | SCHEDULE_INTERVAL = "@daily" 28 | # Who is listed as the owner of this DAG in the Airflow Web Server 29 | DAG_OWNER_NAME = "operations" 30 | # List of email address to send email alerts to if this job fails 31 | ALERT_EMAIL_ADDRESSES = [] 32 | # Length to retain the log files if not already provided in the conf. If this 33 | # is set to 30, the job will remove those files that are 30 days old or older 34 | DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get( 35 | "airflow_log_cleanup__max_log_age_in_days", 30 36 | ) 37 | # Whether the job should delete the logs or not. Included if you want to 38 | # temporarily avoid deleting the logs 39 | ENABLE_DELETE = False 40 | 41 | AIRFLOW_HOSTS = "localhost" # comma separated list of host(s) 42 | 43 | TEMP_LOG_CLEANUP_SCRIPT_PATH = "/tmp/airflow_log_cleanup.sh" 44 | DIRECTORIES_TO_DELETE = [BASE_LOG_FOLDER] 45 | ENABLE_DELETE_CHILD_LOG = Variable.get( 46 | "airflow_log_cleanup__enable_delete_child_log", "False" 47 | ) 48 | 49 | logging.info("ENABLE_DELETE_CHILD_LOG " + ENABLE_DELETE_CHILD_LOG) 50 | 51 | if not BASE_LOG_FOLDER or BASE_LOG_FOLDER.strip() == "": 52 | raise ValueError( 53 | "BASE_LOG_FOLDER variable is empty in airflow.cfg. It can be found " 54 | "under the [core] (<2.0.0) section or [logging] (>=2.0.0) in the cfg file. " 55 | "Kindly provide an appropriate directory path." 56 | ) 57 | 58 | if ENABLE_DELETE_CHILD_LOG.lower() == "true": 59 | try: 60 | CHILD_PROCESS_LOG_DIRECTORY = conf.get( 61 | "scheduler", "CHILD_PROCESS_LOG_DIRECTORY" 62 | ) 63 | if CHILD_PROCESS_LOG_DIRECTORY != ' ': 64 | DIRECTORIES_TO_DELETE.append(CHILD_PROCESS_LOG_DIRECTORY) 65 | except Exception as e: 66 | logging.exception( 67 | "Could not obtain CHILD_PROCESS_LOG_DIRECTORY from " + 68 | "Airflow Configurations: " + str(e) 69 | ) 70 | 71 | default_args = { 72 | 'owner': DAG_OWNER_NAME, 73 | 'depends_on_past': False, 74 | 'email': ALERT_EMAIL_ADDRESSES, 75 | 'email_on_failure': True, 76 | 'email_on_retry': False, 77 | 'start_date': START_DATE, 78 | 'retries': 1, 79 | 'retry_delay': timedelta(minutes=1) 80 | } 81 | 82 | dag = DAG( 83 | DAG_ID, 84 | default_args=default_args, 85 | schedule_interval=SCHEDULE_INTERVAL, 86 | start_date=START_DATE, 87 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 88 | ) 89 | if hasattr(dag, 'doc_md'): 90 | dag.doc_md = __doc__ 91 | if hasattr(dag, 'catchup'): 92 | dag.catchup = False 93 | 94 | log_cleanup = """ 95 | echo "Getting Configurations..." 
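# This wrapper script is copied to every host listed in AIRFLOW_HOSTS and then
# invoked over SSH with three positional arguments:
#   $1 - log directory to clean    $2 - max log age in days    $3 - delete flag (true/false)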
96 | 97 | BASE_LOG_FOLDER=$1 98 | MAX_LOG_AGE_IN_DAYS=$2 99 | ENABLE_DELETE=$3 100 | 101 | echo "Finished Getting Configurations" 102 | echo "" 103 | 104 | echo "Configurations:" 105 | echo "BASE_LOG_FOLDER: \'${BASE_LOG_FOLDER}\'" 106 | echo "MAX_LOG_AGE_IN_DAYS: \'${MAX_LOG_AGE_IN_DAYS}\'" 107 | echo "ENABLE_DELETE: \'${ENABLE_DELETE}\'" 108 | 109 | cleanup() { 110 | echo "Executing Find Statement: $1" 111 | FILES_MARKED_FOR_DELETE=$(eval $1) 112 | echo "Process will be Deleting the following files or directories:" 113 | echo "${FILES_MARKED_FOR_DELETE}" 114 | echo "Process will be Deleting $(echo "${FILES_MARKED_FOR_DELETE}" | 115 | grep -v \'^$\' | wc -l) files or directories" 116 | 117 | echo "" 118 | if [ "${ENABLE_DELETE}" == "true" ]; then 119 | if [ "${FILES_MARKED_FOR_DELETE}" != "" ]; then 120 | echo "Executing Delete Statement: $2" 121 | eval $2 122 | DELETE_STMT_EXIT_CODE=$? 123 | if [ "${DELETE_STMT_EXIT_CODE}" != "0" ]; then 124 | echo "Delete process failed with exit code \'${DELETE_STMT_EXIT_CODE}\'" 125 | 126 | exit ${DELETE_STMT_EXIT_CODE} 127 | fi 128 | else 129 | echo "WARN: No files or directories to Delete" 130 | fi 131 | else 132 | echo "WARN: You have opted to skip deleting the files or directories" 133 | fi 134 | } 135 | 136 | echo "" 137 | echo "Running Cleanup Process..." 138 | 139 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type f -mtime +${MAX_LOG_AGE_IN_DAYS}" 140 | DELETE_STMT="${FIND_STATEMENT} -exec rm -f {} \;" 141 | 142 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 143 | CLEANUP_EXIT_CODE=$? 144 | 145 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type d -empty" 146 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 147 | 148 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 149 | CLEANUP_EXIT_CODE=$? 150 | 151 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/* -type d -empty" 152 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 153 | 154 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 155 | CLEANUP_EXIT_CODE=$? 156 | 157 | echo "Finished Running Cleanup Process" 158 | """ 159 | 160 | create_log_cleanup_script = BashOperator( 161 | task_id=f'create_log_cleanup_script', 162 | bash_command=f""" 163 | echo '{log_cleanup}' > {TEMP_LOG_CLEANUP_SCRIPT_PATH} 164 | chmod +x {TEMP_LOG_CLEANUP_SCRIPT_PATH} 165 | current_host=$(echo $HOSTNAME) 166 | echo "Current Host: $current_host" 167 | hosts_string={AIRFLOW_HOSTS} 168 | echo "All Scheduler Hosts: $hosts_string" 169 | IFS=',' read -ra host_array <<< "$hosts_string" 170 | for host in "${{host_array[@]}}" 171 | do 172 | if [ "$host" != "$current_host" ]; then 173 | echo "Copying log_cleanup script to $host..." 174 | scp {TEMP_LOG_CLEANUP_SCRIPT_PATH} $host:{TEMP_LOG_CLEANUP_SCRIPT_PATH} 175 | echo "Making the script executable..." 176 | ssh -o StrictHostKeyChecking=no $host "chmod +x {TEMP_LOG_CLEANUP_SCRIPT_PATH}" 177 | fi 178 | done 179 | """, 180 | dag=dag) 181 | 182 | for host in AIRFLOW_HOSTS.split(","): 183 | for DIR_ID, DIRECTORY in enumerate(DIRECTORIES_TO_DELETE): 184 | LOG_CLEANUP_COMMAND = f'{TEMP_LOG_CLEANUP_SCRIPT_PATH} {DIRECTORY} {DEFAULT_MAX_LOG_AGE_IN_DAYS} {str(ENABLE_DELETE).lower()}' 185 | cleanup_task = BashOperator( 186 | task_id=f'airflow_log_cleanup_{host}_dir_{DIR_ID}', 187 | bash_command=f""" 188 | echo "Executing cleanup script..." 189 | ssh -o StrictHostKeyChecking=no {host} "{LOG_CLEANUP_COMMAND}" 190 | echo "Removing cleanup script..." 
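            # Delete the temporary wrapper script on the remote host now that this cleanup run has finished.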
191 | ssh -o StrictHostKeyChecking=no {host} "rm {TEMP_LOG_CLEANUP_SCRIPT_PATH}" 192 | """, 193 | dag=dag) 194 | 195 | cleanup_task.set_upstream(create_log_cleanup_script) 196 | -------------------------------------------------------------------------------- /log-cleanup/airflow-log-cleanup.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically clean 3 | out the task logs to avoid those getting too big. 4 | airflow trigger_dag --conf '[curly-braces]"maxLogAgeInDays":30[curly-braces]' airflow-log-cleanup 5 | --conf options: 6 | maxLogAgeInDays: - Optional 7 | """ 8 | import logging 9 | import os 10 | from datetime import timedelta 11 | 12 | import airflow 13 | import jinja2 14 | from airflow.configuration import conf 15 | from airflow.models import DAG, Variable 16 | from airflow.operators.bash_operator import BashOperator 17 | from airflow.operators.dummy_operator import DummyOperator 18 | 19 | # airflow-log-cleanup 20 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 21 | START_DATE = airflow.utils.dates.days_ago(1) 22 | try: 23 | BASE_LOG_FOLDER = conf.get("core", "BASE_LOG_FOLDER").rstrip("/") 24 | except Exception as e: 25 | BASE_LOG_FOLDER = conf.get("logging", "BASE_LOG_FOLDER").rstrip("/") 26 | # How often to Run. @daily - Once a day at Midnight 27 | SCHEDULE_INTERVAL = "@daily" 28 | # Who is listed as the owner of this DAG in the Airflow Web Server 29 | DAG_OWNER_NAME = "operations" 30 | # List of email address to send email alerts to if this job fails 31 | ALERT_EMAIL_ADDRESSES = [] 32 | # Length to retain the log files if not already provided in the conf. If this 33 | # is set to 30, the job will remove those files that are 30 days old or older 34 | DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get( 35 | "airflow_log_cleanup__max_log_age_in_days", 30 36 | ) 37 | # Whether the job should delete the logs or not. Included if you want to 38 | # temporarily avoid deleting the logs 39 | ENABLE_DELETE = True 40 | # The number of worker nodes you have in Airflow. Will attempt to run this 41 | # process for however many workers there are so that each worker gets its 42 | # logs cleared. 43 | NUMBER_OF_WORKERS = 1 44 | DIRECTORIES_TO_DELETE = [BASE_LOG_FOLDER] 45 | ENABLE_DELETE_CHILD_LOG = Variable.get( 46 | "airflow_log_cleanup__enable_delete_child_log", "False" 47 | ) 48 | LOG_CLEANUP_PROCESS_LOCK_FILE = "/tmp/airflow_log_cleanup_worker.lock" 49 | logging.info("ENABLE_DELETE_CHILD_LOG " + ENABLE_DELETE_CHILD_LOG) 50 | 51 | if not BASE_LOG_FOLDER or BASE_LOG_FOLDER.strip() == "": 52 | raise ValueError( 53 | "BASE_LOG_FOLDER variable is empty in airflow.cfg. It can be found " 54 | "under the [core] (<2.0.0) section or [logging] (>=2.0.0) in the cfg file. " 55 | "Kindly provide an appropriate directory path." 
56 | ) 57 | 58 | if ENABLE_DELETE_CHILD_LOG.lower() == "true": 59 | try: 60 | CHILD_PROCESS_LOG_DIRECTORY = conf.get( 61 | "scheduler", "CHILD_PROCESS_LOG_DIRECTORY" 62 | ) 63 | if CHILD_PROCESS_LOG_DIRECTORY != ' ': 64 | DIRECTORIES_TO_DELETE.append(CHILD_PROCESS_LOG_DIRECTORY) 65 | except Exception as e: 66 | logging.exception( 67 | "Could not obtain CHILD_PROCESS_LOG_DIRECTORY from " + 68 | "Airflow Configurations: " + str(e) 69 | ) 70 | 71 | default_args = { 72 | 'owner': DAG_OWNER_NAME, 73 | 'depends_on_past': False, 74 | 'email': ALERT_EMAIL_ADDRESSES, 75 | 'email_on_failure': True, 76 | 'email_on_retry': False, 77 | 'start_date': START_DATE, 78 | 'retries': 1, 79 | 'retry_delay': timedelta(minutes=1) 80 | } 81 | 82 | dag = DAG( 83 | DAG_ID, 84 | default_args=default_args, 85 | schedule_interval=SCHEDULE_INTERVAL, 86 | start_date=START_DATE, 87 | tags=['teamclairvoyant', 'airflow-maintenance-dags'], 88 | template_undefined=jinja2.Undefined 89 | ) 90 | if hasattr(dag, 'doc_md'): 91 | dag.doc_md = __doc__ 92 | if hasattr(dag, 'catchup'): 93 | dag.catchup = False 94 | 95 | start = DummyOperator( 96 | task_id='start', 97 | dag=dag) 98 | 99 | log_cleanup = """ 100 | 101 | echo "Getting Configurations..." 102 | BASE_LOG_FOLDER="{{params.directory}}" 103 | WORKER_SLEEP_TIME="{{params.sleep_time}}" 104 | 105 | sleep ${WORKER_SLEEP_TIME}s 106 | 107 | MAX_LOG_AGE_IN_DAYS="{{dag_run.conf.maxLogAgeInDays}}" 108 | if [ "${MAX_LOG_AGE_IN_DAYS}" == "" ]; then 109 | echo "maxLogAgeInDays conf variable isn't included. Using Default '""" + str(DEFAULT_MAX_LOG_AGE_IN_DAYS) + """'." 110 | MAX_LOG_AGE_IN_DAYS='""" + str(DEFAULT_MAX_LOG_AGE_IN_DAYS) + """' 111 | fi 112 | ENABLE_DELETE=""" + str("true" if ENABLE_DELETE else "false") + """ 113 | echo "Finished Getting Configurations" 114 | echo "" 115 | 116 | echo "Configurations:" 117 | echo "BASE_LOG_FOLDER: '${BASE_LOG_FOLDER}'" 118 | echo "MAX_LOG_AGE_IN_DAYS: '${MAX_LOG_AGE_IN_DAYS}'" 119 | echo "ENABLE_DELETE: '${ENABLE_DELETE}'" 120 | 121 | cleanup() { 122 | echo "Executing Find Statement: $1" 123 | FILES_MARKED_FOR_DELETE=`eval $1` 124 | echo "Process will be Deleting the following File(s)/Directory(s):" 125 | echo "${FILES_MARKED_FOR_DELETE}" 126 | echo "Process will be Deleting `echo "${FILES_MARKED_FOR_DELETE}" | \ 127 | grep -v '^$' | wc -l` File(s)/Directory(s)" \ 128 | # "grep -v '^$'" - removes empty lines. 129 | # "wc -l" - Counts the number of lines 130 | echo "" 131 | if [ "${ENABLE_DELETE}" == "true" ]; 132 | then 133 | if [ "${FILES_MARKED_FOR_DELETE}" != "" ]; 134 | then 135 | echo "Executing Delete Statement: $2" 136 | eval $2 137 | DELETE_STMT_EXIT_CODE=$? 138 | if [ "${DELETE_STMT_EXIT_CODE}" != "0" ]; then 139 | echo "Delete process failed with exit code \ 140 | '${DELETE_STMT_EXIT_CODE}'" 141 | 142 | echo "Removing lock file..." 143 | rm -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 144 | if [ "${REMOVE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 145 | echo "Error removing the lock file. \ 146 | Check file permissions.\ 147 | To re-run the DAG, ensure that the lock file has been \ 148 | deleted (""" + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """)." 149 | exit ${REMOVE_LOCK_FILE_EXIT_CODE} 150 | fi 151 | exit ${DELETE_STMT_EXIT_CODE} 152 | fi 153 | else 154 | echo "WARN: No File(s)/Directory(s) to Delete" 155 | fi 156 | else 157 | echo "WARN: You're opted to skip deleting the File(s)/Directory(s)!!!" 158 | fi 159 | } 160 | 161 | 162 | if [ ! 
-f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ ]; then 163 | 164 | echo "Lock file not found on this node! \ 165 | Creating it to prevent collisions..." 166 | touch """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 167 | CREATE_LOCK_FILE_EXIT_CODE=$? 168 | if [ "${CREATE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 169 | echo "Error creating the lock file. \ 170 | Check if the airflow user can create files under tmp directory. \ 171 | Exiting..." 172 | exit ${CREATE_LOCK_FILE_EXIT_CODE} 173 | fi 174 | 175 | echo "" 176 | echo "Running Cleanup Process..." 177 | 178 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type f -mtime \ 179 | +${MAX_LOG_AGE_IN_DAYS}" 180 | DELETE_STMT="${FIND_STATEMENT} -exec rm -f {} \;" 181 | 182 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 183 | CLEANUP_EXIT_CODE=$? 184 | 185 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type d -empty" 186 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 187 | 188 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 189 | CLEANUP_EXIT_CODE=$? 190 | 191 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/* -type d -empty" 192 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 193 | 194 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 195 | CLEANUP_EXIT_CODE=$? 196 | 197 | echo "Finished Running Cleanup Process" 198 | 199 | echo "Deleting lock file..." 200 | rm -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 201 | REMOVE_LOCK_FILE_EXIT_CODE=$? 202 | if [ "${REMOVE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 203 | echo "Error removing the lock file. Check file permissions. To re-run the DAG, ensure that the lock file has been deleted (""" + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """)." 204 | exit ${REMOVE_LOCK_FILE_EXIT_CODE} 205 | fi 206 | 207 | else 208 | echo "Another task is already deleting logs on this worker node. \ 209 | Skipping it!" 210 | echo "If you believe you're receiving this message in error, kindly check \ 211 | if """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ exists and delete it." 212 | exit 0 213 | fi 214 | 215 | """ 216 | 217 | for log_cleanup_id in range(1, NUMBER_OF_WORKERS + 1): 218 | 219 | for dir_id, directory in enumerate(DIRECTORIES_TO_DELETE): 220 | 221 | log_cleanup_op = BashOperator( 222 | task_id='log_cleanup_worker_num_' + 223 | str(log_cleanup_id) + '_dir_' + str(dir_id), 224 | bash_command=log_cleanup, 225 | params={ 226 | "directory": str(directory), 227 | "sleep_time": int(log_cleanup_id)*3}, 228 | dag=dag) 229 | 230 | log_cleanup_op.set_upstream(start) 231 | -------------------------------------------------------------------------------- /backup-configs/airflow-backup-configs.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically take 3 | backups of various Airflow configurations and files. 4 | 5 | airflow trigger_dag airflow-backup-configs 6 | 7 | """ 8 | from airflow.models import DAG, Variable 9 | from airflow.operators.python_operator import PythonOperator 10 | from airflow.configuration import conf 11 | from datetime import datetime, timedelta 12 | import os 13 | import airflow 14 | import logging 15 | import subprocess 16 | # airflow-backup-configs 17 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 18 | # How often to Run. 
@daily - Once a day at Midnight 19 | START_DATE = airflow.utils.dates.days_ago(1) 20 | # Who is listed as the owner of this DAG in the Airflow Web Server 21 | SCHEDULE_INTERVAL = "@daily" 22 | # List of email address to send email alerts to if this job fails 23 | DAG_OWNER_NAME = "operations" 24 | ALERT_EMAIL_ADDRESSES = [] 25 | # Format options: https://www.tutorialspoint.com/python/time_strftime.htm 26 | BACKUP_FOLDER_DATE_FORMAT = "%Y%m%d%H%M%S" 27 | BACKUP_HOME_DIRECTORY = Variable.get("airflow_backup_config__backup_home_directory", "/tmp/airflow_backups") 28 | BACKUPS_ENABLED = { 29 | "dag_directory": True, 30 | "log_directory": True, 31 | "airflow_cfg": True, 32 | "pip_packages": True 33 | } 34 | # How many backups to retain (not including the one that was just taken) 35 | BACKUP_RETENTION_COUNT = 7 36 | 37 | default_args = { 38 | 'owner': DAG_OWNER_NAME, 39 | 'email': ALERT_EMAIL_ADDRESSES, 40 | 'email_on_failure': True, 41 | 'email_on_retry': False, 42 | 'start_date': START_DATE, 43 | 'retries': 1, 44 | 'retry_delay': timedelta(minutes=1) 45 | } 46 | 47 | dag = DAG( 48 | DAG_ID, 49 | default_args=default_args, 50 | schedule_interval=SCHEDULE_INTERVAL, 51 | start_date=START_DATE, 52 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 53 | ) 54 | if hasattr(dag, 'doc_md'): 55 | dag.doc_md = __doc__ 56 | if hasattr(dag, 'catchup'): 57 | dag.catchup = False 58 | 59 | 60 | def print_configuration_fn(**context): 61 | logging.info("Executing print_configuration_fn") 62 | 63 | logging.info("Loading Configurations...") 64 | BACKUP_FOLDER_DATE = datetime.now().strftime(BACKUP_FOLDER_DATE_FORMAT) 65 | BACKUP_DIRECTORY = BACKUP_HOME_DIRECTORY + "/" + BACKUP_FOLDER_DATE + "/" 66 | 67 | logging.info("Configurations:") 68 | logging.info( 69 | "BACKUP_FOLDER_DATE_FORMAT: " + str(BACKUP_FOLDER_DATE_FORMAT) 70 | ) 71 | logging.info("BACKUP_FOLDER_DATE: " + str(BACKUP_FOLDER_DATE)) 72 | logging.info("BACKUP_HOME_DIRECTORY: " + str(BACKUP_HOME_DIRECTORY)) 73 | logging.info("BACKUP_DIRECTORY: " + str(BACKUP_DIRECTORY)) 74 | logging.info( 75 | "BACKUP_RETENTION_COUNT: " + str(BACKUP_RETENTION_COUNT) 76 | ) 77 | logging.info("") 78 | 79 | logging.info("Pushing to XCom for Downstream Processes") 80 | context["ti"].xcom_push( 81 | key="backup_configs.backup_home_directory", 82 | value=BACKUP_HOME_DIRECTORY 83 | ) 84 | context["ti"].xcom_push( 85 | key="backup_configs.backup_directory", 86 | value=BACKUP_DIRECTORY 87 | ) 88 | context["ti"].xcom_push( 89 | key="backup_configs.backup_retention_count", 90 | value=BACKUP_RETENTION_COUNT 91 | ) 92 | 93 | 94 | print_configuration_op = PythonOperator( 95 | task_id='print_configuration', 96 | python_callable=print_configuration_fn, 97 | provide_context=True, 98 | dag=dag) 99 | 100 | 101 | def execute_shell_cmd(cmd): 102 | logging.info("Executing Command: `" + cmd + "`") 103 | proc = subprocess.Popen(cmd, shell=True, universal_newlines=True) 104 | proc.communicate() 105 | exit_code = proc.returncode 106 | if exit_code != 0: 107 | exit(exit_code) 108 | 109 | 110 | def delete_old_backups_fn(**context): 111 | logging.info("Executing delete_old_backups_fn") 112 | 113 | logging.info("Loading Configurations...") 114 | BACKUP_HOME_DIRECTORY = context["ti"].xcom_pull( 115 | task_ids=print_configuration_op.task_id, 116 | key='backup_configs.backup_home_directory' 117 | ) 118 | BACKUP_RETENTION_COUNT = context["ti"].xcom_pull( 119 | task_ids=print_configuration_op.task_id, 120 | key='backup_configs.backup_retention_count' 121 | ) 122 | 123 | logging.info("Configurations:") 
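    # The prune below keeps the first BACKUP_RETENTION_COUNT + 1 entries of the
    # reversed folder listing (the retained backups plus the one just taken) and
    # deletes the rest.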
124 | logging.info("BACKUP_HOME_DIRECTORY: " + str(BACKUP_HOME_DIRECTORY)) 125 | logging.info("BACKUP_RETENTION_COUNT: " + str(BACKUP_RETENTION_COUNT)) 126 | logging.info("") 127 | 128 | if BACKUP_RETENTION_COUNT < 0: 129 | logging.info( 130 | "Retention is less then 0. Assuming to allow infinite backups. " 131 | "Skipping..." 132 | ) 133 | return 134 | 135 | backup_folders = [ 136 | os.path.join(BACKUP_HOME_DIRECTORY, f) 137 | for f in os.listdir(BACKUP_HOME_DIRECTORY) 138 | if os.path.isdir(os.path.join(BACKUP_HOME_DIRECTORY, f)) 139 | ] 140 | backup_folders.reverse() 141 | logging.info("backup_folders: " + str(backup_folders)) 142 | logging.info("") 143 | 144 | cnt = 0 145 | for backup_folder in backup_folders: 146 | logging.info( 147 | "cnt = " + str(cnt) + ", backup_folder = " + str(backup_folder) 148 | ) 149 | if cnt > BACKUP_RETENTION_COUNT: 150 | logging.info("Deleting Backup Folder: " + str(backup_folder)) 151 | execute_shell_cmd("rm -rf " + str(backup_folder)) 152 | cnt += 1 153 | 154 | 155 | delete_old_backups_op = PythonOperator( 156 | task_id='delete_old_backups', 157 | python_callable=delete_old_backups_fn, 158 | provide_context=True, 159 | dag=dag) 160 | 161 | 162 | def general_backup_fn(**context): 163 | logging.info("Executing general_backup_fn") 164 | 165 | logging.info("Loading Configurations...") 166 | PATH_TO_BACKUP = context["params"].get("path_to_backup") 167 | TARGET_DIRECTORY_NAME = context["params"].get("target_directory_name") 168 | BACKUP_DIRECTORY = context["ti"].xcom_pull( 169 | task_ids=print_configuration_op.task_id, 170 | key='backup_configs.backup_directory' 171 | ) 172 | 173 | logging.info("Configurations:") 174 | logging.info("PATH_TO_BACKUP: " + str(PATH_TO_BACKUP)) 175 | logging.info("TARGET_DIRECTORY_NAME: " + str(TARGET_DIRECTORY_NAME)) 176 | logging.info("BACKUP_DIRECTORY: " + str(BACKUP_DIRECTORY)) 177 | logging.info("") 178 | 179 | execute_shell_cmd("mkdir -p " + str(BACKUP_DIRECTORY)) 180 | 181 | execute_shell_cmd( 182 | "cp -r -n " + str(PATH_TO_BACKUP) + " " + str(BACKUP_DIRECTORY) + 183 | (TARGET_DIRECTORY_NAME if TARGET_DIRECTORY_NAME is not None else "") 184 | ) 185 | 186 | 187 | def pip_packages_backup_fn(**context): 188 | logging.info("Executing pip_packages_backup_fn") 189 | 190 | logging.info("Loading Configurations...") 191 | 192 | BACKUP_DIRECTORY = context["ti"].xcom_pull( 193 | task_ids=print_configuration_op.task_id, 194 | key='backup_configs.backup_directory' 195 | ) 196 | 197 | logging.info("Configurations:") 198 | logging.info("BACKUP_DIRECTORY: " + str(BACKUP_DIRECTORY)) 199 | logging.info("") 200 | if not os.path.exists(BACKUP_DIRECTORY): 201 | os.makedirs(BACKUP_DIRECTORY) 202 | execute_shell_cmd("pip freeze > " + BACKUP_DIRECTORY + "pip_freeze.out") 203 | 204 | 205 | if BACKUPS_ENABLED.get("dag_directory"): 206 | backup_op = PythonOperator( 207 | task_id='backup_dag_directory', 208 | python_callable=general_backup_fn, 209 | params={"path_to_backup": conf.get("core", "DAGS_FOLDER")}, 210 | provide_context=True, 211 | dag=dag) 212 | print_configuration_op.set_downstream(backup_op) 213 | backup_op.set_downstream(delete_old_backups_op) 214 | 215 | if BACKUPS_ENABLED.get("log_directory"): 216 | try: 217 | BASE_LOG_FOLDER = conf.get("core", "BASE_LOG_FOLDER") 218 | except Exception as e: 219 | BASE_LOG_FOLDER = conf.get("logging", "BASE_LOG_FOLDER") 220 | 221 | backup_op = PythonOperator( 222 | task_id='backup_log_directory', 223 | python_callable=general_backup_fn, 224 | params={ 225 | "path_to_backup": BASE_LOG_FOLDER, 226 | 
"target_directory_name": "logs" 227 | }, 228 | provide_context=True, 229 | dag=dag) 230 | print_configuration_op.set_downstream(backup_op) 231 | backup_op.set_downstream(delete_old_backups_op) 232 | 233 | if BACKUPS_ENABLED.get("airflow_cfg"): 234 | backup_op = PythonOperator( 235 | task_id='backup_airflow_cfg', 236 | python_callable=general_backup_fn, 237 | params={ 238 | "path_to_backup": (os.environ.get('AIRFLOW_HOME') if os.environ.get('AIRFLOW_HOME') is not None else "~/airflow/") + "/airflow.cfg" 239 | }, 240 | provide_context=True, 241 | dag=dag) 242 | print_configuration_op.set_downstream(backup_op) 243 | backup_op.set_downstream(delete_old_backups_op) 244 | 245 | if BACKUPS_ENABLED.get("pip_packages"): 246 | backup_op = PythonOperator( 247 | task_id='backup_pip_packages', 248 | python_callable=pip_packages_backup_fn, 249 | provide_context=True, 250 | dag=dag) 251 | print_configuration_op.set_downstream(backup_op) 252 | backup_op.set_downstream(delete_old_backups_op) 253 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /db-cleanup/airflow-db-cleanup.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically clean 3 | out the DagRun, TaskInstance, Log, XCom, Job DB and SlaMiss entries to avoid 4 | having too much data in your Airflow MetaStore. 
5 | 
6 | airflow trigger_dag --conf '[curly-braces]"maxDBEntryAgeInDays":30[curly-braces]' airflow-db-cleanup
7 | 
8 | --conf options:
9 | maxDBEntryAgeInDays:<INT> - Optional
10 | 
11 | """
12 | import airflow
13 | from airflow import settings
14 | from airflow.configuration import conf
15 | from airflow.models import DAG, DagTag, DagModel, DagRun, Log, XCom, SlaMiss, TaskInstance, Variable
16 | try:
17 | from airflow.jobs import BaseJob
18 | except Exception as e:
19 | from airflow.jobs.base_job import BaseJob
20 | from airflow.operators.python_operator import PythonOperator
21 | from datetime import datetime, timedelta
22 | import dateutil.parser
23 | import logging
24 | import os
25 | from sqlalchemy import func, and_
26 | from sqlalchemy.exc import ProgrammingError
27 | from sqlalchemy.orm import load_only
28 | 
29 | try:
30 | # airflow.utils.timezone is available from v1.10 onwards
31 | from airflow.utils import timezone
32 | now = timezone.utcnow
33 | except ImportError:
34 | now = datetime.utcnow
35 | 
36 | # airflow-db-cleanup
37 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "")
38 | START_DATE = airflow.utils.dates.days_ago(1)
39 | # How often to Run. @daily - Once a day at Midnight (UTC)
40 | SCHEDULE_INTERVAL = "@daily"
41 | # Who is listed as the owner of this DAG in the Airflow Web Server
42 | DAG_OWNER_NAME = "operations"
43 | # List of email addresses to send email alerts to if this job fails
44 | ALERT_EMAIL_ADDRESSES = []
45 | # Number of days of DB entries to retain if not already provided in the conf. If this
46 | # is set to 30, the job will remove DB entries that are 30 days old or older.
47 | 
48 | DEFAULT_MAX_DB_ENTRY_AGE_IN_DAYS = int(
49 | Variable.get("airflow_db_cleanup__max_db_entry_age_in_days", 30)
50 | )
51 | # Prints the database entries that will be deleted; set to False to avoid printing large lists and slowing down the process
52 | PRINT_DELETES = True
53 | # Whether the job should delete the db entries or not. Included if you want to
54 | # temporarily avoid deleting the db entries.
55 | ENABLE_DELETE = True
56 | 
57 | # get dag model last scheduler run
58 | try:
59 | dag_model_last_scheduler_run = DagModel.last_scheduler_run
60 | except AttributeError:
61 | dag_model_last_scheduler_run = DagModel.last_parsed_time
62 | 
63 | # List of all the objects that will be deleted. Comment out the DB objects you
64 | # want to skip.
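# Each entry in DATABASE_OBJECTS is a plain dict consumed by cleanup_function():
#   "airflow_db_model"   - the SQLAlchemy model class whose old rows get purged
#   "age_check_column"   - the datetime column compared against the max_date cutoff
#   "keep_last"          - if True, the most recent DagRun (max execution_date) per group is never deleted
#   "keep_last_filters"  - extra filters applied when selecting the runs to keep (e.g. exclude externally triggered runs)
#   "keep_last_group_by" - column the "keep last" subquery groups by (e.g. DagRun.dag_id)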
65 | DATABASE_OBJECTS = [ 66 | { 67 | "airflow_db_model": BaseJob, 68 | "age_check_column": BaseJob.latest_heartbeat, 69 | "keep_last": False, 70 | "keep_last_filters": None, 71 | "keep_last_group_by": None 72 | }, 73 | { 74 | "airflow_db_model": DagRun, 75 | "age_check_column": DagRun.execution_date, 76 | "keep_last": True, 77 | "keep_last_filters": [DagRun.external_trigger.is_(False)], 78 | "keep_last_group_by": DagRun.dag_id 79 | }, 80 | { 81 | "airflow_db_model": TaskInstance, 82 | "age_check_column": TaskInstance.execution_date, 83 | "keep_last": False, 84 | "keep_last_filters": None, 85 | "keep_last_group_by": None 86 | }, 87 | { 88 | "airflow_db_model": Log, 89 | "age_check_column": Log.dttm, 90 | "keep_last": False, 91 | "keep_last_filters": None, 92 | "keep_last_group_by": None 93 | }, 94 | { 95 | "airflow_db_model": XCom, 96 | "age_check_column": XCom.execution_date, 97 | "keep_last": False, 98 | "keep_last_filters": None, 99 | "keep_last_group_by": None 100 | }, 101 | { 102 | "airflow_db_model": SlaMiss, 103 | "age_check_column": SlaMiss.execution_date, 104 | "keep_last": False, 105 | "keep_last_filters": None, 106 | "keep_last_group_by": None 107 | }, 108 | { 109 | "airflow_db_model": DagModel, 110 | "age_check_column": dag_model_last_scheduler_run, 111 | "keep_last": False, 112 | "keep_last_filters": None, 113 | "keep_last_group_by": None 114 | }] 115 | 116 | # Check for TaskReschedule model 117 | try: 118 | from airflow.models import TaskReschedule 119 | DATABASE_OBJECTS.append({ 120 | "airflow_db_model": TaskReschedule, 121 | "age_check_column": TaskReschedule.execution_date, 122 | "keep_last": False, 123 | "keep_last_filters": None, 124 | "keep_last_group_by": None 125 | }) 126 | 127 | except Exception as e: 128 | logging.error(e) 129 | 130 | # Check for TaskFail model 131 | try: 132 | from airflow.models import TaskFail 133 | DATABASE_OBJECTS.append({ 134 | "airflow_db_model": TaskFail, 135 | "age_check_column": TaskFail.execution_date, 136 | "keep_last": False, 137 | "keep_last_filters": None, 138 | "keep_last_group_by": None 139 | }) 140 | 141 | except Exception as e: 142 | logging.error(e) 143 | 144 | # Check for RenderedTaskInstanceFields model 145 | try: 146 | from airflow.models import RenderedTaskInstanceFields 147 | DATABASE_OBJECTS.append({ 148 | "airflow_db_model": RenderedTaskInstanceFields, 149 | "age_check_column": RenderedTaskInstanceFields.execution_date, 150 | "keep_last": False, 151 | "keep_last_filters": None, 152 | "keep_last_group_by": None 153 | }) 154 | 155 | except Exception as e: 156 | logging.error(e) 157 | 158 | # Check for ImportError model 159 | try: 160 | from airflow.models import ImportError 161 | DATABASE_OBJECTS.append({ 162 | "airflow_db_model": ImportError, 163 | "age_check_column": ImportError.timestamp, 164 | "keep_last": False, 165 | "keep_last_filters": None, 166 | "keep_last_group_by": None 167 | }) 168 | 169 | except Exception as e: 170 | logging.error(e) 171 | 172 | # Check for celery executor 173 | airflow_executor = str(conf.get("core", "executor")) 174 | logging.info("Airflow Executor: " + str(airflow_executor)) 175 | if(airflow_executor == "CeleryExecutor"): 176 | logging.info("Including Celery Modules") 177 | try: 178 | from celery.backends.database.models import Task, TaskSet 179 | DATABASE_OBJECTS.extend(( 180 | { 181 | "airflow_db_model": Task, 182 | "age_check_column": Task.date_done, 183 | "keep_last": False, 184 | "keep_last_filters": None, 185 | "keep_last_group_by": None 186 | }, 187 | { 188 | "airflow_db_model": 
TaskSet, 189 | "age_check_column": TaskSet.date_done, 190 | "keep_last": False, 191 | "keep_last_filters": None, 192 | "keep_last_group_by": None 193 | })) 194 | 195 | except Exception as e: 196 | logging.error(e) 197 | 198 | session = settings.Session() 199 | 200 | default_args = { 201 | 'owner': DAG_OWNER_NAME, 202 | 'depends_on_past': False, 203 | 'email': ALERT_EMAIL_ADDRESSES, 204 | 'email_on_failure': True, 205 | 'email_on_retry': False, 206 | 'start_date': START_DATE, 207 | 'retries': 1, 208 | 'retry_delay': timedelta(minutes=1) 209 | } 210 | 211 | dag = DAG( 212 | DAG_ID, 213 | default_args=default_args, 214 | schedule_interval=SCHEDULE_INTERVAL, 215 | start_date=START_DATE, 216 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 217 | ) 218 | if hasattr(dag, 'doc_md'): 219 | dag.doc_md = __doc__ 220 | if hasattr(dag, 'catchup'): 221 | dag.catchup = False 222 | 223 | 224 | def print_configuration_function(**context): 225 | logging.info("Loading Configurations...") 226 | dag_run_conf = context.get("dag_run").conf 227 | logging.info("dag_run.conf: " + str(dag_run_conf)) 228 | max_db_entry_age_in_days = None 229 | if dag_run_conf: 230 | max_db_entry_age_in_days = dag_run_conf.get( 231 | "maxDBEntryAgeInDays", None 232 | ) 233 | logging.info("maxDBEntryAgeInDays from dag_run.conf: " + str(dag_run_conf)) 234 | if (max_db_entry_age_in_days is None or max_db_entry_age_in_days < 1): 235 | logging.info( 236 | "maxDBEntryAgeInDays conf variable isn't included or Variable " + 237 | "value is less than 1. Using Default '" + 238 | str(DEFAULT_MAX_DB_ENTRY_AGE_IN_DAYS) + "'" 239 | ) 240 | max_db_entry_age_in_days = DEFAULT_MAX_DB_ENTRY_AGE_IN_DAYS 241 | max_date = now() + timedelta(-max_db_entry_age_in_days) 242 | logging.info("Finished Loading Configurations") 243 | logging.info("") 244 | 245 | logging.info("Configurations:") 246 | logging.info("max_db_entry_age_in_days: " + str(max_db_entry_age_in_days)) 247 | logging.info("max_date: " + str(max_date)) 248 | logging.info("enable_delete: " + str(ENABLE_DELETE)) 249 | logging.info("session: " + str(session)) 250 | logging.info("") 251 | 252 | logging.info("Setting max_execution_date to XCom for Downstream Processes") 253 | context["ti"].xcom_push(key="max_date", value=max_date.isoformat()) 254 | 255 | 256 | print_configuration = PythonOperator( 257 | task_id='print_configuration', 258 | python_callable=print_configuration_function, 259 | provide_context=True, 260 | dag=dag) 261 | 262 | 263 | def cleanup_function(**context): 264 | 265 | logging.info("Retrieving max_execution_date from XCom") 266 | max_date = context["ti"].xcom_pull( 267 | task_ids=print_configuration.task_id, key="max_date" 268 | ) 269 | max_date = dateutil.parser.parse(max_date) # stored as iso8601 str in xcom 270 | 271 | airflow_db_model = context["params"].get("airflow_db_model") 272 | state = context["params"].get("state") 273 | age_check_column = context["params"].get("age_check_column") 274 | keep_last = context["params"].get("keep_last") 275 | keep_last_filters = context["params"].get("keep_last_filters") 276 | keep_last_group_by = context["params"].get("keep_last_group_by") 277 | 278 | logging.info("Configurations:") 279 | logging.info("max_date: " + str(max_date)) 280 | logging.info("enable_delete: " + str(ENABLE_DELETE)) 281 | logging.info("session: " + str(session)) 282 | logging.info("airflow_db_model: " + str(airflow_db_model)) 283 | logging.info("state: " + str(state)) 284 | logging.info("age_check_column: " + str(age_check_column)) 285 | logging.info("keep_last: 
" + str(keep_last)) 286 | logging.info("keep_last_filters: " + str(keep_last_filters)) 287 | logging.info("keep_last_group_by: " + str(keep_last_group_by)) 288 | 289 | logging.info("") 290 | 291 | logging.info("Running Cleanup Process...") 292 | 293 | try: 294 | query = session.query(airflow_db_model).options( 295 | load_only(age_check_column) 296 | ) 297 | 298 | logging.info("INITIAL QUERY : " + str(query)) 299 | 300 | if keep_last: 301 | 302 | subquery = session.query(func.max(DagRun.execution_date)) 303 | # workaround for MySQL "table specified twice" issue 304 | # https://github.com/teamclairvoyant/airflow-maintenance-dags/issues/41 305 | if keep_last_filters is not None: 306 | for entry in keep_last_filters: 307 | subquery = subquery.filter(entry) 308 | 309 | logging.info("SUB QUERY [keep_last_filters]: " + str(subquery)) 310 | 311 | if keep_last_group_by is not None: 312 | subquery = subquery.group_by(keep_last_group_by) 313 | logging.info( 314 | "SUB QUERY [keep_last_group_by]: " + str(subquery)) 315 | 316 | subquery = subquery.from_self() 317 | 318 | query = query.filter( 319 | and_(age_check_column.notin_(subquery)), 320 | and_(age_check_column <= max_date) 321 | ) 322 | 323 | else: 324 | query = query.filter(age_check_column <= max_date,) 325 | 326 | if PRINT_DELETES: 327 | entries_to_delete = query.all() 328 | 329 | logging.info("Query: " + str(query)) 330 | logging.info( 331 | "Process will be Deleting the following " + 332 | str(airflow_db_model.__name__) + "(s):" 333 | ) 334 | for entry in entries_to_delete: 335 | logging.info( 336 | "\tEntry: " + str(entry) + ", Date: " + 337 | str(entry.__dict__[str(age_check_column).split(".")[1]]) 338 | ) 339 | 340 | logging.info( 341 | "Process will be Deleting " + str(len(entries_to_delete)) + " " + 342 | str(airflow_db_model.__name__) + "(s)" 343 | ) 344 | else: 345 | logging.warn( 346 | "You've opted to skip printing the db entries to be deleted. Set PRINT_DELETES to True to show entries!!!") 347 | 348 | if ENABLE_DELETE: 349 | logging.info('Performing Delete...') 350 | if airflow_db_model.__name__ == 'DagModel': 351 | logging.info('Deleting tags...') 352 | ids_query = query.from_self().with_entities(DagModel.dag_id) 353 | tags_query = session.query(DagTag).filter(DagTag.dag_id.in_(ids_query)) 354 | logging.info('Tags delete Query: ' + str(tags_query)) 355 | tags_query.delete(synchronize_session=False) 356 | # using bulk delete 357 | query.delete(synchronize_session=False) 358 | session.commit() 359 | logging.info('Finished Performing Delete') 360 | else: 361 | logging.warn( 362 | "You've opted to skip deleting the db entries. Set ENABLE_DELETE to True to delete entries!!!") 363 | 364 | logging.info("Finished Running Cleanup Process") 365 | 366 | except ProgrammingError as e: 367 | logging.error(e) 368 | logging.error(str(airflow_db_model) + 369 | " is not present in the metadata. 
Skipping...")
370 | 
371 | 
372 | for db_object in DATABASE_OBJECTS:
373 | 
374 | cleanup_op = PythonOperator(
375 | task_id='cleanup_' + str(db_object["airflow_db_model"].__name__),
376 | python_callable=cleanup_function,
377 | params=db_object,
378 | provide_context=True,
379 | dag=dag
380 | )
381 | 
382 | print_configuration.set_downstream(cleanup_op)
383 | 
--------------------------------------------------------------------------------
/kill-halted-tasks/airflow-kill-halted-tasks.py:
--------------------------------------------------------------------------------
1 | """
2 | A maintenance workflow that you can deploy into Airflow to periodically kill
3 | off tasks that are running in the background that don't correspond to a running
4 | task in the DB.
5 | 
6 | This is useful because when you kill off a DAG Run or Task through the Airflow
7 | Web Server, the task still runs in the background on one of the executors until
8 | the task is complete.
9 | 
10 | airflow trigger_dag airflow-kill-halted-tasks
11 | 
12 | """
13 | from airflow.models import DAG, DagModel, DagRun, TaskInstance
14 | from airflow import settings
15 | from airflow.operators.python_operator \
16 | import PythonOperator, ShortCircuitOperator
17 | from airflow.operators.email_operator import EmailOperator
18 | from sqlalchemy import and_
19 | from datetime import datetime, timedelta
20 | import os
21 | import re
22 | import logging
23 | import pytz
24 | import airflow
25 | 
26 | 
27 | # airflow-kill-halted-tasks
28 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "")
29 | START_DATE = airflow.utils.dates.days_ago(1)
30 | # How often to Run. @daily - Once a day at Midnight. @hourly - Once an Hour.
31 | SCHEDULE_INTERVAL = "@hourly"
32 | # Who is listed as the owner of this DAG in the Airflow Web Server
33 | DAG_OWNER_NAME = "operations"
34 | # List of email addresses to send email alerts to if this job fails
35 | ALERT_EMAIL_ADDRESSES = []
36 | # Whether to send out an email whenever a process was killed during a DAG Run
37 | # or not
38 | SEND_PROCESS_KILLED_EMAIL = True
39 | # Subject of the email that is sent out when a task is killed by the DAG
40 | PROCESS_KILLED_EMAIL_SUBJECT = DAG_ID + " - Tasks were Killed"
41 | # List of email addresses to send emails to when a task is killed by the DAG
42 | PROCESS_KILLED_EMAIL_ADDRESSES = []
43 | # Whether the job should kill the halted processes or not. Included if you want to
44 | # temporarily avoid killing the processes it finds.
45 | ENABLE_KILL = True 46 | # Whether to print out certain statements meant for debugging 47 | DEBUG = False 48 | 49 | default_args = { 50 | 'owner': DAG_OWNER_NAME, 51 | 'email': ALERT_EMAIL_ADDRESSES, 52 | 'email_on_failure': True, 53 | 'email_on_retry': False, 54 | 'start_date': START_DATE, 55 | 'retries': 1, 56 | 'retry_delay': timedelta(minutes=1) 57 | } 58 | 59 | dag = DAG( 60 | DAG_ID, 61 | default_args=default_args, 62 | schedule_interval=SCHEDULE_INTERVAL, 63 | start_date=START_DATE, 64 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 65 | ) 66 | if hasattr(dag, 'doc_md'): 67 | dag.doc_md = __doc__ 68 | if hasattr(dag, 'catchup'): 69 | dag.catchup = False 70 | 71 | 72 | uid_regex = "(\w+)" 73 | pid_regex = "(\w+)" 74 | ppid_regex = "(\w+)" 75 | processor_scheduling_regex = "(\w+)" 76 | start_time_regex = "([\w:.]+)" 77 | tty_regex = "([\w?/]+)" 78 | cpu_time_regex = "([\w:.]+)" 79 | command_regex = "(.+)" 80 | 81 | # When Search Command is: ps -o pid -o cmd -u `whoami` | grep 'airflow run' 82 | full_regex = '\s*' + pid_regex + '\s+' + command_regex 83 | 84 | airflow_run_regex = '.*run\s([\w._-]*)\s([\w._-]*)\s([\w:.-]*).*' 85 | 86 | 87 | def parse_process_linux_string(line): 88 | if DEBUG: 89 | logging.info("DEBUG: Processing Line: " + str(line)) 90 | full_regex_match = re.search(full_regex, line) 91 | if DEBUG: 92 | for index in range(0, (len(full_regex_match.groups()) + 1)): 93 | group = full_regex_match.group(index) 94 | logging.info( 95 | "DEBUG: index: " + str(index) + ", group: " + str(group) 96 | ) 97 | pid = full_regex_match.group(1) 98 | command = full_regex_match.group(2).strip() 99 | process = {"pid": pid, "command": command} 100 | 101 | if DEBUG: 102 | logging.info("DEBUG: Processing Command: " + str(command)) 103 | airflow_run_regex_match = re.search(airflow_run_regex, command) 104 | if DEBUG: 105 | for index in range(0, (len(airflow_run_regex_match.groups()) + 1)): 106 | group = airflow_run_regex_match.group(index) 107 | logging.info( 108 | "DEBUG: index: " + str(index) + ", group: " + str(group) 109 | ) 110 | process["airflow_dag_id"] = airflow_run_regex_match.group(1) 111 | process["airflow_task_id"] = airflow_run_regex_match.group(2) 112 | process["airflow_execution_date"] = airflow_run_regex_match.group(3) 113 | return process 114 | 115 | 116 | def kill_halted_tasks_function(**context): 117 | logging.info("Getting Configurations...") 118 | airflow_version = airflow.__version__ 119 | session = settings.Session() 120 | 121 | logging.info("Finished Getting Configurations\n") 122 | 123 | logging.info("Configurations:") 124 | logging.info( 125 | "send_process_killed_email: " + str(SEND_PROCESS_KILLED_EMAIL) 126 | ) 127 | logging.info( 128 | "process_killed_email_subject: " + str(PROCESS_KILLED_EMAIL_SUBJECT) 129 | ) 130 | logging.info( 131 | "process_killed_email_addresses: " + 132 | str(PROCESS_KILLED_EMAIL_ADDRESSES) 133 | ) 134 | logging.info("enable_kill: " + str(ENABLE_KILL)) 135 | logging.info("debug: " + str(DEBUG)) 136 | logging.info("session: " + str(session)) 137 | logging.info("airflow_version: " + str(airflow_version)) 138 | logging.info("") 139 | 140 | logging.info("Running Cleanup Process...") 141 | logging.info("") 142 | 143 | process_search_command = ( 144 | "ps -o pid -o cmd -u `whoami` | grep 'airflow run'" 145 | ) 146 | logging.info("Running Search Process: " + process_search_command) 147 | search_output = os.popen(process_search_command).read() 148 | logging.info("Search Process Output: ") 149 | logging.info(search_output) 150 | 151 | 
logging.info( 152 | "Filtering out: Empty Lines, Grep processes, and this DAGs Run." 153 | ) 154 | search_output_filtered = [ 155 | line for line in search_output.split("\n") if line is not None 156 | and line.strip() != "" and ' grep ' not in line 157 | and DAG_ID not in line 158 | ] 159 | logging.info("Search Process Output (with Filter): ") 160 | for line in search_output_filtered: 161 | logging.info(line) 162 | logging.info("") 163 | 164 | logging.info("Searching through running processes...") 165 | airflow_timezone_not_required_versions = ['1.7', '1.8', '1.9'] 166 | processes_to_kill = [] 167 | for line in search_output_filtered: 168 | logging.info("") 169 | process = parse_process_linux_string(line=line) 170 | 171 | logging.info("Checking: " + str(process)) 172 | exec_date_str = (process["airflow_execution_date"]).replace("T", " ") 173 | if '.' not in exec_date_str: 174 | # Add milliseconds if they are missing. 175 | exec_date_str = exec_date_str + '.0' 176 | execution_date_to_search_for = datetime.strptime( 177 | exec_date_str, '%Y-%m-%d %H:%M:%S.%f' 178 | ) 179 | # apache-airflow version >= 1.10 requires datetime field values with 180 | # timezone 181 | if airflow_version[:3] not in airflow_timezone_not_required_versions: 182 | execution_date_to_search_for = pytz.utc.localize( 183 | execution_date_to_search_for 184 | ) 185 | 186 | logging.info( 187 | "Execution Date to Search For: " + 188 | str(execution_date_to_search_for) 189 | ) 190 | 191 | # Checking to make sure the DAG is available and active 192 | if DEBUG: 193 | logging.info("DEBUG: Listing All DagModels: ") 194 | for dag in session.query(DagModel).all(): 195 | logging.info( 196 | "DEBUG: dag: " + str(dag) + ", dag.is_active: " + 197 | str(dag.is_active) 198 | ) 199 | logging.info("") 200 | logging.info( 201 | "Getting dag where DagModel.dag_id == '" + 202 | str(process["airflow_dag_id"]) + "'" 203 | ) 204 | dag = session.query(DagModel).filter( 205 | DagModel.dag_id == process["airflow_dag_id"] 206 | ).first() 207 | logging.info("dag: " + str(dag)) 208 | if dag is None: 209 | kill_reason = "DAG was not found in metastore." 210 | process["kill_reason"] = kill_reason 211 | processes_to_kill.append(process) 212 | logging.warn(kill_reason) 213 | logging.warn("Marking process to be killed.") 214 | continue 215 | logging.info("dag.is_active: " + str(dag.is_active)) 216 | if not dag.is_active: # is the dag active? 217 | kill_reason = "DAG was found to be Disabled." 
218 | process["kill_reason"] = kill_reason 219 | processes_to_kill.append(process) 220 | logging.warn(kill_reason) 221 | logging.warn("Marking process to be killed.") 222 | continue 223 | 224 | # Checking to make sure the DagRun is available and in a running state 225 | if DEBUG: 226 | dag_run_relevant_states = ["queued", "running", "up_for_retry"] 227 | logging.info( 228 | "DEBUG: Listing All Relevant DAG Runs (With State: " + 229 | str(dag_run_relevant_states) + "): " 230 | ) 231 | for dag_run in session.query(DagRun).filter( 232 | DagRun.state.in_(dag_run_relevant_states) 233 | ).all(): 234 | logging.info( 235 | "DEBUG: dag_run: " + str(dag_run) + ", dag_run.state: " + 236 | str(dag_run.state) 237 | ) 238 | logging.info("") 239 | logging.info( 240 | "Getting dag_run where DagRun.dag_id == '" + 241 | str(process["airflow_dag_id"]) + 242 | "' AND DagRun.execution_date == '" + 243 | str(execution_date_to_search_for) + "'" 244 | ) 245 | 246 | dag_run = session.query(DagRun).filter( 247 | and_( 248 | DagRun.dag_id == process["airflow_dag_id"], 249 | DagRun.execution_date == execution_date_to_search_for, 250 | ) 251 | ).first() 252 | 253 | logging.info("dag_run: " + str(dag_run)) 254 | if dag_run is None: 255 | kill_reason = "DAG RUN was not found in metastore." 256 | process["kill_reason"] = kill_reason 257 | processes_to_kill.append(process) 258 | logging.warn(kill_reason) 259 | logging.warn("Marking process to be killed.") 260 | continue 261 | logging.info("dag_run.state: " + str(dag_run.state)) 262 | dag_run_states_required = ["running"] 263 | # is the dag_run in a running state? 264 | if dag_run.state not in dag_run_states_required: 265 | kill_reason = ( 266 | "DAG RUN was found to not be in the states '" + 267 | str(dag_run_states_required) + 268 | "', but rather was in the state '" + str(dag_run.state) + "'." 269 | ) 270 | process["kill_reason"] = kill_reason 271 | processes_to_kill.append(process) 272 | logging.warn(kill_reason) 273 | logging.warn("Marking process to be killed.") 274 | continue 275 | 276 | # Checking to ensure TaskInstance is available and in a running state 277 | if DEBUG: 278 | task_instance_relevant_states = [ 279 | "queued", "running", "up_for_retry" 280 | ] 281 | logging.info( 282 | "DEBUG: Listing All Relevant TaskInstances (With State: " + 283 | str(task_instance_relevant_states) + "): " 284 | ) 285 | for task_instance in session.query(TaskInstance).filter( 286 | TaskInstance.state.in_(task_instance_relevant_states) 287 | ).all(): 288 | logging.info( 289 | "DEBUG: task_instance: " + str(task_instance) + 290 | ", task_instance.state: " + str(task_instance.state) 291 | ) 292 | logging.info("") 293 | logging.info( 294 | "Getting task_instance where TaskInstance.dag_id == '" + 295 | str(process["airflow_dag_id"]) + 296 | "' AND TaskInstance.task_id == '" + 297 | str(process["airflow_task_id"]) + 298 | "' AND TaskInstance.execution_date == '" + 299 | str(execution_date_to_search_for) + "'" 300 | ) 301 | 302 | task_instance = session.query(TaskInstance).filter( 303 | and_( 304 | TaskInstance.dag_id == process["airflow_dag_id"], 305 | TaskInstance.task_id == process["airflow_task_id"], 306 | TaskInstance.execution_date == execution_date_to_search_for, 307 | ) 308 | ).first() 309 | 310 | logging.info("task_instance: " + str(task_instance)) 311 | if task_instance is None: 312 | kill_reason = ( 313 | "Task Instance was not found in metastore. Marking process " 314 | "to be killed." 
315 | ) 316 | process["kill_reason"] = kill_reason 317 | processes_to_kill.append(process) 318 | logging.warn(kill_reason) 319 | logging.warn("Marking process to be killed.") 320 | continue 321 | logging.info("task_instance.state: " + str(task_instance.state)) 322 | task_instance_states_required = ["queued", "running", "up_for_retry"] 323 | # is task_instance queued, running or up for retry? 324 | if task_instance.state not in task_instance_states_required: 325 | kill_reason = ( 326 | "The TaskInstance was found to not be in the states '" + 327 | str(task_instance_states_required) + 328 | "', but rather was in the state '" + 329 | str(task_instance.state) + "'." 330 | ) 331 | process["kill_reason"] = kill_reason 332 | processes_to_kill.append(process) 333 | logging.warn(kill_reason) 334 | logging.warn("Marking process to be killed.") 335 | continue 336 | 337 | # Listing processes that will be killed 338 | logging.info("") 339 | logging.info("Processes Marked to Kill: ") 340 | if len(processes_to_kill) > 0: 341 | for process in processes_to_kill: 342 | logging.info(str(process)) 343 | else: 344 | logging.info("No Processes Marked to Kill Found") 345 | 346 | # Killing the processes 347 | logging.info("") 348 | if ENABLE_KILL: 349 | logging.info("Performing Kill...") 350 | if len(processes_to_kill) > 0: 351 | for process in processes_to_kill: 352 | logging.info("Killing Process: " + str(process)) 353 | kill_command = "kill -9 " + str(process["pid"]) 354 | logging.info("Running Command: " + str(kill_command)) 355 | output = os.popen(kill_command).read() 356 | logging.info("kill output: " + str(output)) 357 | context['ti'].xcom_push( 358 | key='kill_halted_tasks.processes_to_kill', 359 | value=processes_to_kill 360 | ) 361 | logging.info("Finished Performing Kill") 362 | else: 363 | logging.info("No Processes Marked to Kill Found") 364 | else: 365 | logging.warn("You're opted to skip killing the processes!!!") 366 | 367 | logging.info("") 368 | logging.info("Finished Running Cleanup Process") 369 | 370 | 371 | kill_halted_tasks_op = PythonOperator( 372 | task_id='kill_halted_tasks', 373 | python_callable=kill_halted_tasks_function, 374 | provide_context=True, 375 | dag=dag) 376 | 377 | 378 | def branch_function(**context): 379 | logging.info( 380 | "Deciding whether to send an email about tasks that were killed by " + 381 | "this DAG..." 
382 | ) 383 | logging.info( 384 | "SEND_PROCESS_KILLED_EMAIL: '" + 385 | str(SEND_PROCESS_KILLED_EMAIL) + "'" 386 | ) 387 | logging.info( 388 | "PROCESS_KILLED_EMAIL_ADDRESSES: " + 389 | str(PROCESS_KILLED_EMAIL_ADDRESSES) 390 | ) 391 | logging.info("ENABLE_KILL: " + str(ENABLE_KILL)) 392 | 393 | if not SEND_PROCESS_KILLED_EMAIL: 394 | logging.info( 395 | "Skipping sending an email since SEND_PROCESS_KILLED_EMAIL is " + 396 | "set to false" 397 | ) 398 | # False = short circuit the dag and don't execute downstream tasks 399 | return False 400 | if len(PROCESS_KILLED_EMAIL_ADDRESSES) == 0: 401 | logging.info( 402 | "Skipping sending an email since PROCESS_KILLED_EMAIL_ADDRESSES " + 403 | "is empty" 404 | ) 405 | # False = short circuit the dag and don't execute downstream tasks 406 | return False 407 | 408 | processes_to_kill = context['ti'].xcom_pull( 409 | task_ids=kill_halted_tasks_op.task_id, 410 | key='kill_halted_tasks.processes_to_kill' 411 | ) 412 | logging.info("processes_to_kill from xcom_pull: " + str(processes_to_kill)) 413 | if processes_to_kill is not None and len(processes_to_kill) > 0: 414 | logging.info("There were processes to kill") 415 | if ENABLE_KILL: 416 | logging.info("enable_kill is set to true") 417 | logging.info( 418 | "Opting to send an email to alert the users that processes " + 419 | "were killed" 420 | ) 421 | # True = don't short circuit the dag and execute downstream tasks 422 | return True 423 | else: 424 | logging.info("enable_kill is set to False") 425 | else: 426 | logging.info("Processes to kill list was either None or Empty") 427 | 428 | logging.info( 429 | "Opting to skip sending an email since no processes were killed" 430 | ) 431 | # False = short circuit the dag and don't execute downstream tasks 432 | return False 433 | 434 | 435 | email_or_not_branch = ShortCircuitOperator( 436 | task_id="email_or_not_branch", 437 | python_callable=branch_function, 438 | provide_context=True, 439 | dag=dag) 440 | 441 | 442 | send_processes_killed_email = EmailOperator( 443 | task_id="send_processes_killed_email", 444 | to=PROCESS_KILLED_EMAIL_ADDRESSES, 445 | subject=PROCESS_KILLED_EMAIL_SUBJECT, 446 | html_content=""" 447 | 448 | 449 |
<html>
<body>
<p>This is not a failure alert!</p>

<h3>Dag Run Information</h3>
<table>
  <tr><td>ID:</td><td>{{ dag_run.id }}</td></tr>
  <tr><td>DAG ID:</td><td>{{ dag_run.dag_id }}</td></tr>
  <tr><td>Execution Date:</td><td>{{ dag_run.execution_date }}</td></tr>
  <tr><td>Start Date:</td><td>{{ dag_run.start_date }}</td></tr>
  <tr><td>End Date:</td><td>{{ dag_run.end_date }}</td></tr>
  <tr><td>Run ID:</td><td>{{ dag_run.run_id }}</td></tr>
  <tr><td>External Trigger:</td><td>{{ dag_run.external_trigger }}</td></tr>
</table>

<h3>Task Instance Information</h3>
<table>
  <tr><td>Task ID:</td><td>{{ task_instance.task_id }}</td></tr>
  <tr><td>Execution Date:</td><td>{{ task_instance.execution_date }}</td></tr>
  <tr><td>Start Date:</td><td>{{ task_instance.start_date }}</td></tr>
  <tr><td>End Date:</td><td>{{ task_instance.end_date }}</td></tr>
  <tr><td>Host Name:</td><td>{{ task_instance.hostname }}</td></tr>
  <tr><td>Unix Name:</td><td>{{ task_instance.unixname }}</td></tr>
  <tr><td>Job ID:</td><td>{{ task_instance.job_id }}</td></tr>
  <tr><td>Queued Date Time:</td><td>{{ task_instance.queued_dttm }}</td></tr>
  <tr><td>Log URL:</td><td><a href="{{ task_instance.log_url }}">{{ task_instance.log_url }}</a></td></tr>
</table>

<h3>Processes Killed</h3>
<ul>
{% for process_killed in task_instance.xcom_pull(
    task_ids='kill_halted_tasks',
    key='kill_halted_tasks.processes_to_kill'
) %}
  <li>Process {{loop.index}}
    <ul>
    {% for key, value in process_killed.iteritems() %}
      <li>{{ key }}: {{ value }}</li>
    {% endfor %}
    </ul>
  </li>
{% endfor %}
</ul>
</body>
</html>
512 | 513 | 514 | """, 515 | dag=dag) 516 | 517 | 518 | kill_halted_tasks_op.set_downstream(email_or_not_branch) 519 | email_or_not_branch.set_downstream(send_processes_killed_email) 520 | -------------------------------------------------------------------------------- /sla-miss-report/airflow-sla-miss-report.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import json 4 | import os 5 | 6 | import airflow 7 | from airflow import settings 8 | from airflow.models import DAG, DagRun, TaskInstance 9 | from airflow.models.serialized_dag import SerializedDagModel 10 | from airflow.operators.python import PythonOperator 11 | from airflow.utils.email import send_email 12 | from datetime import date, datetime, timedelta 13 | 14 | ################################ 15 | # CONFIGURATIONS 16 | ################################ 17 | 18 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 19 | START_DATE = airflow.utils.dates.days_ago(1) 20 | # How often to Run. @daily - Once a day at Midnight 21 | SCHEDULE_INTERVAL = "@daily" 22 | # Who is listed as the owner of this DAG in the Airflow Web Server 23 | DAG_OWNER = "operations" 24 | # List of email address to send the SLA report & the email subject 25 | EMAIL_ADDRESSES = [] 26 | EMAIL_SUBJECT = f'Airflow SLA Report - {date.today().strftime("%b %d, %Y")}' 27 | # Timeframes to calculate the metrics on in days 28 | SHORT_TIMEFRAME_IN_DAYS = 1 29 | MEDIUM_TIMEFRAME_IN_DAYS = 3 30 | LONG_TIMEFRAME_IN_DAYS = 7 31 | 32 | ################################ 33 | # END CONFIGURATIONS 34 | ################################ 35 | 36 | # Setting up a variable to calculate today's date. 37 | dt = date.today() 38 | today = datetime.combine(dt, datetime.min.time()) 39 | 40 | # Calculating duration intervals between the defined timeframes and today 41 | short_timeframe_start_date = today - timedelta(days=SHORT_TIMEFRAME_IN_DAYS) 42 | medium_timeframe_start_date = today - timedelta(days=MEDIUM_TIMEFRAME_IN_DAYS) 43 | long_timeframe_start_date = today - timedelta(days=LONG_TIMEFRAME_IN_DAYS) 44 | 45 | pd.options.display.max_columns = None 46 | 47 | 48 | def retrieve_metadata(): 49 | """Retrieve data from taskinstance, dagrun and serialized dag tables to do some processing to create base tables. 50 | 51 | Returns: 52 | dataframe: Base tables sla_run_detail and serialized_dags_slas for further processing. 
53 | """ 54 | try: 55 | pd.set_option("display.max_colwidth", None) 56 | 57 | session = settings.Session() 58 | taskinstance = session.query( 59 | TaskInstance.task_id, 60 | TaskInstance.dag_id, 61 | TaskInstance.run_id, 62 | TaskInstance.state, 63 | TaskInstance.start_date, 64 | TaskInstance.end_date, 65 | TaskInstance.duration, 66 | TaskInstance.operator, 67 | TaskInstance.queued_dttm, 68 | ).all() 69 | taskinstance_df = pd.DataFrame(taskinstance) 70 | taskinstance_df["run_date"] = pd.to_datetime(taskinstance_df["start_date"]).dt.date 71 | taskinstance_df["run_date_hour"] = pd.to_datetime(taskinstance_df["start_date"]).dt.hour 72 | taskinstance_df["task_queue_time"] = (taskinstance_df["start_date"] - 73 | taskinstance_df["queued_dttm"]).dt.total_seconds() 74 | taskinstance_df = taskinstance_df[taskinstance_df["task_queue_time"] > 0] 75 | 76 | dagrun = session.query(DagRun.dag_id, DagRun.run_id, DagRun.data_interval_end).all() 77 | dagrun_df = pd.DataFrame(dagrun) 78 | dagrun_df = dagrun_df.rename(columns={"data_interval_end": "actual_start_time"}) 79 | 80 | if "_data" in dir(SerializedDagModel): 81 | serializeddag = session.query(SerializedDagModel._data).all() 82 | data_col = "_data" 83 | else: 84 | serializeddag = session.query(SerializedDagModel.data).all() 85 | data_col = "data" 86 | 87 | serializeddag_df = pd.DataFrame(serializeddag) 88 | serializeddag_json_normalize = pd.json_normalize( 89 | pd.DataFrame(serializeddag_df[data_col].apply(json.dumps).apply(json.loads).values.tolist())["dag"], 90 | "tasks", ["_dag_id"]) 91 | serializeddag_filtered = serializeddag_json_normalize[["_dag_id", "task_id", "sla"]] 92 | serializeddag_filtered = serializeddag_filtered.rename(columns={"_dag_id": "dag_id"}) 93 | serialized_dags_slas = serializeddag_filtered[serializeddag_filtered["sla"].notnull()] 94 | 95 | run_detail = pd.merge( 96 | dagrun_df[["dag_id", "run_id", "actual_start_time"]], 97 | taskinstance_df[[ 98 | "task_id", 99 | "dag_id", 100 | "run_id", 101 | "start_date", 102 | "end_date", 103 | "duration", 104 | "task_queue_time", 105 | "state", 106 | ]], 107 | on=["run_id", "dag_id"], 108 | ) 109 | 110 | sla_run = pd.merge(run_detail, serialized_dags_slas, on=["task_id", "dag_id"]) 111 | sla_run_detail = sla_run.loc[sla_run["sla"].isnull() == False] 112 | sla_run_detail["sla_missed"] = np.where(sla_run_detail["duration"] > sla_run_detail["sla"], 1, 0) 113 | sla_run_detail["run_date_hour"] = pd.to_datetime(sla_run_detail["start_date"]).dt.hour 114 | # sla_run_detail["start_dt"] = sla_run_detail["start_date"].dt.date 115 | sla_run_detail["start_dt"] = sla_run_detail["start_date"].dt.strftime("%A, %b %d") 116 | sla_run_detail["start_date"] = pd.to_datetime(sla_run_detail["start_date"]).dt.tz_localize(None) 117 | 118 | return sla_run_detail, serialized_dags_slas 119 | 120 | except: 121 | no_metadata_found() 122 | 123 | 124 | def sla_miss_count_df(input_df, timeframe): 125 | """Group the data based on dagid and taskid and calculate its count and avg duration 126 | 127 | Args: 128 | input_df (dataframe): sla_run_detail base table 129 | timeframe (integer): Timeframes entered by the user according to which KPI's will be calculated 130 | 131 | Returns: 132 | dataframes: Intermediate output dataframes required for further processing of data 133 | """ 134 | df1 = input_df[input_df["duration"] > input_df["sla"]][input_df["start_date"].between(timeframe, today)] 135 | df2 = df1.groupby(["dag_id", "task_id"]).size().to_frame(name="size").reset_index() 136 | df3 = df1.groupby(["dag_id", 
"task_id"])["duration"].mean().reset_index() 137 | return df2, df3 138 | 139 | 140 | def sla_miss_pct(input_df1, input_df2): 141 | """Calculate SLA miss % 142 | 143 | Args: 144 | input_df1 (dataframe): dataframe consisting of filtered records as per duration and SLA misses grouped by DagId and TaskId 145 | input_df2 (dataframe): dataframe consisting of all the records as per duration and SLA misses grouped by DagId and TaskId 146 | 147 | Returns: 148 | String containing the SLA miss % 149 | """ 150 | 151 | sla_pct = (np.nan_to_num( 152 | ((input_df1["size"].sum() * 100) / (input_df2["total_count"].sum())), 153 | 0, 154 | ).round(2)) 155 | return sla_pct 156 | 157 | 158 | def sla_total_counts_df(input_df): 159 | """Group the data based on dagid and taskid and calculate its count 160 | 161 | Args: 162 | input_df (dataframe): base SLA run table 163 | 164 | Returns: 165 | Dataframe containing the total count of SLA grouped by dag_id and task_id 166 | """ 167 | df = (input_df.groupby(["dag_id", 168 | "task_id"]).size().to_frame(name="total_count").sort_values("total_count", 169 | ascending=False).reset_index()) 170 | return df 171 | 172 | 173 | def sla_run_counts_df(input_df, timeframe): 174 | """Filters the sla_run_detail dataframe between the current date and the timeframe mentioned 175 | 176 | Args: 177 | input_df (dataframe): base SLA run table 178 | 179 | Returns: 180 | dataframe: missed SLAs within provided timeframe 181 | """ 182 | tf = input_df[input_df["start_date"].between(timeframe, today)] 183 | return tf 184 | 185 | 186 | def sla_daily_miss(sla_run_detail): 187 | """SLA miss table which gives us details about the date, SLA miss % on that date and top DAG violators for the long timeframe. 188 | 189 | Args: 190 | sla_run_detail (dataframe): Table consiting of details of all the dag runs that happened 191 | 192 | Returns: 193 | dataframe: sla_daily_miss output dataframe 194 | """ 195 | try: 196 | 197 | sla_pastweek_run_count_df = sla_run_detail[sla_run_detail["start_date"].between( 198 | long_timeframe_start_date, today)] 199 | 200 | daily_sla_miss_count = sla_run_detail[sla_run_detail["duration"] > sla_run_detail["sla"]][ 201 | sla_run_detail["start_date"].between(long_timeframe_start_date, today)].sort_values(["start_date"]) 202 | 203 | daily_sla_miss_count_datewise = (daily_sla_miss_count.groupby( 204 | ["start_dt"]).size().to_frame(name="slamiss_count_datewise").reset_index()) 205 | daily_sla_count_df = (daily_sla_miss_count.groupby(["start_dt", "dag_id", 206 | "task_id"]).size().to_frame(name="size").reset_index()) 207 | daily_sla_totalcount_datewise = (sla_pastweek_run_count_df.groupby( 208 | ["start_dt"]).size().to_frame(name="total_count").sort_values("start_dt", ascending=False).reset_index()) 209 | daily_sla_totalcount_datewise_taskwise = (sla_pastweek_run_count_df.groupby( 210 | ["start_dt", "dag_id", 211 | "task_id"]).size().to_frame(name="totalcount").sort_values("start_dt", ascending=False).reset_index()) 212 | daily_sla_miss_pct_df = pd.merge(daily_sla_miss_count_datewise, daily_sla_totalcount_datewise, on=["start_dt"]) 213 | daily_sla_miss_pct_df["sla_miss_percent"] = (daily_sla_miss_pct_df["slamiss_count_datewise"] * 100 / 214 | daily_sla_miss_pct_df["total_count"]).round(2) 215 | daily_sla_miss_pct_df["sla_miss_percent(missed_tasks/total_tasks)"] = daily_sla_miss_pct_df.apply( 216 | lambda x: "%s%s(%s/%s)" % (x["sla_miss_percent"], "% ", x["slamiss_count_datewise"], x["total_count"]), 217 | axis=1, 218 | ) 219 | 220 | daily_sla_miss_percent = 
daily_sla_miss_pct_df.filter( 221 | ["start_dt", "sla_miss_percent(missed_tasks/total_tasks)"], axis=1) 222 | daily_sla_miss_df_pct1 = pd.merge( 223 | daily_sla_count_df, 224 | daily_sla_totalcount_datewise_taskwise, 225 | on=["start_dt", "dag_id", "task_id"], 226 | ) 227 | daily_sla_miss_df_pct1["pct_violator"] = (daily_sla_miss_df_pct1["size"] * 100 / 228 | daily_sla_miss_df_pct1["totalcount"]).round(2) 229 | daily_sla_miss_df_pct_kpi = (daily_sla_miss_df_pct1.sort_values("pct_violator", 230 | ascending=False).groupby("start_dt", 231 | sort=False).head(1)) 232 | 233 | daily_sla_miss_df_pct_kpi["top_pct_violator"] = daily_sla_miss_df_pct_kpi.apply( 234 | lambda x: "%s: %s (%s%s" % (x["dag_id"], x["task_id"], x["pct_violator"], "%)"), 235 | axis=1, 236 | ) 237 | 238 | daily_slamiss_percent_violator = daily_sla_miss_df_pct_kpi.filter(["start_dt", "top_pct_violator"], axis=1) 239 | daily_slamiss_df_absolute_kpi = (daily_sla_miss_df_pct1.sort_values("size", ascending=False).groupby( 240 | "start_dt", sort=False).head(1)) 241 | 242 | daily_slamiss_df_absolute_kpi["top_absolute_violator"] = daily_slamiss_df_absolute_kpi.apply( 243 | lambda x: "%s: %s (%s/%s)" % (x["dag_id"], x["task_id"], x["size"], x["totalcount"]), 244 | axis=1, 245 | ) 246 | 247 | daily_slamiss_absolute_violator = daily_slamiss_df_absolute_kpi.filter(["start_dt", "top_absolute_violator"], 248 | axis=1) 249 | daily_slamiss_pct_last7days = pd.merge( 250 | pd.merge(daily_sla_miss_percent, daily_slamiss_percent_violator, on="start_dt"), 251 | daily_slamiss_absolute_violator, 252 | on="start_dt", 253 | ).sort_values("start_dt", ascending=False) 254 | 255 | daily_slamiss_pct_last7days = daily_slamiss_pct_last7days.rename( 256 | columns={ 257 | "top_pct_violator": "Top Violator (%)", 258 | "top_absolute_violator": "Top Violator (absolute)", 259 | "start_dt": "Date", 260 | "sla_miss_percent(missed_tasks/total_tasks)": "SLA Miss % (Missed/Total Tasks)", 261 | }) 262 | return daily_slamiss_pct_last7days 263 | except: 264 | daily_slamiss_pct_last7days = pd.DataFrame( 265 | columns=["Date", "SLA Miss % (Missed/Total Tasks)", "Top Violator (%)", "Top Violator (absolute)"]) 266 | return daily_slamiss_pct_last7days 267 | 268 | 269 | def sla_hourly_miss(sla_run_detail): 270 | """Generate hourly SLA miss table giving us details about the hour, SLA miss % for that hour, top DAG violators 271 | and the longest running task and avg task queue time for the given short timeframe. 
272 | 273 | Args: 274 | sla_run_detail (dataframe): Base table consiting of details of all the dag runs that happened 275 | 276 | Returns: 277 | datframe, list: observations_hourly_reccomendations list and sla_miss_percent_past_day_hourly dataframe 278 | """ 279 | try: 280 | 281 | sla_miss_count_past_day = sla_run_detail[sla_run_detail["duration"] > sla_run_detail["sla"]][ 282 | sla_run_detail["start_date"].between(short_timeframe_start_date, today)] 283 | 284 | sla_miss_count_hourly = (sla_miss_count_past_day.groupby( 285 | ["run_date_hour"]).size().to_frame(name="slamiss_count_hourwise").reset_index()) 286 | sla_count_df_past_day_hourly = (sla_miss_count_past_day.groupby(["run_date_hour", "dag_id", "task_id" 287 | ]).size().to_frame(name="size").reset_index()) 288 | sla_avg_execution_time_taskwise_hourly = (sla_miss_count_past_day.groupby( 289 | ["run_date_hour", "dag_id", "task_id"])["duration"].mean().reset_index()) 290 | sla_avg_execution_time_hourly = (sla_avg_execution_time_taskwise_hourly.sort_values( 291 | "duration", ascending=False).groupby("run_date_hour", sort=False).head(1)) 292 | 293 | sla_pastday_run_count_df = sla_run_detail[sla_run_detail["start_date"].between( 294 | short_timeframe_start_date, today)] 295 | sla_avg_queue_time_hourly = (sla_pastday_run_count_df.groupby(["run_date_hour" 296 | ])["task_queue_time"].mean().reset_index()) 297 | sla_totalcount_hourly = (sla_pastday_run_count_df.groupby( 298 | ["run_date_hour"]).size().to_frame(name="total_count").sort_values("run_date_hour", 299 | ascending=False).reset_index()) 300 | sla_totalcount_taskwise_hourly = (sla_pastday_run_count_df.groupby( 301 | ["run_date_hour", "dag_id", 302 | "task_id"]).size().to_frame(name="totalcount").sort_values("run_date_hour", ascending=False).reset_index()) 303 | sla_miss_pct_past_day_hourly = pd.merge(sla_miss_count_hourly, sla_totalcount_hourly, on=["run_date_hour"]) 304 | sla_miss_pct_past_day_hourly["sla_miss_percent"] = (sla_miss_pct_past_day_hourly["slamiss_count_hourwise"] * 305 | 100 / sla_miss_pct_past_day_hourly["total_count"]).round(2) 306 | 307 | sla_miss_pct_past_day_hourly["sla_miss_percent(missed_tasks/total_tasks)"] = sla_miss_pct_past_day_hourly.apply( 308 | lambda x: "%s%s(%s/%s)" % ( 309 | x["sla_miss_percent"].astype(int), 310 | "% ", 311 | x["slamiss_count_hourwise"].astype(int), 312 | x["total_count"].astype(int), 313 | ), 314 | axis=1, 315 | ) 316 | 317 | sla_highest_sla_miss_hour = (sla_miss_pct_past_day_hourly[["run_date_hour", "sla_miss_percent" 318 | ]].sort_values("sla_miss_percent", 319 | ascending=False).head(1)) 320 | sla_highest_tasks_hour = (sla_miss_pct_past_day_hourly[["run_date_hour", 321 | "total_count"]].sort_values("total_count", 322 | ascending=False).head(1)) 323 | 324 | sla_miss_percent_past_day = sla_miss_pct_past_day_hourly.filter( 325 | ["run_date_hour", "sla_miss_percent(missed_tasks/total_tasks)"], axis=1) 326 | 327 | sla_miss_temp_df_pct1_past_day = pd.merge( 328 | sla_count_df_past_day_hourly, 329 | sla_totalcount_taskwise_hourly, 330 | on=["run_date_hour", "dag_id", "task_id"], 331 | ) 332 | 333 | sla_miss_temp_df_pct1_past_day["pct_violator"] = (sla_miss_temp_df_pct1_past_day["size"] * 100 / 334 | sla_miss_temp_df_pct1_past_day["totalcount"]).round(2) 335 | sla_miss_pct_past_day_hourly = (sla_miss_temp_df_pct1_past_day.sort_values( 336 | "pct_violator", ascending=False).groupby("run_date_hour", sort=False).head(1)) 337 | 338 | sla_miss_pct_past_day_hourly["top_pct_violator"] = sla_miss_pct_past_day_hourly.apply( 339 | lambda x: "%s: %s 
(%s%s" % (x["dag_id"], x["task_id"], x["pct_violator"], "%)"), 340 | axis=1, 341 | ) 342 | 343 | sla_miss_percent_violator_past_day_hourly = sla_miss_pct_past_day_hourly.filter( 344 | ["run_date_hour", "top_pct_violator"], axis=1) 345 | sla_miss_absolute_kpi_past_day_hourly = (sla_miss_temp_df_pct1_past_day.sort_values( 346 | "size", ascending=False).groupby("run_date_hour", sort=False).head(1)) 347 | sla_miss_absolute_kpi_past_day_hourly["top_absolute_violator"] = sla_miss_absolute_kpi_past_day_hourly.apply( 348 | lambda x: "%s: %s (%s/%s)" % (x["dag_id"], x["task_id"], x["size"], x["totalcount"]), 349 | axis=1, 350 | ) 351 | 352 | sla_miss_absolute_violator_past_day_hourly = sla_miss_absolute_kpi_past_day_hourly.filter( 353 | ["run_date_hour", "top_absolute_violator"], axis=1) 354 | slamiss_pct_exectime = pd.merge( 355 | pd.merge( 356 | sla_miss_percent_past_day, 357 | sla_miss_percent_violator_past_day_hourly, 358 | on="run_date_hour", 359 | ), 360 | sla_miss_absolute_violator_past_day_hourly, 361 | on="run_date_hour", 362 | ).sort_values("run_date_hour", ascending=False) 363 | 364 | sla_avg_execution_time_hourly["duration"] = ( 365 | sla_avg_execution_time_hourly["duration"].round(0).astype(int).astype(str)) 366 | sla_avg_execution_time_hourly["longest_running_task"] = sla_avg_execution_time_hourly.apply( 367 | lambda x: "%s: %s (%ss)" % (x["dag_id"], x["task_id"], x["duration"]), axis=1) 368 | 369 | sla_longest_running_task_hourly = sla_avg_execution_time_hourly.filter( 370 | ["run_date_hour", "longest_running_task"], axis=1) 371 | 372 | sla_miss_pct = pd.merge(slamiss_pct_exectime, sla_longest_running_task_hourly, on=["run_date_hour"]) 373 | sla_miss_percent_past_day_hourly = pd.merge(sla_miss_pct, sla_avg_queue_time_hourly, on=["run_date_hour"]) 374 | sla_miss_percent_past_day_hourly["task_queue_time"] = ( 375 | sla_miss_percent_past_day_hourly["task_queue_time"].round(0).astype(int).apply(str)) 376 | sla_longest_queue_time_hourly = (sla_miss_percent_past_day_hourly[["run_date_hour", "task_queue_time" 377 | ]].sort_values("task_queue_time", 378 | ascending=False).head(1)) 379 | 380 | sla_miss_percent_past_day_hourly.rename( 381 | columns={ 382 | "task_queue_time": "Average Task Queue Time (s)", 383 | "longest_running_task": "Longest Running Task", 384 | "top_pct_violator": "Top Violator (%)", 385 | "top_absolute_violator": "Top Violator (absolute)", 386 | "run_date_hour": "Hour", 387 | "sla_miss_percent(missed_tasks/total_tasks)": "SLA miss % (Missed/Total Tasks)", 388 | }, 389 | inplace=True, 390 | ) 391 | 392 | obs1_hourlytrend = "Hour " + (sla_highest_sla_miss_hour["run_date_hour"].apply(str) + 393 | " had the highest percentage of SLA misses").to_string(index=False) 394 | obs2_hourlytrend = "Hour " + ( 395 | sla_longest_queue_time_hourly["run_date_hour"].apply(str) + " had the longest average queue time (" + 396 | sla_longest_queue_time_hourly["task_queue_time"].apply(str) + " seconds)").to_string(index=False) 397 | obs3_hourlytrend = "Hour " + (sla_highest_tasks_hour["run_date_hour"].apply(str) + 398 | " had the most tasks running").to_string(index=False) 399 | 400 | observations_hourly_reccomendations = [obs1_hourlytrend, obs2_hourlytrend, obs3_hourlytrend] 401 | return observations_hourly_reccomendations, sla_miss_percent_past_day_hourly 402 | except: 403 | sla_miss_percent_past_day_hourly = pd.DataFrame(columns=[ 404 | "SLA Miss % (Missed/Total Tasks)", 405 | "Top Violator (%)", 406 | "Top Violator (absolute)", 407 | "Longest Running Task", 408 | "Hour", 409 | "Average Task 
Queue Time (seconds)", 410 | ]) 411 | observations_hourly_reccomendations = "" 412 | return observations_hourly_reccomendations, sla_miss_percent_past_day_hourly 413 | 414 | 415 | def sla_dag_miss(sla_run_detail, serialized_dags_slas): 416 | """ 417 | Generate SLA dag miss table giving us details about the SLA miss % for the given timeframes along with the average execution time and 418 | reccomendations for weekly observations. 419 | 420 | Args: 421 | sla_run_detail (dataframe): Base table consiting of details of all the dag runs that happened 422 | serialized_dags_slas (dataframe): table consisting of all the dag details 423 | 424 | Returns: 425 | 2 lists consisting of sla_daily_miss and sla_dag_miss reccomendations and 1 dataframe consisting of sla_dag_miss reccomendation 426 | """ 427 | try: 428 | 429 | dag_sla_count_df_weekprior, dag_sla_count_df_weekprior_avgduration = sla_miss_count_df( 430 | sla_run_detail, long_timeframe_start_date) 431 | dag_sla_count_df_threedayprior, dag_sla_count_df_threedayprior_avgduration = sla_miss_count_df( 432 | sla_run_detail, medium_timeframe_start_date) 433 | dag_sla_count_df_onedayprior, dag_sla_count_df_onedayprior_avgduration = sla_miss_count_df( 434 | sla_run_detail, short_timeframe_start_date) 435 | 436 | dag_sla_run_count_week_prior = sla_run_counts_df(sla_run_detail, long_timeframe_start_date) 437 | dag_sla_run_count_three_day_prior = sla_run_counts_df(sla_run_detail, medium_timeframe_start_date) 438 | dag_sla_run_count_one_day_prior = sla_run_counts_df(sla_run_detail, short_timeframe_start_date) 439 | 440 | dag_sla_run_count_week_prior_success = ( 441 | dag_sla_run_count_week_prior[dag_sla_run_count_week_prior["state"] == "success"].groupby( 442 | ["dag_id", "task_id"]).size().to_frame(name="success_count").reset_index()) 443 | dag_sla_run_count_week_prior_failure = ( 444 | dag_sla_run_count_week_prior[dag_sla_run_count_week_prior["state"] == "failed"].groupby( 445 | ["dag_id", "task_id"]).size().to_frame(name="failure_count").reset_index()) 446 | 447 | dag_sla_run_count_week_prior_success_duration_stats = ( 448 | dag_sla_run_count_week_prior[dag_sla_run_count_week_prior["state"] == "success"].groupby( 449 | ["dag_id", "task_id"])["duration"].agg(["mean", "min", "max"]).reset_index()) 450 | dag_sla_run_count_week_prior_failure_duration_stats = ( 451 | dag_sla_run_count_week_prior[dag_sla_run_count_week_prior["state"] == "failed"].groupby( 452 | ["dag_id", "task_id"])["duration"].agg(["mean", "min", "max"]).reset_index()) 453 | 454 | dag_sla_totalcount_week_prior = sla_total_counts_df(dag_sla_run_count_week_prior) 455 | dag_sla_totalcount_three_day_prior = sla_total_counts_df(dag_sla_run_count_three_day_prior) 456 | dag_sla_totalcount_one_day_prior = sla_total_counts_df(dag_sla_run_count_one_day_prior) 457 | 458 | dag_obs5_sladpercent_weekprior = sla_miss_pct(dag_sla_count_df_weekprior, dag_sla_totalcount_week_prior) 459 | dag_obs6_sladpercent_threedayprior = sla_miss_pct(dag_sla_count_df_threedayprior, 460 | dag_sla_totalcount_three_day_prior) 461 | dag_obs7_sladpercent_onedayprior = sla_miss_pct(dag_sla_count_df_onedayprior, dag_sla_totalcount_one_day_prior) 462 | 463 | dag_obs7_sladetailed_week = f'In the past {str(LONG_TIMEFRAME_IN_DAYS)} days, {dag_obs5_sladpercent_weekprior}% of the tasks have missed their SLA' 464 | dag_obs6_sladetailed_threeday = f'In the past {str(MEDIUM_TIMEFRAME_IN_DAYS)} days, {dag_obs6_sladpercent_threedayprior}% of the tasks have missed their SLA' 465 | dag_obs5_sladetailed_oneday = f'In the past 
{str(SHORT_TIMEFRAME_IN_DAYS)} days, {dag_obs7_sladpercent_onedayprior}% of the tasks have missed their SLA' 466 | 467 | dag_sla_miss_pct_df_week_prior = pd.merge( 468 | pd.merge(dag_sla_count_df_weekprior, dag_sla_totalcount_week_prior, on=["dag_id", "task_id"]), 469 | dag_sla_count_df_weekprior_avgduration, 470 | on=["dag_id", "task_id"], 471 | ) 472 | dag_sla_miss_pct_df_threeday_prior = pd.merge( 473 | pd.merge( 474 | dag_sla_count_df_threedayprior, 475 | dag_sla_totalcount_three_day_prior, 476 | on=["dag_id", "task_id"], 477 | ), 478 | dag_sla_count_df_threedayprior_avgduration, 479 | on=["dag_id", "task_id"], 480 | ) 481 | dag_sla_miss_pct_df_oneday_prior = pd.merge( 482 | pd.merge( 483 | dag_sla_count_df_onedayprior, 484 | dag_sla_totalcount_one_day_prior, 485 | on=["dag_id", "task_id"], 486 | ), 487 | dag_sla_count_df_onedayprior_avgduration, 488 | on=["dag_id", "task_id"], 489 | ) 490 | 491 | dag_sla_miss_pct_df_week_prior["sla_miss_percent_week"] = ( 492 | dag_sla_miss_pct_df_week_prior["size"] * 100 / dag_sla_miss_pct_df_week_prior["total_count"]).round(2) 493 | dag_sla_miss_pct_df_threeday_prior["sla_miss_percent_three_day"] = ( 494 | dag_sla_miss_pct_df_threeday_prior["size"] * 100 / 495 | dag_sla_miss_pct_df_threeday_prior["total_count"]).round(2) 496 | dag_sla_miss_pct_df_oneday_prior["sla_miss_percent_one_day"] = ( 497 | dag_sla_miss_pct_df_oneday_prior["size"] * 100 / dag_sla_miss_pct_df_oneday_prior["total_count"]).round(2) 498 | 499 | dag_sla_miss_pct_df1 = dag_sla_miss_pct_df_week_prior.merge(dag_sla_miss_pct_df_threeday_prior, 500 | on=["dag_id", "task_id"], 501 | how="left") 502 | dag_sla_miss_pct_df2 = dag_sla_miss_pct_df1.merge(dag_sla_miss_pct_df_oneday_prior, 503 | on=["dag_id", "task_id"], 504 | how="left") 505 | dag_sla_miss_pct_df3 = dag_sla_miss_pct_df2.merge(serialized_dags_slas, on=["dag_id", "task_id"], how="left") 506 | 507 | dag_sla_miss_pct_detailed = dag_sla_miss_pct_df3.filter( 508 | [ 509 | "dag_id", 510 | "task_id", 511 | "sla", 512 | "sla_miss_percent_week", 513 | "duration_x", 514 | "sla_miss_percent_three_day", 515 | "duration_y", 516 | "sla_miss_percent_one_day", 517 | "duration", 518 | ], 519 | axis=1, 520 | ) 521 | 522 | float_column_names = dag_sla_miss_pct_detailed.select_dtypes(float).columns 523 | dag_sla_miss_pct_detailed[float_column_names] = dag_sla_miss_pct_detailed[float_column_names].fillna(0) 524 | 525 | round_int_column_names = ["duration_x", "duration_y", "duration"] 526 | dag_sla_miss_pct_detailed[round_int_column_names] = dag_sla_miss_pct_detailed[round_int_column_names].round( 527 | 0).astype(int) 528 | dag_sla_miss_pct_detailed["sla"] = dag_sla_miss_pct_detailed["sla"].astype(int) 529 | dag_sla_miss_pct_detailed["Dag: Task"] = (dag_sla_miss_pct_detailed["dag_id"].apply(str) + ": " + 530 | dag_sla_miss_pct_detailed["task_id"].apply(str)) 531 | 532 | short_timeframe_col_name = f'{SHORT_TIMEFRAME_IN_DAYS}-day SLA Miss % (avg execution time)' 533 | medium_timeframe_col_name = f'{MEDIUM_TIMEFRAME_IN_DAYS}-day SLA Miss % (avg execution time)' 534 | long_timeframe_col_name = f'{LONG_TIMEFRAME_IN_DAYS}-day SLA Miss % (avg execution time)' 535 | 536 | dag_sla_miss_pct_detailed[short_timeframe_col_name] = ( 537 | dag_sla_miss_pct_detailed["sla_miss_percent_one_day"].apply(str) + "% (" + 538 | dag_sla_miss_pct_detailed["duration"].apply(str) + "s)") 539 | 540 | dag_sla_miss_pct_detailed[medium_timeframe_col_name] = ( 541 | dag_sla_miss_pct_detailed["sla_miss_percent_three_day"].apply(str) + "% (" + 542 | 
dag_sla_miss_pct_detailed["duration_y"].apply(str) + "s)") 543 | 544 | dag_sla_miss_pct_detailed[long_timeframe_col_name] = ( 545 | dag_sla_miss_pct_detailed["sla_miss_percent_week"].apply(str) + "% (" + 546 | dag_sla_miss_pct_detailed["duration_x"].apply(str) + "s)") 547 | 548 | dag_sla_miss_pct_filtered = dag_sla_miss_pct_detailed.filter( 549 | [ 550 | "Dag: Task", 551 | "sla", 552 | short_timeframe_col_name, 553 | medium_timeframe_col_name, 554 | long_timeframe_col_name, 555 | ], 556 | axis=1, 557 | ).sort_values(by=[long_timeframe_col_name], ascending=False) 558 | 559 | dag_sla_miss_pct_filtered.rename(columns={"sla": "Current SLA (s)"}, inplace=True) 560 | 561 | dag_sla_miss_pct_recc1 = dag_sla_miss_pct_detailed.nlargest(3, ["sla_miss_percent_week"]).fillna(0) 562 | dag_sla_miss_pct_recc2 = dag_sla_miss_pct_recc1.filter( 563 | ["dag_id", "task_id", "sla", "sla_miss_percent_week", "Dag: Task"], axis=1).fillna(0) 564 | dag_sla_miss_pct_df4_recc3 = pd.merge( 565 | pd.merge( 566 | dag_sla_miss_pct_recc2, 567 | dag_sla_run_count_week_prior_success, 568 | on=["dag_id", "task_id"], 569 | ), 570 | dag_sla_run_count_week_prior_failure, 571 | on=["dag_id", "task_id"], 572 | how="left", 573 | ).fillna(0) 574 | dag_sla_miss_pct_df4_recc4 = pd.merge( 575 | pd.merge( 576 | dag_sla_miss_pct_df4_recc3, 577 | dag_sla_run_count_week_prior_success_duration_stats, 578 | on=["dag_id", "task_id"], 579 | how="left", 580 | ), 581 | dag_sla_run_count_week_prior_failure_duration_stats, 582 | on=["dag_id", "task_id"], 583 | how="left", 584 | ).fillna(0) 585 | dag_sla_miss_pct_df4_recc4["Recommendations"] = ( 586 | dag_sla_miss_pct_df4_recc4["Dag: Task"].apply(str) + " - Of the " + 587 | dag_sla_miss_pct_df4_recc4["sla_miss_percent_week"].apply(str) + 588 | "% of the tasks that missed their SLA of " + dag_sla_miss_pct_df4_recc4["sla"].apply(str) + " seconds, " + 589 | dag_sla_miss_pct_df4_recc4["success_count"].astype(int).apply(str) + " succeeded (min: " + 590 | dag_sla_miss_pct_df4_recc4["min_x"].round(0).astype(int).apply(str) + "s, avg: " + 591 | dag_sla_miss_pct_df4_recc4["mean_x"].round(0).astype(int).apply(str) + "s, max: " + 592 | dag_sla_miss_pct_df4_recc4["max_x"].round(0).astype(int).apply(str) + "s) & " + 593 | dag_sla_miss_pct_df4_recc4["failure_count"].astype(int).apply(str) + " failed (min: " + 594 | dag_sla_miss_pct_df4_recc4["min_y"].round(0).astype(int).apply(str) + "s, avg: " + 595 | dag_sla_miss_pct_df4_recc4["mean_y"].round(0).astype(int).apply(str) + "s, max: " + 596 | dag_sla_miss_pct_df4_recc4["max_y"].round(0).fillna(0).astype(int).apply(str) + "s)") 597 | 598 | daily_weeklytrend_observations_loop = [ 599 | dag_obs5_sladetailed_oneday, 600 | dag_obs6_sladetailed_threeday, 601 | dag_obs7_sladetailed_week, 602 | ] 603 | 604 | dag_sla_miss_trend = dag_sla_miss_pct_df4_recc4["Recommendations"].tolist() 605 | 606 | return daily_weeklytrend_observations_loop, dag_sla_miss_trend, dag_sla_miss_pct_filtered 607 | except: 608 | short_timeframe_col_name = f'{SHORT_TIMEFRAME_IN_DAYS}-Day SLA miss % (avg execution time)' 609 | medium_timeframe_col_name = f'{MEDIUM_TIMEFRAME_IN_DAYS}-Day SLA miss % (avg execution time)' 610 | long_timeframe_col_name = f'{LONG_TIMEFRAME_IN_DAYS}-Day SLA miss % (avg execution time)' 611 | daily_weeklytrend_observations_loop = "" 612 | dag_sla_miss_trend = "" 613 | dag_sla_miss_pct_filtered = pd.DataFrame(columns=[ 614 | "Dag: Task", 615 | "Current SLA", 616 | short_timeframe_col_name, 617 | medium_timeframe_col_name, 618 | long_timeframe_col_name, 619 | ]) 620 | return 
daily_weeklytrend_observations_loop, dag_sla_miss_trend, dag_sla_miss_pct_filtered 621 | 622 | 623 | def sla_miss_report(): 624 | """Embed all the resulting output dataframes in HTML format and send the email report to the intended recipients.""" 625 | 626 | sla_run_detail, serialized_dags_slas = retrieve_metadata() 627 | daily_slamiss_pct_last7days = sla_daily_miss(sla_run_detail) 628 | observations_hourly_reccomendations, sla_miss_percent_past_day_hourly = sla_hourly_miss(sla_run_detail) 629 | daily_weeklytrend_observations_loop, dag_sla_miss_trend, dag_sla_miss_pct_filtered = sla_dag_miss( 630 | sla_run_detail, serialized_dags_slas) 631 | 632 | new_line = '\n' 633 | print(f""" 634 | ------------------- START OF REPORT ------------------- 635 | {EMAIL_SUBJECT} 636 | 637 | Daily SLA Misses 638 | {new_line.join(map(str, daily_weeklytrend_observations_loop))} 639 | 640 | {daily_slamiss_pct_last7days.to_markdown(index=False)} 641 | 642 | Hourly SLA Misses 643 | {new_line.join(map(str, observations_hourly_reccomendations))} 644 | 645 | {sla_miss_percent_past_day_hourly.to_markdown(index=False)} 646 | 647 | DAG SLA Misses 648 | {new_line.join(map(str, dag_sla_miss_trend))} 649 | 650 | {dag_sla_miss_pct_filtered.to_markdown(index=False)} 651 | 652 | ------------------- END OF REPORT ------------------- 653 | """) 654 | 655 | daily_weeklytrend_observations_loop = "".join([f"<li>{item}</li>" for item in daily_weeklytrend_observations_loop])  # wrap each observation in an HTML list item for the email body 656 | observations_hourly_reccomendations = "".join([f"<li>{item}</li>" for item in observations_hourly_reccomendations]) 657 | dag_sla_miss_trend = "".join([f"<li>{item}</li>" for item in dag_sla_miss_trend]) 658 | 659 | short_timeframe_print = f'Short: {SHORT_TIMEFRAME_IN_DAYS}d ({short_timeframe_start_date.strftime("%b %d")} - {(today - timedelta(days=1)).strftime("%b %d")})' 660 | medium_timeframe_print = f'Medium: {MEDIUM_TIMEFRAME_IN_DAYS}d ({medium_timeframe_start_date.strftime("%b %d")} - {(today - timedelta(days=1)).strftime("%b %d")})' 661 | long_timeframe_print = f'Long: {LONG_TIMEFRAME_IN_DAYS}d ({long_timeframe_start_date.strftime("%b %d")} - {(today - timedelta(days=1)).strftime("%b %d")})' 662 | timeframe_prints = f'{short_timeframe_print} | {medium_timeframe_print} | {long_timeframe_print}' 663 | 664 | html_content = f"""\ 665 | <html> 666 | <head> 667 | 692 | </head> 693 | <body> 694 | <p>The following timeframes are used to generate this report. To change them, update the [SHORT, MEDIUM, LONG]_TIMEFRAME_IN_DAYS variables in airflow-sla-miss-report.py.</p> 695 | <h4>
696 | {timeframe_prints} 697 | </h4> 698 | <h2>Daily SLA Misses</h2> 699 | <p>Daily breakdown of SLA misses and the worst offenders over the past {LONG_TIMEFRAME_IN_DAYS} day(s).</p> 700 | <ul>{daily_weeklytrend_observations_loop}</ul> 701 | {daily_slamiss_pct_last7days.to_html(index=False)} 702 | 703 | 704 | <h2>Hourly SLA Misses</h2> 705 | <p>Hourly breakdown of tasks missing their SLAs and the worst offenders over the past {SHORT_TIMEFRAME_IN_DAYS} day(s). Useful for identifying scheduling bottlenecks.</p> 706 | <ul>{observations_hourly_reccomendations}</ul> 707 | {sla_miss_percent_past_day_hourly.to_html(index=False)} 708 | 709 | <h2>DAG SLA Misses</h2> 710 | <p>Task level breakdown showcasing the SLA miss percentage & average execution time over the past {SHORT_TIMEFRAME_IN_DAYS}, {MEDIUM_TIMEFRAME_IN_DAYS}, and {LONG_TIMEFRAME_IN_DAYS} day(s). Useful for identifying trends and updating defined SLAs to meet actual execution times.</p>
711 | <ul>{dag_sla_miss_trend}</ul> 712 | {dag_sla_miss_pct_filtered.to_html(index=False)} 713 | 714 | </body> 715 | </html> 716 | """ 717 | if EMAIL_ADDRESSES: 718 | send_email(to=EMAIL_ADDRESSES, subject=EMAIL_SUBJECT, html_content=html_content) 719 | 720 | 721 | def no_metadata_found(): 722 | """Stock HTML email template to send if there is no data present in the base tables.""" 723 | 724 | print("No Data Available. Check data is present in the airflow metadata database.") 725 | 726 | html_content = f"""\ 727 | <html> 728 | <body> 729 | <h2>No Data Available</h2> 730 | <p>Check data is present in the airflow metadata database.</p> 731 | </body> 732 | </html> 733 | """ 734 | if EMAIL_ADDRESSES: 735 | send_email(to=EMAIL_ADDRESSES, subject=EMAIL_SUBJECT, html_content=html_content) 736 | 737 | 738 | default_args = { 739 | 'owner': DAG_OWNER, 740 | 'depends_on_past': False, 741 | 'email': EMAIL_ADDRESSES, 742 | 'email_on_failure': True, 743 | 'email_on_retry': False, 744 | 'start_date': START_DATE, 745 | 'retries': 1, 746 | 'retry_delay': timedelta(minutes=5), 747 | } 748 | 749 | with DAG(DAG_ID, 750 | default_args=default_args, 751 | description="DAG generating the SLA miss report", 752 | schedule_interval=SCHEDULE_INTERVAL, 753 | start_date=START_DATE, 754 | tags=['teamclairvoyant', 'airflow-maintenance-dags']) as dag: 755 | sla_miss_report_task = PythonOperator(task_id="sla_miss_report", python_callable=sla_miss_report, dag=dag) 756 | --------------------------------------------------------------------------------
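
Note: the following is a minimal, self-contained sketch (not part of airflow-sla-miss-report.py) illustrating how the SLA miss percentage used throughout sla_dag_miss() is derived: SLA misses are counted per (dag_id, task_id), divided by the total run count, multiplied by 100, and rounded to two decimals. The DataFrame contents and the sla_missed flag below are hypothetical stand-ins; the real report builds its miss and total counts from the Airflow metadata tables via sla_miss_count_df() and sla_total_counts_df().

    import pandas as pd

    # Hypothetical run history: one row per task instance, flagged if it missed its SLA.
    runs = pd.DataFrame({
        "dag_id": ["etl", "etl", "etl", "etl", "report"],
        "task_id": ["load", "load", "load", "load", "send"],
        "sla_missed": [True, False, True, False, False],
    })

    # SLA misses per (dag_id, task_id), mirroring the .groupby(...).size() aggregations in the DAG.
    misses = (runs[runs["sla_missed"]].groupby(["dag_id", "task_id"]).size()
              .to_frame(name="size").reset_index())

    # Total runs per (dag_id, task_id).
    totals = runs.groupby(["dag_id", "task_id"]).size().to_frame(name="total_count").reset_index()

    # Same formula as the report: size * 100 / total_count, rounded to two decimals.
    pct = pd.merge(misses, totals, on=["dag_id", "task_id"], how="right").fillna(0)
    pct["sla_miss_percent"] = (pct["size"] * 100 / pct["total_count"]).round(2)
    print(pct)  # etl/load -> 50.0, report/send -> 0.0

Dividing miss counts by total run counts (rather than reporting raw miss counts) is what lets the report compare tasks with very different run frequencies on a single percentage scale.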