├── backup-configs ├── README.md └── airflow-backup-configs.py ├── delete-broken-dags ├── README.md └── airflow-delete-broken-dags.py ├── kill-halted-tasks ├── README.md └── airflow-kill-halted-tasks.py ├── clear-missing-dags ├── README.md └── airflow-clear-missing-dags.py ├── .gitignore ├── db-cleanup ├── README.md └── airflow-db-cleanup.py ├── README.md ├── log-cleanup ├── README.md ├── airflow-log-cleanup-pwdless-ssh.py └── airflow-log-cleanup.py ├── sla-miss-report ├── README.md └── airflow-sla-miss-report.py └── LICENSE /backup-configs/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Backup Configs 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically take backups of various Airflow configurations and files. 4 | 5 | ## Deploy 6 | 7 | 1. Login to the machine running Airflow 8 | 9 | 2. Navigate to the dags directory 10 | 11 | 3. Copy the airflow-backup-configs.py file to this dags directory 12 | 13 | a. Here's a fast way: 14 | 15 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/backup-configs/airflow-backup-configs.py 16 | 17 | 4. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES, BACKUP_FOLDER_DATE_FORMAT, BACKUP_HOME_DIRECTORY, BACKUPS_ENABLED, and BACKUP_RETENTION_COUNT) in the DAG with the desired values 18 | 19 | 6. Enable the DAG in the Airflow Webserver 20 | 21 | -------------------------------------------------------------------------------- /delete-broken-dags/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Delete Broken DAGs 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically delete DAG files and clean out entries in the 4 | ImportError table for DAGs which Airflow cannot parse or import properly. This ensures that the ImportError table is cleaned every day. 5 | 6 | ## Deploy 7 | 8 | 1. Login to the machine running Airflow 9 | 10 | 2. Navigate to the dags directory 11 | 12 | 3. Copy the airflow-delete-broken-dags.py file to this dags directory 13 | 14 | a. Here's a fast way: 15 | 16 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/delete-broken-dags/airflow-delete-broken-dags.py 17 | 18 | 4. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES and ENABLE_DELETE) in the DAG with the desired values 19 | 20 | 5. Enable the DAG in the Airflow Webserver 21 | -------------------------------------------------------------------------------- /kill-halted-tasks/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Kill Halted Tasks 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically kill off tasks that are running in the background that don't correspond to a running task in the DB. 4 | 5 | This is useful because when you kill off a DAG Run or Task through the Airflow Web Server, the task still runs in the background on one of the executors until the task is complete. 6 | 7 | ## Deploy 8 | 9 | 1. Login to the machine running Airflow 10 | 11 | 2. Navigate to the dags directory 12 | 13 | 3. Copy the airflow-kill-halted-tasks.py file to this dags directory 14 | 15 | a. Here's a fast way: 16 | 17 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/kill-halted-tasks/airflow-kill-halted-tasks.py 18 | 19 | 4. 
Update the global variables in the DAG with the desired values 20 | 21 | 5. Enable the DAG in the Airflow Webserver 22 | 23 | -------------------------------------------------------------------------------- /clear-missing-dags/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Clear Missing DAGs 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically clean out entries in the DAG table of which there is no longer a corresponding Python File for it. This ensures that the DAG table doesn't have needless items in it and that the Airflow Web Server displays only those available DAGs. 4 | 5 | ## Deploy 6 | 7 | 1. Login to the machine running Airflow 8 | 9 | 2. Navigate to the dags directory 10 | 11 | 3. Copy the airflow-clear-missing-dags.py file to this dags directory 12 | 13 | a. Here's a fast way: 14 | 15 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/clear-missing-dags/airflow-clear-missing-dags.py 16 | 17 | 4. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES and ENABLE_DELETE) in the DAG with the desired values 18 | 19 | 5. Enable the DAG in the Airflow Webserver 20 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | # IDEA 92 | .idea 93 | 94 | # DS-STORE REMOVAL 95 | .DS_Store 96 | -------------------------------------------------------------------------------- /db-cleanup/README.md: -------------------------------------------------------------------------------- 1 | # Airflow DB Cleanup 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically clean out the DagRun, TaskInstance, Log, XCom, Job DB and SlaMiss entries to avoid having too much data in your Airflow MetaStore. 4 | 5 | ## Deploy 6 | 7 | 1. Login to the machine running Airflow 8 | 9 | 2. Navigate to the dags directory 10 | 11 | 3. 
Copy the airflow-db-cleanup.py file to this dags directory 12 | 13 | a. Here's a fast way: 14 | 15 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/db-cleanup/airflow-db-cleanup.py 16 | 17 | 4. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES and ENABLE_DELETE) in the DAG with the desired values 18 | 19 | 5. Modify the DATABASE_OBJECTS list to add/remove objects as needed. Each dictionary in the list features the following parameters: 20 | - airflow_db_model: Model imported from airflow.models corresponding to a table in the airflow metadata database 21 | - age_check_column: Column in the model/table to use for calculating max date of data deletion 22 | - keep_last: Boolean to specify whether to preserve last run instance 23 | - keep_last_filters: List of filters to preserve data from deleting during clean-up, such as DAG runs where the external trigger is set to 0. 24 | - keep_last_group_by: Option to specify column by which to group the database entries and perform aggregate functions. 25 | 26 | 6. Create and Set the following Variables in the Airflow Web Server (Admin -> Variables) 27 | 28 | - airflow_db_cleanup__max_db_entry_age_in_days - integer - Length to retain the log files if not already provided in the conf. If this is set to 30, the job will remove those files that are 30 days old or older. 29 | 30 | 7. Enable the DAG in the Airflow Webserver 31 | 32 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # airflow-maintenance-dags 2 | A series of DAGs/Workflows to help maintain the operation of Airflow 3 | 4 | ## DAGs/Workflows 5 | 6 | * [backup-configs](backup-configs) 7 | * A maintenance workflow that you can deploy into Airflow to periodically take backups of various Airflow configurations and files. 8 | * [clear-missing-dags](clear-missing-dags) 9 | * A maintenance workflow that you can deploy into Airflow to periodically clean out entries in the DAG table of which there is no longer a corresponding Python File for it. This ensures that the DAG table doesn't have needless items in it and that the Airflow Web Server displays only those available DAGs. 10 | * [db-cleanup](db-cleanup) 11 | * A maintenance workflow that you can deploy into Airflow to periodically clean out the DagRun, TaskInstance, Log, XCom, Job DB and SlaMiss entries to avoid having too much data in your Airflow MetaStore. 12 | * [kill-halted-tasks](kill-halted-tasks) 13 | * A maintenance workflow that you can deploy into Airflow to periodically kill off tasks that are running in the background that don't correspond to a running task in the DB. 14 | * This is useful because when you kill off a DAG Run or Task through the Airflow Web Server, the task still runs in the background on one of the executors until the task is complete. 15 | * [log-cleanup](log-cleanup) 16 | * A maintenance workflow that you can deploy into Airflow to periodically clean out the task logs to avoid those getting too big. 17 | * [delete-broken-dags](delete-broken-dags) 18 | * A maintenance workflow that you can deploy into Airflow to periodically delete DAG files and clean out entries in the ImportError table for DAGs which Airflow cannot parse or import properly. This ensures that the ImportError table is cleaned every day. 
19 | * [sla-miss-report](sla-miss-report) 20 | * DAG providing an extensive analysis report of SLA misses broken down on a daily, hourly, and task level 21 | -------------------------------------------------------------------------------- /log-cleanup/README.md: -------------------------------------------------------------------------------- 1 | # Airflow Log Cleanup 2 | 3 | A maintenance workflow that you can deploy into Airflow to periodically clean out the task logs to avoid those getting too big. 4 | 5 | - **airflow-log-cleanup.py**: Allows to delete logs by specifying the **number** of worker nodes. Does not guarantee log deletion of all nodes. 6 | - **airflow-log-cleanup-pwdless-ssh.py**: Allows to delete logs by specifying the list of worker nodes by their hostname. Requires the `airflow` user to have passwordless ssh to access all nodes. 7 | 8 | ## Deploy 9 | 10 | 1. Login to the machine running Airflow 11 | 2. Navigate to the dags directory 12 | 3. Select the DAG to deploy (with or without SSH access) and follow the instructions 13 | 14 | ### airflow-log-cleanup.py 15 | 16 | 1. Copy the airflow-log-cleanup.py file to this dags directory 17 | 18 | a. Here's a fast way: 19 | 20 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/log-cleanup/airflow-log-cleanup.py 21 | 22 | 2. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES, ENABLE_DELETE and NUMBER_OF_WORKERS) in the DAG with the desired values 23 | 24 | 3. Create and Set the following Variables in the Airflow Web Server (Admin -> Variables) 25 | 26 | - airflow_log_cleanup__max_log_age_in_days - integer - Length to retain the log files if not already provided in the conf. If this is set to 30, the job will remove those files that are 30 days old or older. 27 | - airflow_log_cleanup__enable_delete_child_log - boolean (True/False) - Whether to delete files from the Child Log directory defined under [scheduler] in the airflow.cfg file 28 | 29 | 4. Enable the DAG in the Airflow Webserver 30 | 31 | ### airflow-log-cleanup-pwdless-ssh.py ### 32 | 33 | 1. Copy the airflow-log-cleanup-pwdless-ssh.py file to this dags directory 34 | 35 | a. Here's a fast way: 36 | 37 | $ wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/log-cleanup/airflow-log-cleanup-pwdless-ssh.py 38 | 39 | 2. Update the global variables (SCHEDULE_INTERVAL, DAG_OWNER_NAME, ALERT_EMAIL_ADDRESSES, ENABLE_DELETE and AIRFLOW_HOSTS) in the DAG with the desired values 40 | 41 | 3. Create and Set the following Variables in the Airflow Web Server (Admin -> Variables) 42 | 43 | - airflow_log_cleanup__max_log_age_in_days - integer - Length to retain the log files if not already provided in the conf. If this is set to 30, the job will remove those files that are 30 days old or older. 44 | - airflow_log_cleanup__enable_delete_child_log - boolean (True/False) - Whether to delete files from the Child Log directory defined under [scheduler] in the airflow.cfg file 45 | 46 | 4. Ensure the `airflow` user can passwordless SSH on the hosts listed in `AIRFLOW_HOSTS` 47 | 1. Create a public and private key SSH key on all the worker nodes. You can follow these instructions: https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys--2 48 | 2. Add the public key content to the ~/.ssh/authorized_keys file on all the other machines 49 | 50 | 5. 
Enable the DAG in the Airflow Webserver 51 | -------------------------------------------------------------------------------- /delete-broken-dags/airflow-delete-broken-dags.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically delete broken DAG file(s). 3 | 4 | airflow trigger_dag airflow-delete-broken-dags 5 | 6 | """ 7 | from airflow.models import DAG, ImportError 8 | from airflow.operators.python_operator import PythonOperator 9 | from airflow import settings 10 | from datetime import timedelta 11 | import os 12 | import os.path 13 | import socket 14 | import logging 15 | import airflow 16 | 17 | 18 | # airflow-delete-broken-dags 19 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 20 | START_DATE = airflow.utils.dates.days_ago(1) 21 | # How often to Run. @daily - Once a day at Midnight 22 | SCHEDULE_INTERVAL = "@daily" 23 | # Who is listed as the owner of this DAG in the Airflow Web Server 24 | DAG_OWNER_NAME = "operations" 25 | # List of email address to send email alerts to if this job fails 26 | ALERT_EMAIL_ADDRESSES = [] 27 | # Whether the job should delete the logs or not. Included if you want to 28 | # temporarily avoid deleting the logs 29 | ENABLE_DELETE = True 30 | 31 | default_args = { 32 | 'owner': DAG_OWNER_NAME, 33 | 'email': ALERT_EMAIL_ADDRESSES, 34 | 'email_on_failure': True, 35 | 'email_on_retry': False, 36 | 'start_date': START_DATE, 37 | 'retries': 1, 38 | 'retry_delay': timedelta(minutes=1) 39 | } 40 | 41 | dag = DAG( 42 | DAG_ID, 43 | default_args=default_args, 44 | schedule_interval=SCHEDULE_INTERVAL, 45 | start_date=START_DATE, 46 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 47 | ) 48 | if hasattr(dag, 'doc_md'): 49 | dag.doc_md = __doc__ 50 | if hasattr(dag, 'catchup'): 51 | dag.catchup = False 52 | 53 | 54 | def delete_broken_dag_files(**context): 55 | 56 | logging.info("Starting to run Clear Process") 57 | 58 | try: 59 | host_name = socket.gethostname() 60 | host_ip = socket.gethostbyname(host_name) 61 | logging.info("Running on Machine with Host Name: " + host_name) 62 | logging.info("Running on Machine with IP: " + host_ip) 63 | except Exception as e: 64 | print("Unable to get Host Name and IP: " + str(e)) 65 | 66 | session = settings.Session() 67 | 68 | logging.info("Configurations:") 69 | logging.info("enable_delete: " + str(ENABLE_DELETE)) 70 | logging.info("session: " + str(session)) 71 | logging.info("") 72 | 73 | errors = session.query(ImportError).all() 74 | 75 | logging.info( 76 | "Process will be removing broken DAG file(s) from the file system:" 77 | ) 78 | for error in errors: 79 | logging.info("\tFile: " + str(error.filename)) 80 | logging.info( 81 | "Process will be Deleting " + str(len(errors)) + " DAG file(s)" 82 | ) 83 | 84 | if ENABLE_DELETE: 85 | logging.info("Performing Delete...") 86 | for error in errors: 87 | if os.path.exists(error.filename): 88 | os.remove(error.filename) 89 | session.delete(error) 90 | logging.info("Finished Performing Delete") 91 | else: 92 | logging.warn("You're opted to skip Deleting the DAG file(s)!!!") 93 | 94 | logging.info("Finished") 95 | 96 | 97 | delete_broken_dag_files = PythonOperator( 98 | task_id='delete_broken_dag_files', 99 | python_callable=delete_broken_dag_files, 100 | provide_context=True, 101 | dag=dag) 102 | -------------------------------------------------------------------------------- /clear-missing-dags/airflow-clear-missing-dags.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically clean out entries in the DAG table of which there is no longer a corresponding Python File for it. This ensures that the DAG table doesn't have needless items in it and that the Airflow Web Server displays only those available DAGs. 3 | 4 | airflow trigger_dag airflow-clear-missing-dags 5 | 6 | """ 7 | from airflow.models import DAG, DagModel 8 | from airflow.operators.python_operator import PythonOperator 9 | from airflow import settings 10 | from datetime import timedelta 11 | import os 12 | import os.path 13 | import socket 14 | import logging 15 | import airflow 16 | 17 | 18 | # airflow-clear-missing-dags 19 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 20 | START_DATE = airflow.utils.dates.days_ago(1) 21 | # How often to Run. @daily - Once a day at Midnight 22 | SCHEDULE_INTERVAL = "@daily" 23 | # Who is listed as the owner of this DAG in the Airflow Web Server 24 | DAG_OWNER_NAME = "operations" 25 | # List of email address to send email alerts to if this job fails 26 | ALERT_EMAIL_ADDRESSES = [] 27 | # Whether the job should delete the logs or not. Included if you want to 28 | # temporarily avoid deleting the logs 29 | ENABLE_DELETE = True 30 | 31 | default_args = { 32 | 'owner': DAG_OWNER_NAME, 33 | 'depends_on_past': False, 34 | 'email': ALERT_EMAIL_ADDRESSES, 35 | 'email_on_failure': True, 36 | 'email_on_retry': False, 37 | 'start_date': START_DATE, 38 | 'retries': 1, 39 | 'retry_delay': timedelta(minutes=1) 40 | } 41 | 42 | dag = DAG( 43 | DAG_ID, 44 | default_args=default_args, 45 | schedule_interval=SCHEDULE_INTERVAL, 46 | start_date=START_DATE, 47 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 48 | ) 49 | if hasattr(dag, 'doc_md'): 50 | dag.doc_md = __doc__ 51 | if hasattr(dag, 'catchup'): 52 | dag.catchup = False 53 | 54 | 55 | def clear_missing_dags_fn(**context): 56 | 57 | logging.info("Starting to run Clear Process") 58 | 59 | try: 60 | host_name = socket.gethostname() 61 | host_ip = socket.gethostbyname(host_name) 62 | logging.info("Running on Machine with Host Name: " + host_name) 63 | logging.info("Running on Machine with IP: " + host_ip) 64 | except Exception as e: 65 | print("Unable to get Host Name and IP: " + str(e)) 66 | 67 | session = settings.Session() 68 | 69 | logging.info("Configurations:") 70 | logging.info("enable_delete: " + str(ENABLE_DELETE)) 71 | logging.info("session: " + str(session)) 72 | logging.info("") 73 | 74 | dags = session.query(DagModel).all() 75 | entries_to_delete = [] 76 | for dag in dags: 77 | # Check if it is a zip-file 78 | if dag.fileloc is not None and '.zip/' in dag.fileloc: 79 | index = dag.fileloc.rfind('.zip/') + len('.zip') 80 | fileloc = dag.fileloc[0:index] 81 | else: 82 | fileloc = dag.fileloc 83 | 84 | if fileloc is None: 85 | logging.info( 86 | "After checking DAG '" + str(dag) + 87 | "', the fileloc was set to None so assuming the Python " + 88 | "definition file DOES NOT exist" 89 | ) 90 | entries_to_delete.append(dag) 91 | elif not os.path.exists(fileloc): 92 | logging.info( 93 | "After checking DAG '" + str(dag) + 94 | "', the Python definition file DOES NOT exist: " + fileloc 95 | ) 96 | entries_to_delete.append(dag) 97 | else: 98 | logging.info( 99 | "After checking DAG '" + str(dag) + 100 | "', the Python definition file does exist: " + fileloc 101 | ) 102 | 103 | logging.info("Process will be Deleting the DAG(s) from 
the DB:") 104 | for entry in entries_to_delete: 105 | logging.info("\tEntry: " + str(entry)) 106 | logging.info( 107 | "Process will be Deleting " + str(len(entries_to_delete)) + " DAG(s)" 108 | ) 109 | 110 | if ENABLE_DELETE: 111 | logging.info("Performing Delete...") 112 | for entry in entries_to_delete: 113 | session.delete(entry) 114 | session.commit() 115 | logging.info("Finished Performing Delete") 116 | else: 117 | logging.warn("You're opted to skip deleting the DAG entries!!!") 118 | 119 | logging.info("Finished Running Clear Process") 120 | 121 | 122 | clear_missing_dags = PythonOperator( 123 | task_id='clear_missing_dags', 124 | python_callable=clear_missing_dags_fn, 125 | provide_context=True, 126 | dag=dag) 127 | -------------------------------------------------------------------------------- /sla-miss-report/README.md: -------------------------------------------------------------------------------- 1 | # Airflow SLA Miss Report 2 | 3 | - [About](#about) 4 | - [Daily SLA Misses (timeframe: `long`)](#daily-sla-misses-timeframe-long) 5 | - [Hourly SLA Misses (timeframe: `short`)](#hourly-sla-misses-timeframe-short) 6 | - [DAG SLA Misses (timeframe: `short, medium, long`)](#dag-sla-misses-timeframe-short-medium-long) 7 | - [Sample Email](#sample-email) 8 | - [Sample Airflow Task Logs](#sample-airflow-task-logs) 9 | - [Architecture](#architecture) 10 | - [Requirements](#requirements) 11 | - [Deployment](#deployment) 12 | - [References](#references) 13 | 14 | 15 | ### About 16 | Airflow allows users to define [SLAs](https://github.com/teamclairvoyant/airflow-maintenance-dags/blob/teamclairvoyant/sla-miss-report/sla-miss-report/README.md) at DAG & task levels to track instances where processes are running longer than usual. However, making sense of the data is a challenge. 17 | 18 | The `airflow-sla-miss-report` DAG consolidates the data from the metadata tables and provides meaningful insights to ensure SLAs are met when set. 19 | 20 | The DAG utilizes **three (3) timeframes** (default: `short`: 1d, `medium`: 3d, `long`: 7d) to calculate the following KPIs: 21 | 22 | #### Daily SLA Misses (timeframe: `long`) 23 | Following details broken down on a daily basis for the provided long timeframe (e.g. 7 days): 24 | ``` 25 | SLA Miss %: percentage of tasks that missed their SLAs out of total tasks runs 26 | Top Violator (%): task that violated its SLA the most as a percentage of its total runs 27 | Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day 28 | ``` 29 | 30 | #### Hourly SLA Misses (timeframe: `short`) 31 | Following details broken down on an hourly basis for the provided short timeframe (e.g. 
1 day): 32 | ``` 33 | SLA Miss %: percentage of tasks that missed their SLAs out of total tasks runs 34 | Top Violator (%): task that violated its SLA the most as a percentage of its total runs 35 | Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day 36 | Longest Running Task: task that took the longest time to execute within the hour window 37 | Average Task Queue Time (s): avg time taken for tasks in `queued` state; can be used to detect scheduling bottlenecks 38 | ``` 39 | 40 | #### DAG SLA Misses (timeframe: `short, medium, long`) 41 | Following details broken down on a task level for all timeframes: 42 | ``` 43 | Current SLA (s): current defined SLA for the task 44 | Short, Medium, Long Timeframe SLA miss % (avg execution time): % of tasks that missed their SLAs & their avg execution times over the respective timeframes 45 | ``` 46 | 47 | #### **Sample Email** 48 | ![Airflow SLA miss Email Report Output1](https://user-images.githubusercontent.com/32403237/193700720-24b88202-edae-4199-a7f3-0e46e54e0d5d.png) 49 | 50 | #### **Sample Airflow Task Logs** 51 | ![Airflow SLA miss Email Report Output2](https://user-images.githubusercontent.com/32403237/194130208-da532d3a-3ff4-4dbd-9c94-574ef42b2ee8.png) 52 | 53 | 54 | ### Architecture 55 | The process reads data from the Airflow metadata database to calculate SLA misses based on the defined DAG/task level SLAs using information. 56 | The following metadata tables are utilized: 57 | - `SerializedDag`: retrieve defined DAG & task SLAs 58 | - `DagRuns`: details about each DAG run 59 | - `TaskInstances`: details about each task instance in a DAG run 60 | 61 | ![Airflow SLA Process Flow Architecture](https://user-images.githubusercontent.com/8946659/191114560-2368e2df-916a-4f66-b1ac-b6cfe0b35a47.png) 62 | 63 | ### Requirements 64 | - Python: 3.7 and above 65 | - Pip packages: `pandas` 66 | - Airflow: v2.3 and above 67 | - Airflow metadata tables: `DagRuns`, `TaskInstances`, `SerializedDag` 68 | - [SMTP details](https://airflow.apache.org/docs/apache-airflow/stable/howto/email-config.html#using-default-smtp) in `airflow.cfg` for sending emails 69 | 70 | ### Deployment 71 | 1. Login to the machine running Airflow 72 | 2. Navigate to the `dags` directory 73 | 3. Copy the `airflow-sla-miss-report.py` file to the `dags` directory. Here's a fast way: 74 | ``` 75 | wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/sla-miss-report/airflow-sla-miss-report.py 76 | ``` 77 | 4. Update the global variables in the DAG with the desired values: 78 | ``` 79 | EMAIL_ADDRESSES (optional): list of recipient emails to send the SLA report 80 | SHORT_TIMEFRAME_IN_DAYS: duration in days of the short timeframe to calculate SLA metrics (default: 1) 81 | MEDIUM_TIMEFRAME_IN_DAYS: duration in days of the medium timeframe to calculate SLA metrics (default: 3) 82 | LONG_TIMEFRAME_IN_DAYS: duration in days of the long timeframe to calculate SLA metrics (default: 7) 83 | ``` 84 | 5. Enable the DAG in the Airflow Webserver -------------------------------------------------------------------------------- /log-cleanup/airflow-log-cleanup-pwdless-ssh.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically clean 3 | out the task logs to avoid those getting too big. 
4 | airflow trigger_dag --conf '[curly-braces]"maxLogAgeInDays":30[curly-braces]' airflow-log-cleanup 5 | --conf options: 6 | maxLogAgeInDays: - Optional 7 | """ 8 | import logging 9 | import os 10 | import time 11 | from datetime import timedelta 12 | 13 | import airflow 14 | from airflow.configuration import conf 15 | from airflow.models import DAG, Variable 16 | from airflow.operators.bash_operator import BashOperator 17 | from airflow.operators.dummy_operator import DummyOperator 18 | 19 | # airflow-log-cleanup 20 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 21 | START_DATE = airflow.utils.dates.days_ago(1) 22 | try: 23 | BASE_LOG_FOLDER = conf.get("core", "BASE_LOG_FOLDER").rstrip("/") 24 | except Exception as e: 25 | BASE_LOG_FOLDER = conf.get("logging", "BASE_LOG_FOLDER").rstrip("/") 26 | # How often to Run. @daily - Once a day at Midnight 27 | SCHEDULE_INTERVAL = "@daily" 28 | # Who is listed as the owner of this DAG in the Airflow Web Server 29 | DAG_OWNER_NAME = "operations" 30 | # List of email address to send email alerts to if this job fails 31 | ALERT_EMAIL_ADDRESSES = [] 32 | # Length to retain the log files if not already provided in the conf. If this 33 | # is set to 30, the job will remove those files that are 30 days old or older 34 | DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get( 35 | "airflow_log_cleanup__max_log_age_in_days", 30 36 | ) 37 | # Whether the job should delete the logs or not. Included if you want to 38 | # temporarily avoid deleting the logs 39 | ENABLE_DELETE = False 40 | 41 | AIRFLOW_HOSTS = "localhost" # comma separated list of host(s) 42 | 43 | TEMP_LOG_CLEANUP_SCRIPT_PATH = "/tmp/airflow_log_cleanup.sh" 44 | DIRECTORIES_TO_DELETE = [BASE_LOG_FOLDER] 45 | ENABLE_DELETE_CHILD_LOG = Variable.get( 46 | "airflow_log_cleanup__enable_delete_child_log", "False" 47 | ) 48 | 49 | logging.info("ENABLE_DELETE_CHILD_LOG " + ENABLE_DELETE_CHILD_LOG) 50 | 51 | if not BASE_LOG_FOLDER or BASE_LOG_FOLDER.strip() == "": 52 | raise ValueError( 53 | "BASE_LOG_FOLDER variable is empty in airflow.cfg. It can be found " 54 | "under the [core] (<2.0.0) section or [logging] (>=2.0.0) in the cfg file. " 55 | "Kindly provide an appropriate directory path." 56 | ) 57 | 58 | if ENABLE_DELETE_CHILD_LOG.lower() == "true": 59 | try: 60 | CHILD_PROCESS_LOG_DIRECTORY = conf.get( 61 | "scheduler", "CHILD_PROCESS_LOG_DIRECTORY" 62 | ) 63 | if CHILD_PROCESS_LOG_DIRECTORY != ' ': 64 | DIRECTORIES_TO_DELETE.append(CHILD_PROCESS_LOG_DIRECTORY) 65 | except Exception as e: 66 | logging.exception( 67 | "Could not obtain CHILD_PROCESS_LOG_DIRECTORY from " + 68 | "Airflow Configurations: " + str(e) 69 | ) 70 | 71 | default_args = { 72 | 'owner': DAG_OWNER_NAME, 73 | 'depends_on_past': False, 74 | 'email': ALERT_EMAIL_ADDRESSES, 75 | 'email_on_failure': True, 76 | 'email_on_retry': False, 77 | 'start_date': START_DATE, 78 | 'retries': 1, 79 | 'retry_delay': timedelta(minutes=1) 80 | } 81 | 82 | dag = DAG( 83 | DAG_ID, 84 | default_args=default_args, 85 | schedule_interval=SCHEDULE_INTERVAL, 86 | start_date=START_DATE, 87 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 88 | ) 89 | if hasattr(dag, 'doc_md'): 90 | dag.doc_md = __doc__ 91 | if hasattr(dag, 'catchup'): 92 | dag.catchup = False 93 | 94 | log_cleanup = """ 95 | echo "Getting Configurations..." 
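# This wrapper script is copied to every host listed in AIRFLOW_HOSTS and then
# invoked over SSH with three positional arguments:
#   $1 - log directory to clean    $2 - max log age in days    $3 - delete flag (true/false)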
96 | 97 | BASE_LOG_FOLDER=$1 98 | MAX_LOG_AGE_IN_DAYS=$2 99 | ENABLE_DELETE=$3 100 | 101 | echo "Finished Getting Configurations" 102 | echo "" 103 | 104 | echo "Configurations:" 105 | echo "BASE_LOG_FOLDER: \'${BASE_LOG_FOLDER}\'" 106 | echo "MAX_LOG_AGE_IN_DAYS: \'${MAX_LOG_AGE_IN_DAYS}\'" 107 | echo "ENABLE_DELETE: \'${ENABLE_DELETE}\'" 108 | 109 | cleanup() { 110 | echo "Executing Find Statement: $1" 111 | FILES_MARKED_FOR_DELETE=$(eval $1) 112 | echo "Process will be Deleting the following files or directories:" 113 | echo "${FILES_MARKED_FOR_DELETE}" 114 | echo "Process will be Deleting $(echo "${FILES_MARKED_FOR_DELETE}" | 115 | grep -v \'^$\' | wc -l) files or directories" 116 | 117 | echo "" 118 | if [ "${ENABLE_DELETE}" == "true" ]; then 119 | if [ "${FILES_MARKED_FOR_DELETE}" != "" ]; then 120 | echo "Executing Delete Statement: $2" 121 | eval $2 122 | DELETE_STMT_EXIT_CODE=$? 123 | if [ "${DELETE_STMT_EXIT_CODE}" != "0" ]; then 124 | echo "Delete process failed with exit code \'${DELETE_STMT_EXIT_CODE}\'" 125 | 126 | exit ${DELETE_STMT_EXIT_CODE} 127 | fi 128 | else 129 | echo "WARN: No files or directories to Delete" 130 | fi 131 | else 132 | echo "WARN: You have opted to skip deleting the files or directories" 133 | fi 134 | } 135 | 136 | echo "" 137 | echo "Running Cleanup Process..." 138 | 139 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type f -mtime +${MAX_LOG_AGE_IN_DAYS}" 140 | DELETE_STMT="${FIND_STATEMENT} -exec rm -f {} \;" 141 | 142 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 143 | CLEANUP_EXIT_CODE=$? 144 | 145 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type d -empty" 146 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 147 | 148 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 149 | CLEANUP_EXIT_CODE=$? 150 | 151 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/* -type d -empty" 152 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 153 | 154 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 155 | CLEANUP_EXIT_CODE=$? 156 | 157 | echo "Finished Running Cleanup Process" 158 | """ 159 | 160 | create_log_cleanup_script = BashOperator( 161 | task_id=f'create_log_cleanup_script', 162 | bash_command=f""" 163 | echo '{log_cleanup}' > {TEMP_LOG_CLEANUP_SCRIPT_PATH} 164 | chmod +x {TEMP_LOG_CLEANUP_SCRIPT_PATH} 165 | current_host=$(echo $HOSTNAME) 166 | echo "Current Host: $current_host" 167 | hosts_string={AIRFLOW_HOSTS} 168 | echo "All Scheduler Hosts: $hosts_string" 169 | IFS=',' read -ra host_array <<< "$hosts_string" 170 | for host in "${{host_array[@]}}" 171 | do 172 | if [ "$host" != "$current_host" ]; then 173 | echo "Copying log_cleanup script to $host..." 174 | scp {TEMP_LOG_CLEANUP_SCRIPT_PATH} $host:{TEMP_LOG_CLEANUP_SCRIPT_PATH} 175 | echo "Making the script executable..." 176 | ssh -o StrictHostKeyChecking=no $host "chmod +x {TEMP_LOG_CLEANUP_SCRIPT_PATH}" 177 | fi 178 | done 179 | """, 180 | dag=dag) 181 | 182 | for host in AIRFLOW_HOSTS.split(","): 183 | for DIR_ID, DIRECTORY in enumerate(DIRECTORIES_TO_DELETE): 184 | LOG_CLEANUP_COMMAND = f'{TEMP_LOG_CLEANUP_SCRIPT_PATH} {DIRECTORY} {DEFAULT_MAX_LOG_AGE_IN_DAYS} {str(ENABLE_DELETE).lower()}' 185 | cleanup_task = BashOperator( 186 | task_id=f'airflow_log_cleanup_{host}_dir_{DIR_ID}', 187 | bash_command=f""" 188 | echo "Executing cleanup script..." 189 | ssh -o StrictHostKeyChecking=no {host} "{LOG_CLEANUP_COMMAND}" 190 | echo "Removing cleanup script..." 
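            # Delete the temporary wrapper script on the remote host now that this cleanup run has finished.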
191 | ssh -o StrictHostKeyChecking=no {host} "rm {TEMP_LOG_CLEANUP_SCRIPT_PATH}" 192 | """, 193 | dag=dag) 194 | 195 | cleanup_task.set_upstream(create_log_cleanup_script) 196 | -------------------------------------------------------------------------------- /log-cleanup/airflow-log-cleanup.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically clean 3 | out the task logs to avoid those getting too big. 4 | airflow trigger_dag --conf '[curly-braces]"maxLogAgeInDays":30[curly-braces]' airflow-log-cleanup 5 | --conf options: 6 | maxLogAgeInDays: - Optional 7 | """ 8 | import logging 9 | import os 10 | from datetime import timedelta 11 | 12 | import airflow 13 | import jinja2 14 | from airflow.configuration import conf 15 | from airflow.models import DAG, Variable 16 | from airflow.operators.bash_operator import BashOperator 17 | from airflow.operators.dummy_operator import DummyOperator 18 | 19 | # airflow-log-cleanup 20 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 21 | START_DATE = airflow.utils.dates.days_ago(1) 22 | try: 23 | BASE_LOG_FOLDER = conf.get("core", "BASE_LOG_FOLDER").rstrip("/") 24 | except Exception as e: 25 | BASE_LOG_FOLDER = conf.get("logging", "BASE_LOG_FOLDER").rstrip("/") 26 | # How often to Run. @daily - Once a day at Midnight 27 | SCHEDULE_INTERVAL = "@daily" 28 | # Who is listed as the owner of this DAG in the Airflow Web Server 29 | DAG_OWNER_NAME = "operations" 30 | # List of email address to send email alerts to if this job fails 31 | ALERT_EMAIL_ADDRESSES = [] 32 | # Length to retain the log files if not already provided in the conf. If this 33 | # is set to 30, the job will remove those files that are 30 days old or older 34 | DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get( 35 | "airflow_log_cleanup__max_log_age_in_days", 30 36 | ) 37 | # Whether the job should delete the logs or not. Included if you want to 38 | # temporarily avoid deleting the logs 39 | ENABLE_DELETE = True 40 | # The number of worker nodes you have in Airflow. Will attempt to run this 41 | # process for however many workers there are so that each worker gets its 42 | # logs cleared. 43 | NUMBER_OF_WORKERS = 1 44 | DIRECTORIES_TO_DELETE = [BASE_LOG_FOLDER] 45 | ENABLE_DELETE_CHILD_LOG = Variable.get( 46 | "airflow_log_cleanup__enable_delete_child_log", "False" 47 | ) 48 | LOG_CLEANUP_PROCESS_LOCK_FILE = "/tmp/airflow_log_cleanup_worker.lock" 49 | logging.info("ENABLE_DELETE_CHILD_LOG " + ENABLE_DELETE_CHILD_LOG) 50 | 51 | if not BASE_LOG_FOLDER or BASE_LOG_FOLDER.strip() == "": 52 | raise ValueError( 53 | "BASE_LOG_FOLDER variable is empty in airflow.cfg. It can be found " 54 | "under the [core] (<2.0.0) section or [logging] (>=2.0.0) in the cfg file. " 55 | "Kindly provide an appropriate directory path." 
56 | ) 57 | 58 | if ENABLE_DELETE_CHILD_LOG.lower() == "true": 59 | try: 60 | CHILD_PROCESS_LOG_DIRECTORY = conf.get( 61 | "scheduler", "CHILD_PROCESS_LOG_DIRECTORY" 62 | ) 63 | if CHILD_PROCESS_LOG_DIRECTORY != ' ': 64 | DIRECTORIES_TO_DELETE.append(CHILD_PROCESS_LOG_DIRECTORY) 65 | except Exception as e: 66 | logging.exception( 67 | "Could not obtain CHILD_PROCESS_LOG_DIRECTORY from " + 68 | "Airflow Configurations: " + str(e) 69 | ) 70 | 71 | default_args = { 72 | 'owner': DAG_OWNER_NAME, 73 | 'depends_on_past': False, 74 | 'email': ALERT_EMAIL_ADDRESSES, 75 | 'email_on_failure': True, 76 | 'email_on_retry': False, 77 | 'start_date': START_DATE, 78 | 'retries': 1, 79 | 'retry_delay': timedelta(minutes=1) 80 | } 81 | 82 | dag = DAG( 83 | DAG_ID, 84 | default_args=default_args, 85 | schedule_interval=SCHEDULE_INTERVAL, 86 | start_date=START_DATE, 87 | tags=['teamclairvoyant', 'airflow-maintenance-dags'], 88 | template_undefined=jinja2.Undefined 89 | ) 90 | if hasattr(dag, 'doc_md'): 91 | dag.doc_md = __doc__ 92 | if hasattr(dag, 'catchup'): 93 | dag.catchup = False 94 | 95 | start = DummyOperator( 96 | task_id='start', 97 | dag=dag) 98 | 99 | log_cleanup = """ 100 | 101 | echo "Getting Configurations..." 102 | BASE_LOG_FOLDER="{{params.directory}}" 103 | WORKER_SLEEP_TIME="{{params.sleep_time}}" 104 | 105 | sleep ${WORKER_SLEEP_TIME}s 106 | 107 | MAX_LOG_AGE_IN_DAYS="{{dag_run.conf.maxLogAgeInDays}}" 108 | if [ "${MAX_LOG_AGE_IN_DAYS}" == "" ]; then 109 | echo "maxLogAgeInDays conf variable isn't included. Using Default '""" + str(DEFAULT_MAX_LOG_AGE_IN_DAYS) + """'." 110 | MAX_LOG_AGE_IN_DAYS='""" + str(DEFAULT_MAX_LOG_AGE_IN_DAYS) + """' 111 | fi 112 | ENABLE_DELETE=""" + str("true" if ENABLE_DELETE else "false") + """ 113 | echo "Finished Getting Configurations" 114 | echo "" 115 | 116 | echo "Configurations:" 117 | echo "BASE_LOG_FOLDER: '${BASE_LOG_FOLDER}'" 118 | echo "MAX_LOG_AGE_IN_DAYS: '${MAX_LOG_AGE_IN_DAYS}'" 119 | echo "ENABLE_DELETE: '${ENABLE_DELETE}'" 120 | 121 | cleanup() { 122 | echo "Executing Find Statement: $1" 123 | FILES_MARKED_FOR_DELETE=`eval $1` 124 | echo "Process will be Deleting the following File(s)/Directory(s):" 125 | echo "${FILES_MARKED_FOR_DELETE}" 126 | echo "Process will be Deleting `echo "${FILES_MARKED_FOR_DELETE}" | \ 127 | grep -v '^$' | wc -l` File(s)/Directory(s)" \ 128 | # "grep -v '^$'" - removes empty lines. 129 | # "wc -l" - Counts the number of lines 130 | echo "" 131 | if [ "${ENABLE_DELETE}" == "true" ]; 132 | then 133 | if [ "${FILES_MARKED_FOR_DELETE}" != "" ]; 134 | then 135 | echo "Executing Delete Statement: $2" 136 | eval $2 137 | DELETE_STMT_EXIT_CODE=$? 138 | if [ "${DELETE_STMT_EXIT_CODE}" != "0" ]; then 139 | echo "Delete process failed with exit code \ 140 | '${DELETE_STMT_EXIT_CODE}'" 141 | 142 | echo "Removing lock file..." 143 | rm -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 144 | if [ "${REMOVE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 145 | echo "Error removing the lock file. \ 146 | Check file permissions.\ 147 | To re-run the DAG, ensure that the lock file has been \ 148 | deleted (""" + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """)." 149 | exit ${REMOVE_LOCK_FILE_EXIT_CODE} 150 | fi 151 | exit ${DELETE_STMT_EXIT_CODE} 152 | fi 153 | else 154 | echo "WARN: No File(s)/Directory(s) to Delete" 155 | fi 156 | else 157 | echo "WARN: You're opted to skip deleting the File(s)/Directory(s)!!!" 158 | fi 159 | } 160 | 161 | 162 | if [ ! 
-f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ ]; then 163 | 164 | echo "Lock file not found on this node! \ 165 | Creating it to prevent collisions..." 166 | touch """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 167 | CREATE_LOCK_FILE_EXIT_CODE=$? 168 | if [ "${CREATE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 169 | echo "Error creating the lock file. \ 170 | Check if the airflow user can create files under tmp directory. \ 171 | Exiting..." 172 | exit ${CREATE_LOCK_FILE_EXIT_CODE} 173 | fi 174 | 175 | echo "" 176 | echo "Running Cleanup Process..." 177 | 178 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type f -mtime \ 179 | +${MAX_LOG_AGE_IN_DAYS}" 180 | DELETE_STMT="${FIND_STATEMENT} -exec rm -f {} \;" 181 | 182 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 183 | CLEANUP_EXIT_CODE=$? 184 | 185 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type d -empty" 186 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 187 | 188 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 189 | CLEANUP_EXIT_CODE=$? 190 | 191 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/* -type d -empty" 192 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 193 | 194 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 195 | CLEANUP_EXIT_CODE=$? 196 | 197 | echo "Finished Running Cleanup Process" 198 | 199 | echo "Deleting lock file..." 200 | rm -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 201 | REMOVE_LOCK_FILE_EXIT_CODE=$? 202 | if [ "${REMOVE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 203 | echo "Error removing the lock file. Check file permissions. To re-run the DAG, ensure that the lock file has been deleted (""" + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """)." 204 | exit ${REMOVE_LOCK_FILE_EXIT_CODE} 205 | fi 206 | 207 | else 208 | echo "Another task is already deleting logs on this worker node. \ 209 | Skipping it!" 210 | echo "If you believe you're receiving this message in error, kindly check \ 211 | if """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ exists and delete it." 212 | exit 0 213 | fi 214 | 215 | """ 216 | 217 | for log_cleanup_id in range(1, NUMBER_OF_WORKERS + 1): 218 | 219 | for dir_id, directory in enumerate(DIRECTORIES_TO_DELETE): 220 | 221 | log_cleanup_op = BashOperator( 222 | task_id='log_cleanup_worker_num_' + 223 | str(log_cleanup_id) + '_dir_' + str(dir_id), 224 | bash_command=log_cleanup, 225 | params={ 226 | "directory": str(directory), 227 | "sleep_time": int(log_cleanup_id)*3}, 228 | dag=dag) 229 | 230 | log_cleanup_op.set_upstream(start) 231 | -------------------------------------------------------------------------------- /backup-configs/airflow-backup-configs.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically take 3 | backups of various Airflow configurations and files. 4 | 5 | airflow trigger_dag airflow-backup-configs 6 | 7 | """ 8 | from airflow.models import DAG, Variable 9 | from airflow.operators.python_operator import PythonOperator 10 | from airflow.configuration import conf 11 | from datetime import datetime, timedelta 12 | import os 13 | import airflow 14 | import logging 15 | import subprocess 16 | # airflow-backup-configs 17 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 18 | # How often to Run. 
@daily - Once a day at Midnight 19 | START_DATE = airflow.utils.dates.days_ago(1) 20 | # Who is listed as the owner of this DAG in the Airflow Web Server 21 | SCHEDULE_INTERVAL = "@daily" 22 | # List of email address to send email alerts to if this job fails 23 | DAG_OWNER_NAME = "operations" 24 | ALERT_EMAIL_ADDRESSES = [] 25 | # Format options: https://www.tutorialspoint.com/python/time_strftime.htm 26 | BACKUP_FOLDER_DATE_FORMAT = "%Y%m%d%H%M%S" 27 | BACKUP_HOME_DIRECTORY = Variable.get("airflow_backup_config__backup_home_directory", "/tmp/airflow_backups") 28 | BACKUPS_ENABLED = { 29 | "dag_directory": True, 30 | "log_directory": True, 31 | "airflow_cfg": True, 32 | "pip_packages": True 33 | } 34 | # How many backups to retain (not including the one that was just taken) 35 | BACKUP_RETENTION_COUNT = 7 36 | 37 | default_args = { 38 | 'owner': DAG_OWNER_NAME, 39 | 'email': ALERT_EMAIL_ADDRESSES, 40 | 'email_on_failure': True, 41 | 'email_on_retry': False, 42 | 'start_date': START_DATE, 43 | 'retries': 1, 44 | 'retry_delay': timedelta(minutes=1) 45 | } 46 | 47 | dag = DAG( 48 | DAG_ID, 49 | default_args=default_args, 50 | schedule_interval=SCHEDULE_INTERVAL, 51 | start_date=START_DATE, 52 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 53 | ) 54 | if hasattr(dag, 'doc_md'): 55 | dag.doc_md = __doc__ 56 | if hasattr(dag, 'catchup'): 57 | dag.catchup = False 58 | 59 | 60 | def print_configuration_fn(**context): 61 | logging.info("Executing print_configuration_fn") 62 | 63 | logging.info("Loading Configurations...") 64 | BACKUP_FOLDER_DATE = datetime.now().strftime(BACKUP_FOLDER_DATE_FORMAT) 65 | BACKUP_DIRECTORY = BACKUP_HOME_DIRECTORY + "/" + BACKUP_FOLDER_DATE + "/" 66 | 67 | logging.info("Configurations:") 68 | logging.info( 69 | "BACKUP_FOLDER_DATE_FORMAT: " + str(BACKUP_FOLDER_DATE_FORMAT) 70 | ) 71 | logging.info("BACKUP_FOLDER_DATE: " + str(BACKUP_FOLDER_DATE)) 72 | logging.info("BACKUP_HOME_DIRECTORY: " + str(BACKUP_HOME_DIRECTORY)) 73 | logging.info("BACKUP_DIRECTORY: " + str(BACKUP_DIRECTORY)) 74 | logging.info( 75 | "BACKUP_RETENTION_COUNT: " + str(BACKUP_RETENTION_COUNT) 76 | ) 77 | logging.info("") 78 | 79 | logging.info("Pushing to XCom for Downstream Processes") 80 | context["ti"].xcom_push( 81 | key="backup_configs.backup_home_directory", 82 | value=BACKUP_HOME_DIRECTORY 83 | ) 84 | context["ti"].xcom_push( 85 | key="backup_configs.backup_directory", 86 | value=BACKUP_DIRECTORY 87 | ) 88 | context["ti"].xcom_push( 89 | key="backup_configs.backup_retention_count", 90 | value=BACKUP_RETENTION_COUNT 91 | ) 92 | 93 | 94 | print_configuration_op = PythonOperator( 95 | task_id='print_configuration', 96 | python_callable=print_configuration_fn, 97 | provide_context=True, 98 | dag=dag) 99 | 100 | 101 | def execute_shell_cmd(cmd): 102 | logging.info("Executing Command: `" + cmd + "`") 103 | proc = subprocess.Popen(cmd, shell=True, universal_newlines=True) 104 | proc.communicate() 105 | exit_code = proc.returncode 106 | if exit_code != 0: 107 | exit(exit_code) 108 | 109 | 110 | def delete_old_backups_fn(**context): 111 | logging.info("Executing delete_old_backups_fn") 112 | 113 | logging.info("Loading Configurations...") 114 | BACKUP_HOME_DIRECTORY = context["ti"].xcom_pull( 115 | task_ids=print_configuration_op.task_id, 116 | key='backup_configs.backup_home_directory' 117 | ) 118 | BACKUP_RETENTION_COUNT = context["ti"].xcom_pull( 119 | task_ids=print_configuration_op.task_id, 120 | key='backup_configs.backup_retention_count' 121 | ) 122 | 123 | logging.info("Configurations:") 
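    # The prune below keeps the first BACKUP_RETENTION_COUNT + 1 entries of the
    # reversed folder listing (the retained backups plus the one just taken) and
    # deletes the rest.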
124 | logging.info("BACKUP_HOME_DIRECTORY: " + str(BACKUP_HOME_DIRECTORY)) 125 | logging.info("BACKUP_RETENTION_COUNT: " + str(BACKUP_RETENTION_COUNT)) 126 | logging.info("") 127 | 128 | if BACKUP_RETENTION_COUNT < 0: 129 | logging.info( 130 | "Retention is less then 0. Assuming to allow infinite backups. " 131 | "Skipping..." 132 | ) 133 | return 134 | 135 | backup_folders = [ 136 | os.path.join(BACKUP_HOME_DIRECTORY, f) 137 | for f in os.listdir(BACKUP_HOME_DIRECTORY) 138 | if os.path.isdir(os.path.join(BACKUP_HOME_DIRECTORY, f)) 139 | ] 140 | backup_folders.reverse() 141 | logging.info("backup_folders: " + str(backup_folders)) 142 | logging.info("") 143 | 144 | cnt = 0 145 | for backup_folder in backup_folders: 146 | logging.info( 147 | "cnt = " + str(cnt) + ", backup_folder = " + str(backup_folder) 148 | ) 149 | if cnt > BACKUP_RETENTION_COUNT: 150 | logging.info("Deleting Backup Folder: " + str(backup_folder)) 151 | execute_shell_cmd("rm -rf " + str(backup_folder)) 152 | cnt += 1 153 | 154 | 155 | delete_old_backups_op = PythonOperator( 156 | task_id='delete_old_backups', 157 | python_callable=delete_old_backups_fn, 158 | provide_context=True, 159 | dag=dag) 160 | 161 | 162 | def general_backup_fn(**context): 163 | logging.info("Executing general_backup_fn") 164 | 165 | logging.info("Loading Configurations...") 166 | PATH_TO_BACKUP = context["params"].get("path_to_backup") 167 | TARGET_DIRECTORY_NAME = context["params"].get("target_directory_name") 168 | BACKUP_DIRECTORY = context["ti"].xcom_pull( 169 | task_ids=print_configuration_op.task_id, 170 | key='backup_configs.backup_directory' 171 | ) 172 | 173 | logging.info("Configurations:") 174 | logging.info("PATH_TO_BACKUP: " + str(PATH_TO_BACKUP)) 175 | logging.info("TARGET_DIRECTORY_NAME: " + str(TARGET_DIRECTORY_NAME)) 176 | logging.info("BACKUP_DIRECTORY: " + str(BACKUP_DIRECTORY)) 177 | logging.info("") 178 | 179 | execute_shell_cmd("mkdir -p " + str(BACKUP_DIRECTORY)) 180 | 181 | execute_shell_cmd( 182 | "cp -r -n " + str(PATH_TO_BACKUP) + " " + str(BACKUP_DIRECTORY) + 183 | (TARGET_DIRECTORY_NAME if TARGET_DIRECTORY_NAME is not None else "") 184 | ) 185 | 186 | 187 | def pip_packages_backup_fn(**context): 188 | logging.info("Executing pip_packages_backup_fn") 189 | 190 | logging.info("Loading Configurations...") 191 | 192 | BACKUP_DIRECTORY = context["ti"].xcom_pull( 193 | task_ids=print_configuration_op.task_id, 194 | key='backup_configs.backup_directory' 195 | ) 196 | 197 | logging.info("Configurations:") 198 | logging.info("BACKUP_DIRECTORY: " + str(BACKUP_DIRECTORY)) 199 | logging.info("") 200 | if not os.path.exists(BACKUP_DIRECTORY): 201 | os.makedirs(BACKUP_DIRECTORY) 202 | execute_shell_cmd("pip freeze > " + BACKUP_DIRECTORY + "pip_freeze.out") 203 | 204 | 205 | if BACKUPS_ENABLED.get("dag_directory"): 206 | backup_op = PythonOperator( 207 | task_id='backup_dag_directory', 208 | python_callable=general_backup_fn, 209 | params={"path_to_backup": conf.get("core", "DAGS_FOLDER")}, 210 | provide_context=True, 211 | dag=dag) 212 | print_configuration_op.set_downstream(backup_op) 213 | backup_op.set_downstream(delete_old_backups_op) 214 | 215 | if BACKUPS_ENABLED.get("log_directory"): 216 | try: 217 | BASE_LOG_FOLDER = conf.get("core", "BASE_LOG_FOLDER") 218 | except Exception as e: 219 | BASE_LOG_FOLDER = conf.get("logging", "BASE_LOG_FOLDER") 220 | 221 | backup_op = PythonOperator( 222 | task_id='backup_log_directory', 223 | python_callable=general_backup_fn, 224 | params={ 225 | "path_to_backup": BASE_LOG_FOLDER, 226 | 
"target_directory_name": "logs" 227 | }, 228 | provide_context=True, 229 | dag=dag) 230 | print_configuration_op.set_downstream(backup_op) 231 | backup_op.set_downstream(delete_old_backups_op) 232 | 233 | if BACKUPS_ENABLED.get("airflow_cfg"): 234 | backup_op = PythonOperator( 235 | task_id='backup_airflow_cfg', 236 | python_callable=general_backup_fn, 237 | params={ 238 | "path_to_backup": (os.environ.get('AIRFLOW_HOME') if os.environ.get('AIRFLOW_HOME') is not None else "~/airflow/") + "/airflow.cfg" 239 | }, 240 | provide_context=True, 241 | dag=dag) 242 | print_configuration_op.set_downstream(backup_op) 243 | backup_op.set_downstream(delete_old_backups_op) 244 | 245 | if BACKUPS_ENABLED.get("pip_packages"): 246 | backup_op = PythonOperator( 247 | task_id='backup_pip_packages', 248 | python_callable=pip_packages_backup_fn, 249 | provide_context=True, 250 | dag=dag) 251 | print_configuration_op.set_downstream(backup_op) 252 | backup_op.set_downstream(delete_old_backups_op) 253 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /db-cleanup/airflow-db-cleanup.py: -------------------------------------------------------------------------------- 1 | """ 2 | A maintenance workflow that you can deploy into Airflow to periodically clean 3 | out the DagRun, TaskInstance, Log, XCom, Job DB and SlaMiss entries to avoid 4 | having too much data in your Airflow MetaStore. 
5 | 
6 | airflow trigger_dag --conf '[curly-braces]"maxDBEntryAgeInDays":30[curly-braces]' airflow-db-cleanup
7 | 
8 | --conf options:
9 | maxDBEntryAgeInDays:<INT> - Optional
10 | 
11 | """
12 | import airflow
13 | from airflow import settings
14 | from airflow.configuration import conf
15 | from airflow.models import DAG, DagTag, DagModel, DagRun, Log, XCom, SlaMiss, TaskInstance, Variable
16 | try:
17 | from airflow.jobs import BaseJob
18 | except Exception as e:
19 | from airflow.jobs.base_job import BaseJob
20 | from airflow.operators.python_operator import PythonOperator
21 | from datetime import datetime, timedelta
22 | import dateutil.parser
23 | import logging
24 | import os
25 | from sqlalchemy import func, and_
26 | from sqlalchemy.exc import ProgrammingError
27 | from sqlalchemy.orm import load_only
28 | 
29 | try:
30 | # airflow.utils.timezone is available from v1.10 onwards
31 | from airflow.utils import timezone
32 | now = timezone.utcnow
33 | except ImportError:
34 | now = datetime.utcnow
35 | 
36 | # airflow-db-cleanup
37 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "")
38 | START_DATE = airflow.utils.dates.days_ago(1)
39 | # How often to Run. @daily - Once a day at Midnight (UTC)
40 | SCHEDULE_INTERVAL = "@daily"
41 | # Who is listed as the owner of this DAG in the Airflow Web Server
42 | DAG_OWNER_NAME = "operations"
43 | # List of email addresses to send email alerts to if this job fails
44 | ALERT_EMAIL_ADDRESSES = []
45 | # Number of days of DB entries to retain if not already provided in the conf. If this
46 | # is set to 30, the job will remove DB entries that are 30 days old or older.
47 | 
48 | DEFAULT_MAX_DB_ENTRY_AGE_IN_DAYS = int(
49 | Variable.get("airflow_db_cleanup__max_db_entry_age_in_days", 30)
50 | )
51 | # Prints the database entries that will be deleted; set to False to avoid printing large lists and slowing down the process
52 | PRINT_DELETES = True
53 | # Whether the job should delete the db entries or not. Included if you want to
54 | # temporarily avoid deleting the db entries.
55 | ENABLE_DELETE = True
56 | 
57 | # get dag model last scheduler run
58 | try:
59 | dag_model_last_scheduler_run = DagModel.last_scheduler_run
60 | except AttributeError:
61 | dag_model_last_scheduler_run = DagModel.last_parsed_time
62 | 
63 | # List of all the objects that will be deleted. Comment out the DB objects you
64 | # want to skip.
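# Each entry in DATABASE_OBJECTS is a plain dict consumed by cleanup_function():
#   "airflow_db_model"   - the SQLAlchemy model class whose old rows get purged
#   "age_check_column"   - the datetime column compared against the max_date cutoff
#   "keep_last"          - if True, the most recent DagRun (max execution_date) per group is never deleted
#   "keep_last_filters"  - extra filters applied when selecting the runs to keep (e.g. exclude externally triggered runs)
#   "keep_last_group_by" - column the "keep last" subquery groups by (e.g. DagRun.dag_id)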
65 | DATABASE_OBJECTS = [ 66 | { 67 | "airflow_db_model": BaseJob, 68 | "age_check_column": BaseJob.latest_heartbeat, 69 | "keep_last": False, 70 | "keep_last_filters": None, 71 | "keep_last_group_by": None 72 | }, 73 | { 74 | "airflow_db_model": DagRun, 75 | "age_check_column": DagRun.execution_date, 76 | "keep_last": True, 77 | "keep_last_filters": [DagRun.external_trigger.is_(False)], 78 | "keep_last_group_by": DagRun.dag_id 79 | }, 80 | { 81 | "airflow_db_model": TaskInstance, 82 | "age_check_column": TaskInstance.execution_date, 83 | "keep_last": False, 84 | "keep_last_filters": None, 85 | "keep_last_group_by": None 86 | }, 87 | { 88 | "airflow_db_model": Log, 89 | "age_check_column": Log.dttm, 90 | "keep_last": False, 91 | "keep_last_filters": None, 92 | "keep_last_group_by": None 93 | }, 94 | { 95 | "airflow_db_model": XCom, 96 | "age_check_column": XCom.execution_date, 97 | "keep_last": False, 98 | "keep_last_filters": None, 99 | "keep_last_group_by": None 100 | }, 101 | { 102 | "airflow_db_model": SlaMiss, 103 | "age_check_column": SlaMiss.execution_date, 104 | "keep_last": False, 105 | "keep_last_filters": None, 106 | "keep_last_group_by": None 107 | }, 108 | { 109 | "airflow_db_model": DagModel, 110 | "age_check_column": dag_model_last_scheduler_run, 111 | "keep_last": False, 112 | "keep_last_filters": None, 113 | "keep_last_group_by": None 114 | }] 115 | 116 | # Check for TaskReschedule model 117 | try: 118 | from airflow.models import TaskReschedule 119 | DATABASE_OBJECTS.append({ 120 | "airflow_db_model": TaskReschedule, 121 | "age_check_column": TaskReschedule.execution_date, 122 | "keep_last": False, 123 | "keep_last_filters": None, 124 | "keep_last_group_by": None 125 | }) 126 | 127 | except Exception as e: 128 | logging.error(e) 129 | 130 | # Check for TaskFail model 131 | try: 132 | from airflow.models import TaskFail 133 | DATABASE_OBJECTS.append({ 134 | "airflow_db_model": TaskFail, 135 | "age_check_column": TaskFail.execution_date, 136 | "keep_last": False, 137 | "keep_last_filters": None, 138 | "keep_last_group_by": None 139 | }) 140 | 141 | except Exception as e: 142 | logging.error(e) 143 | 144 | # Check for RenderedTaskInstanceFields model 145 | try: 146 | from airflow.models import RenderedTaskInstanceFields 147 | DATABASE_OBJECTS.append({ 148 | "airflow_db_model": RenderedTaskInstanceFields, 149 | "age_check_column": RenderedTaskInstanceFields.execution_date, 150 | "keep_last": False, 151 | "keep_last_filters": None, 152 | "keep_last_group_by": None 153 | }) 154 | 155 | except Exception as e: 156 | logging.error(e) 157 | 158 | # Check for ImportError model 159 | try: 160 | from airflow.models import ImportError 161 | DATABASE_OBJECTS.append({ 162 | "airflow_db_model": ImportError, 163 | "age_check_column": ImportError.timestamp, 164 | "keep_last": False, 165 | "keep_last_filters": None, 166 | "keep_last_group_by": None 167 | }) 168 | 169 | except Exception as e: 170 | logging.error(e) 171 | 172 | # Check for celery executor 173 | airflow_executor = str(conf.get("core", "executor")) 174 | logging.info("Airflow Executor: " + str(airflow_executor)) 175 | if(airflow_executor == "CeleryExecutor"): 176 | logging.info("Including Celery Modules") 177 | try: 178 | from celery.backends.database.models import Task, TaskSet 179 | DATABASE_OBJECTS.extend(( 180 | { 181 | "airflow_db_model": Task, 182 | "age_check_column": Task.date_done, 183 | "keep_last": False, 184 | "keep_last_filters": None, 185 | "keep_last_group_by": None 186 | }, 187 | { 188 | "airflow_db_model": 
TaskSet, 189 | "age_check_column": TaskSet.date_done, 190 | "keep_last": False, 191 | "keep_last_filters": None, 192 | "keep_last_group_by": None 193 | })) 194 | 195 | except Exception as e: 196 | logging.error(e) 197 | 198 | session = settings.Session() 199 | 200 | default_args = { 201 | 'owner': DAG_OWNER_NAME, 202 | 'depends_on_past': False, 203 | 'email': ALERT_EMAIL_ADDRESSES, 204 | 'email_on_failure': True, 205 | 'email_on_retry': False, 206 | 'start_date': START_DATE, 207 | 'retries': 1, 208 | 'retry_delay': timedelta(minutes=1) 209 | } 210 | 211 | dag = DAG( 212 | DAG_ID, 213 | default_args=default_args, 214 | schedule_interval=SCHEDULE_INTERVAL, 215 | start_date=START_DATE, 216 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 217 | ) 218 | if hasattr(dag, 'doc_md'): 219 | dag.doc_md = __doc__ 220 | if hasattr(dag, 'catchup'): 221 | dag.catchup = False 222 | 223 | 224 | def print_configuration_function(**context): 225 | logging.info("Loading Configurations...") 226 | dag_run_conf = context.get("dag_run").conf 227 | logging.info("dag_run.conf: " + str(dag_run_conf)) 228 | max_db_entry_age_in_days = None 229 | if dag_run_conf: 230 | max_db_entry_age_in_days = dag_run_conf.get( 231 | "maxDBEntryAgeInDays", None 232 | ) 233 | logging.info("maxDBEntryAgeInDays from dag_run.conf: " + str(dag_run_conf)) 234 | if (max_db_entry_age_in_days is None or max_db_entry_age_in_days < 1): 235 | logging.info( 236 | "maxDBEntryAgeInDays conf variable isn't included or Variable " + 237 | "value is less than 1. Using Default '" + 238 | str(DEFAULT_MAX_DB_ENTRY_AGE_IN_DAYS) + "'" 239 | ) 240 | max_db_entry_age_in_days = DEFAULT_MAX_DB_ENTRY_AGE_IN_DAYS 241 | max_date = now() + timedelta(-max_db_entry_age_in_days) 242 | logging.info("Finished Loading Configurations") 243 | logging.info("") 244 | 245 | logging.info("Configurations:") 246 | logging.info("max_db_entry_age_in_days: " + str(max_db_entry_age_in_days)) 247 | logging.info("max_date: " + str(max_date)) 248 | logging.info("enable_delete: " + str(ENABLE_DELETE)) 249 | logging.info("session: " + str(session)) 250 | logging.info("") 251 | 252 | logging.info("Setting max_execution_date to XCom for Downstream Processes") 253 | context["ti"].xcom_push(key="max_date", value=max_date.isoformat()) 254 | 255 | 256 | print_configuration = PythonOperator( 257 | task_id='print_configuration', 258 | python_callable=print_configuration_function, 259 | provide_context=True, 260 | dag=dag) 261 | 262 | 263 | def cleanup_function(**context): 264 | 265 | logging.info("Retrieving max_execution_date from XCom") 266 | max_date = context["ti"].xcom_pull( 267 | task_ids=print_configuration.task_id, key="max_date" 268 | ) 269 | max_date = dateutil.parser.parse(max_date) # stored as iso8601 str in xcom 270 | 271 | airflow_db_model = context["params"].get("airflow_db_model") 272 | state = context["params"].get("state") 273 | age_check_column = context["params"].get("age_check_column") 274 | keep_last = context["params"].get("keep_last") 275 | keep_last_filters = context["params"].get("keep_last_filters") 276 | keep_last_group_by = context["params"].get("keep_last_group_by") 277 | 278 | logging.info("Configurations:") 279 | logging.info("max_date: " + str(max_date)) 280 | logging.info("enable_delete: " + str(ENABLE_DELETE)) 281 | logging.info("session: " + str(session)) 282 | logging.info("airflow_db_model: " + str(airflow_db_model)) 283 | logging.info("state: " + str(state)) 284 | logging.info("age_check_column: " + str(age_check_column)) 285 | logging.info("keep_last: 
" + str(keep_last)) 286 | logging.info("keep_last_filters: " + str(keep_last_filters)) 287 | logging.info("keep_last_group_by: " + str(keep_last_group_by)) 288 | 289 | logging.info("") 290 | 291 | logging.info("Running Cleanup Process...") 292 | 293 | try: 294 | query = session.query(airflow_db_model).options( 295 | load_only(age_check_column) 296 | ) 297 | 298 | logging.info("INITIAL QUERY : " + str(query)) 299 | 300 | if keep_last: 301 | 302 | subquery = session.query(func.max(DagRun.execution_date)) 303 | # workaround for MySQL "table specified twice" issue 304 | # https://github.com/teamclairvoyant/airflow-maintenance-dags/issues/41 305 | if keep_last_filters is not None: 306 | for entry in keep_last_filters: 307 | subquery = subquery.filter(entry) 308 | 309 | logging.info("SUB QUERY [keep_last_filters]: " + str(subquery)) 310 | 311 | if keep_last_group_by is not None: 312 | subquery = subquery.group_by(keep_last_group_by) 313 | logging.info( 314 | "SUB QUERY [keep_last_group_by]: " + str(subquery)) 315 | 316 | subquery = subquery.from_self() 317 | 318 | query = query.filter( 319 | and_(age_check_column.notin_(subquery)), 320 | and_(age_check_column <= max_date) 321 | ) 322 | 323 | else: 324 | query = query.filter(age_check_column <= max_date,) 325 | 326 | if PRINT_DELETES: 327 | entries_to_delete = query.all() 328 | 329 | logging.info("Query: " + str(query)) 330 | logging.info( 331 | "Process will be Deleting the following " + 332 | str(airflow_db_model.__name__) + "(s):" 333 | ) 334 | for entry in entries_to_delete: 335 | logging.info( 336 | "\tEntry: " + str(entry) + ", Date: " + 337 | str(entry.__dict__[str(age_check_column).split(".")[1]]) 338 | ) 339 | 340 | logging.info( 341 | "Process will be Deleting " + str(len(entries_to_delete)) + " " + 342 | str(airflow_db_model.__name__) + "(s)" 343 | ) 344 | else: 345 | logging.warn( 346 | "You've opted to skip printing the db entries to be deleted. Set PRINT_DELETES to True to show entries!!!") 347 | 348 | if ENABLE_DELETE: 349 | logging.info('Performing Delete...') 350 | if airflow_db_model.__name__ == 'DagModel': 351 | logging.info('Deleting tags...') 352 | ids_query = query.from_self().with_entities(DagModel.dag_id) 353 | tags_query = session.query(DagTag).filter(DagTag.dag_id.in_(ids_query)) 354 | logging.info('Tags delete Query: ' + str(tags_query)) 355 | tags_query.delete(synchronize_session=False) 356 | # using bulk delete 357 | query.delete(synchronize_session=False) 358 | session.commit() 359 | logging.info('Finished Performing Delete') 360 | else: 361 | logging.warn( 362 | "You've opted to skip deleting the db entries. Set ENABLE_DELETE to True to delete entries!!!") 363 | 364 | logging.info("Finished Running Cleanup Process") 365 | 366 | except ProgrammingError as e: 367 | logging.error(e) 368 | logging.error(str(airflow_db_model) + 369 | " is not present in the metadata. 
Skipping...")
370 | 
371 | 
372 | for db_object in DATABASE_OBJECTS:
373 | 
374 | cleanup_op = PythonOperator(
375 | task_id='cleanup_' + str(db_object["airflow_db_model"].__name__),
376 | python_callable=cleanup_function,
377 | params=db_object,
378 | provide_context=True,
379 | dag=dag
380 | )
381 | 
382 | print_configuration.set_downstream(cleanup_op)
383 | 
--------------------------------------------------------------------------------
/kill-halted-tasks/airflow-kill-halted-tasks.py:
--------------------------------------------------------------------------------
1 | """
2 | A maintenance workflow that you can deploy into Airflow to periodically kill
3 | off tasks that are running in the background that don't correspond to a running
4 | task in the DB.
5 | 
6 | This is useful because when you kill off a DAG Run or Task through the Airflow
7 | Web Server, the task still runs in the background on one of the executors until
8 | the task is complete.
9 | 
10 | airflow trigger_dag airflow-kill-halted-tasks
11 | 
12 | """
13 | from airflow.models import DAG, DagModel, DagRun, TaskInstance
14 | from airflow import settings
15 | from airflow.operators.python_operator \
16 | import PythonOperator, ShortCircuitOperator
17 | from airflow.operators.email_operator import EmailOperator
18 | from sqlalchemy import and_
19 | from datetime import datetime, timedelta
20 | import os
21 | import re
22 | import logging
23 | import pytz
24 | import airflow
25 | 
26 | 
27 | # airflow-kill-halted-tasks
28 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "")
29 | START_DATE = airflow.utils.dates.days_ago(1)
30 | # How often to Run. @daily - Once a day at Midnight. @hourly - Once an Hour.
31 | SCHEDULE_INTERVAL = "@hourly"
32 | # Who is listed as the owner of this DAG in the Airflow Web Server
33 | DAG_OWNER_NAME = "operations"
34 | # List of email addresses to send email alerts to if this job fails
35 | ALERT_EMAIL_ADDRESSES = []
36 | # Whether to send out an email whenever a process was killed during a DAG Run
37 | # or not
38 | SEND_PROCESS_KILLED_EMAIL = True
39 | # Subject of the email that is sent out when a task is killed by the DAG
40 | PROCESS_KILLED_EMAIL_SUBJECT = DAG_ID + " - Tasks were Killed"
41 | # List of email addresses to send emails to when a task is killed by the DAG
42 | PROCESS_KILLED_EMAIL_ADDRESSES = []
43 | # Whether the job should kill the halted processes or not. Included if you want to
44 | # temporarily avoid killing the processes it finds.
45 | ENABLE_KILL = True 46 | # Whether to print out certain statements meant for debugging 47 | DEBUG = False 48 | 49 | default_args = { 50 | 'owner': DAG_OWNER_NAME, 51 | 'email': ALERT_EMAIL_ADDRESSES, 52 | 'email_on_failure': True, 53 | 'email_on_retry': False, 54 | 'start_date': START_DATE, 55 | 'retries': 1, 56 | 'retry_delay': timedelta(minutes=1) 57 | } 58 | 59 | dag = DAG( 60 | DAG_ID, 61 | default_args=default_args, 62 | schedule_interval=SCHEDULE_INTERVAL, 63 | start_date=START_DATE, 64 | tags=['teamclairvoyant', 'airflow-maintenance-dags'] 65 | ) 66 | if hasattr(dag, 'doc_md'): 67 | dag.doc_md = __doc__ 68 | if hasattr(dag, 'catchup'): 69 | dag.catchup = False 70 | 71 | 72 | uid_regex = "(\w+)" 73 | pid_regex = "(\w+)" 74 | ppid_regex = "(\w+)" 75 | processor_scheduling_regex = "(\w+)" 76 | start_time_regex = "([\w:.]+)" 77 | tty_regex = "([\w?/]+)" 78 | cpu_time_regex = "([\w:.]+)" 79 | command_regex = "(.+)" 80 | 81 | # When Search Command is: ps -o pid -o cmd -u `whoami` | grep 'airflow run' 82 | full_regex = '\s*' + pid_regex + '\s+' + command_regex 83 | 84 | airflow_run_regex = '.*run\s([\w._-]*)\s([\w._-]*)\s([\w:.-]*).*' 85 | 86 | 87 | def parse_process_linux_string(line): 88 | if DEBUG: 89 | logging.info("DEBUG: Processing Line: " + str(line)) 90 | full_regex_match = re.search(full_regex, line) 91 | if DEBUG: 92 | for index in range(0, (len(full_regex_match.groups()) + 1)): 93 | group = full_regex_match.group(index) 94 | logging.info( 95 | "DEBUG: index: " + str(index) + ", group: " + str(group) 96 | ) 97 | pid = full_regex_match.group(1) 98 | command = full_regex_match.group(2).strip() 99 | process = {"pid": pid, "command": command} 100 | 101 | if DEBUG: 102 | logging.info("DEBUG: Processing Command: " + str(command)) 103 | airflow_run_regex_match = re.search(airflow_run_regex, command) 104 | if DEBUG: 105 | for index in range(0, (len(airflow_run_regex_match.groups()) + 1)): 106 | group = airflow_run_regex_match.group(index) 107 | logging.info( 108 | "DEBUG: index: " + str(index) + ", group: " + str(group) 109 | ) 110 | process["airflow_dag_id"] = airflow_run_regex_match.group(1) 111 | process["airflow_task_id"] = airflow_run_regex_match.group(2) 112 | process["airflow_execution_date"] = airflow_run_regex_match.group(3) 113 | return process 114 | 115 | 116 | def kill_halted_tasks_function(**context): 117 | logging.info("Getting Configurations...") 118 | airflow_version = airflow.__version__ 119 | session = settings.Session() 120 | 121 | logging.info("Finished Getting Configurations\n") 122 | 123 | logging.info("Configurations:") 124 | logging.info( 125 | "send_process_killed_email: " + str(SEND_PROCESS_KILLED_EMAIL) 126 | ) 127 | logging.info( 128 | "process_killed_email_subject: " + str(PROCESS_KILLED_EMAIL_SUBJECT) 129 | ) 130 | logging.info( 131 | "process_killed_email_addresses: " + 132 | str(PROCESS_KILLED_EMAIL_ADDRESSES) 133 | ) 134 | logging.info("enable_kill: " + str(ENABLE_KILL)) 135 | logging.info("debug: " + str(DEBUG)) 136 | logging.info("session: " + str(session)) 137 | logging.info("airflow_version: " + str(airflow_version)) 138 | logging.info("") 139 | 140 | logging.info("Running Cleanup Process...") 141 | logging.info("") 142 | 143 | process_search_command = ( 144 | "ps -o pid -o cmd -u `whoami` | grep 'airflow run'" 145 | ) 146 | logging.info("Running Search Process: " + process_search_command) 147 | search_output = os.popen(process_search_command).read() 148 | logging.info("Search Process Output: ") 149 | logging.info(search_output) 150 | 151 | 
logging.info( 152 | "Filtering out: Empty Lines, Grep processes, and this DAGs Run." 153 | ) 154 | search_output_filtered = [ 155 | line for line in search_output.split("\n") if line is not None 156 | and line.strip() != "" and ' grep ' not in line 157 | and DAG_ID not in line 158 | ] 159 | logging.info("Search Process Output (with Filter): ") 160 | for line in search_output_filtered: 161 | logging.info(line) 162 | logging.info("") 163 | 164 | logging.info("Searching through running processes...") 165 | airflow_timezone_not_required_versions = ['1.7', '1.8', '1.9'] 166 | processes_to_kill = [] 167 | for line in search_output_filtered: 168 | logging.info("") 169 | process = parse_process_linux_string(line=line) 170 | 171 | logging.info("Checking: " + str(process)) 172 | exec_date_str = (process["airflow_execution_date"]).replace("T", " ") 173 | if '.' not in exec_date_str: 174 | # Add milliseconds if they are missing. 175 | exec_date_str = exec_date_str + '.0' 176 | execution_date_to_search_for = datetime.strptime( 177 | exec_date_str, '%Y-%m-%d %H:%M:%S.%f' 178 | ) 179 | # apache-airflow version >= 1.10 requires datetime field values with 180 | # timezone 181 | if airflow_version[:3] not in airflow_timezone_not_required_versions: 182 | execution_date_to_search_for = pytz.utc.localize( 183 | execution_date_to_search_for 184 | ) 185 | 186 | logging.info( 187 | "Execution Date to Search For: " + 188 | str(execution_date_to_search_for) 189 | ) 190 | 191 | # Checking to make sure the DAG is available and active 192 | if DEBUG: 193 | logging.info("DEBUG: Listing All DagModels: ") 194 | for dag in session.query(DagModel).all(): 195 | logging.info( 196 | "DEBUG: dag: " + str(dag) + ", dag.is_active: " + 197 | str(dag.is_active) 198 | ) 199 | logging.info("") 200 | logging.info( 201 | "Getting dag where DagModel.dag_id == '" + 202 | str(process["airflow_dag_id"]) + "'" 203 | ) 204 | dag = session.query(DagModel).filter( 205 | DagModel.dag_id == process["airflow_dag_id"] 206 | ).first() 207 | logging.info("dag: " + str(dag)) 208 | if dag is None: 209 | kill_reason = "DAG was not found in metastore." 210 | process["kill_reason"] = kill_reason 211 | processes_to_kill.append(process) 212 | logging.warn(kill_reason) 213 | logging.warn("Marking process to be killed.") 214 | continue 215 | logging.info("dag.is_active: " + str(dag.is_active)) 216 | if not dag.is_active: # is the dag active? 217 | kill_reason = "DAG was found to be Disabled." 
218 | process["kill_reason"] = kill_reason 219 | processes_to_kill.append(process) 220 | logging.warn(kill_reason) 221 | logging.warn("Marking process to be killed.") 222 | continue 223 | 224 | # Checking to make sure the DagRun is available and in a running state 225 | if DEBUG: 226 | dag_run_relevant_states = ["queued", "running", "up_for_retry"] 227 | logging.info( 228 | "DEBUG: Listing All Relevant DAG Runs (With State: " + 229 | str(dag_run_relevant_states) + "): " 230 | ) 231 | for dag_run in session.query(DagRun).filter( 232 | DagRun.state.in_(dag_run_relevant_states) 233 | ).all(): 234 | logging.info( 235 | "DEBUG: dag_run: " + str(dag_run) + ", dag_run.state: " + 236 | str(dag_run.state) 237 | ) 238 | logging.info("") 239 | logging.info( 240 | "Getting dag_run where DagRun.dag_id == '" + 241 | str(process["airflow_dag_id"]) + 242 | "' AND DagRun.execution_date == '" + 243 | str(execution_date_to_search_for) + "'" 244 | ) 245 | 246 | dag_run = session.query(DagRun).filter( 247 | and_( 248 | DagRun.dag_id == process["airflow_dag_id"], 249 | DagRun.execution_date == execution_date_to_search_for, 250 | ) 251 | ).first() 252 | 253 | logging.info("dag_run: " + str(dag_run)) 254 | if dag_run is None: 255 | kill_reason = "DAG RUN was not found in metastore." 256 | process["kill_reason"] = kill_reason 257 | processes_to_kill.append(process) 258 | logging.warn(kill_reason) 259 | logging.warn("Marking process to be killed.") 260 | continue 261 | logging.info("dag_run.state: " + str(dag_run.state)) 262 | dag_run_states_required = ["running"] 263 | # is the dag_run in a running state? 264 | if dag_run.state not in dag_run_states_required: 265 | kill_reason = ( 266 | "DAG RUN was found to not be in the states '" + 267 | str(dag_run_states_required) + 268 | "', but rather was in the state '" + str(dag_run.state) + "'." 269 | ) 270 | process["kill_reason"] = kill_reason 271 | processes_to_kill.append(process) 272 | logging.warn(kill_reason) 273 | logging.warn("Marking process to be killed.") 274 | continue 275 | 276 | # Checking to ensure TaskInstance is available and in a running state 277 | if DEBUG: 278 | task_instance_relevant_states = [ 279 | "queued", "running", "up_for_retry" 280 | ] 281 | logging.info( 282 | "DEBUG: Listing All Relevant TaskInstances (With State: " + 283 | str(task_instance_relevant_states) + "): " 284 | ) 285 | for task_instance in session.query(TaskInstance).filter( 286 | TaskInstance.state.in_(task_instance_relevant_states) 287 | ).all(): 288 | logging.info( 289 | "DEBUG: task_instance: " + str(task_instance) + 290 | ", task_instance.state: " + str(task_instance.state) 291 | ) 292 | logging.info("") 293 | logging.info( 294 | "Getting task_instance where TaskInstance.dag_id == '" + 295 | str(process["airflow_dag_id"]) + 296 | "' AND TaskInstance.task_id == '" + 297 | str(process["airflow_task_id"]) + 298 | "' AND TaskInstance.execution_date == '" + 299 | str(execution_date_to_search_for) + "'" 300 | ) 301 | 302 | task_instance = session.query(TaskInstance).filter( 303 | and_( 304 | TaskInstance.dag_id == process["airflow_dag_id"], 305 | TaskInstance.task_id == process["airflow_task_id"], 306 | TaskInstance.execution_date == execution_date_to_search_for, 307 | ) 308 | ).first() 309 | 310 | logging.info("task_instance: " + str(task_instance)) 311 | if task_instance is None: 312 | kill_reason = ( 313 | "Task Instance was not found in metastore. Marking process " 314 | "to be killed." 
315 | ) 316 | process["kill_reason"] = kill_reason 317 | processes_to_kill.append(process) 318 | logging.warn(kill_reason) 319 | logging.warn("Marking process to be killed.") 320 | continue 321 | logging.info("task_instance.state: " + str(task_instance.state)) 322 | task_instance_states_required = ["queued", "running", "up_for_retry"] 323 | # is task_instance queued, running or up for retry? 324 | if task_instance.state not in task_instance_states_required: 325 | kill_reason = ( 326 | "The TaskInstance was found to not be in the states '" + 327 | str(task_instance_states_required) + 328 | "', but rather was in the state '" + 329 | str(task_instance.state) + "'." 330 | ) 331 | process["kill_reason"] = kill_reason 332 | processes_to_kill.append(process) 333 | logging.warn(kill_reason) 334 | logging.warn("Marking process to be killed.") 335 | continue 336 | 337 | # Listing processes that will be killed 338 | logging.info("") 339 | logging.info("Processes Marked to Kill: ") 340 | if len(processes_to_kill) > 0: 341 | for process in processes_to_kill: 342 | logging.info(str(process)) 343 | else: 344 | logging.info("No Processes Marked to Kill Found") 345 | 346 | # Killing the processes 347 | logging.info("") 348 | if ENABLE_KILL: 349 | logging.info("Performing Kill...") 350 | if len(processes_to_kill) > 0: 351 | for process in processes_to_kill: 352 | logging.info("Killing Process: " + str(process)) 353 | kill_command = "kill -9 " + str(process["pid"]) 354 | logging.info("Running Command: " + str(kill_command)) 355 | output = os.popen(kill_command).read() 356 | logging.info("kill output: " + str(output)) 357 | context['ti'].xcom_push( 358 | key='kill_halted_tasks.processes_to_kill', 359 | value=processes_to_kill 360 | ) 361 | logging.info("Finished Performing Kill") 362 | else: 363 | logging.info("No Processes Marked to Kill Found") 364 | else: 365 | logging.warn("You're opted to skip killing the processes!!!") 366 | 367 | logging.info("") 368 | logging.info("Finished Running Cleanup Process") 369 | 370 | 371 | kill_halted_tasks_op = PythonOperator( 372 | task_id='kill_halted_tasks', 373 | python_callable=kill_halted_tasks_function, 374 | provide_context=True, 375 | dag=dag) 376 | 377 | 378 | def branch_function(**context): 379 | logging.info( 380 | "Deciding whether to send an email about tasks that were killed by " + 381 | "this DAG..." 
382 | ) 383 | logging.info( 384 | "SEND_PROCESS_KILLED_EMAIL: '" + 385 | str(SEND_PROCESS_KILLED_EMAIL) + "'" 386 | ) 387 | logging.info( 388 | "PROCESS_KILLED_EMAIL_ADDRESSES: " + 389 | str(PROCESS_KILLED_EMAIL_ADDRESSES) 390 | ) 391 | logging.info("ENABLE_KILL: " + str(ENABLE_KILL)) 392 | 393 | if not SEND_PROCESS_KILLED_EMAIL: 394 | logging.info( 395 | "Skipping sending an email since SEND_PROCESS_KILLED_EMAIL is " + 396 | "set to false" 397 | ) 398 | # False = short circuit the dag and don't execute downstream tasks 399 | return False 400 | if len(PROCESS_KILLED_EMAIL_ADDRESSES) == 0: 401 | logging.info( 402 | "Skipping sending an email since PROCESS_KILLED_EMAIL_ADDRESSES " + 403 | "is empty" 404 | ) 405 | # False = short circuit the dag and don't execute downstream tasks 406 | return False 407 | 408 | processes_to_kill = context['ti'].xcom_pull( 409 | task_ids=kill_halted_tasks_op.task_id, 410 | key='kill_halted_tasks.processes_to_kill' 411 | ) 412 | logging.info("processes_to_kill from xcom_pull: " + str(processes_to_kill)) 413 | if processes_to_kill is not None and len(processes_to_kill) > 0: 414 | logging.info("There were processes to kill") 415 | if ENABLE_KILL: 416 | logging.info("enable_kill is set to true") 417 | logging.info( 418 | "Opting to send an email to alert the users that processes " + 419 | "were killed" 420 | ) 421 | # True = don't short circuit the dag and execute downstream tasks 422 | return True 423 | else: 424 | logging.info("enable_kill is set to False") 425 | else: 426 | logging.info("Processes to kill list was either None or Empty") 427 | 428 | logging.info( 429 | "Opting to skip sending an email since no processes were killed" 430 | ) 431 | # False = short circuit the dag and don't execute downstream tasks 432 | return False 433 | 434 | 435 | email_or_not_branch = ShortCircuitOperator( 436 | task_id="email_or_not_branch", 437 | python_callable=branch_function, 438 | provide_context=True, 439 | dag=dag) 440 | 441 | 442 | send_processes_killed_email = EmailOperator( 443 | task_id="send_processes_killed_email", 444 | to=PROCESS_KILLED_EMAIL_ADDRESSES, 445 | subject=PROCESS_KILLED_EMAIL_SUBJECT, 446 | html_content=""" 447 | 448 | 449 |
<html>
<body>
<p>This is not a failure alert!</p>

<h3>Dag Run Information</h3>
<table>
  <tr><td>ID:</td><td>{{ dag_run.id }}</td></tr>
  <tr><td>DAG ID:</td><td>{{ dag_run.dag_id }}</td></tr>
  <tr><td>Execution Date:</td><td>{{ dag_run.execution_date }}</td></tr>
  <tr><td>Start Date:</td><td>{{ dag_run.start_date }}</td></tr>
  <tr><td>End Date:</td><td>{{ dag_run.end_date }}</td></tr>
  <tr><td>Run ID:</td><td>{{ dag_run.run_id }}</td></tr>
  <tr><td>External Trigger:</td><td>{{ dag_run.external_trigger }}</td></tr>
</table>

<h3>Task Instance Information</h3>
<table>
  <tr><td>Task ID:</td><td>{{ task_instance.task_id }}</td></tr>
  <tr><td>Execution Date:</td><td>{{ task_instance.execution_date }}</td></tr>
  <tr><td>Start Date:</td><td>{{ task_instance.start_date }}</td></tr>
  <tr><td>End Date:</td><td>{{ task_instance.end_date }}</td></tr>
  <tr><td>Host Name:</td><td>{{ task_instance.hostname }}</td></tr>
  <tr><td>Unix Name:</td><td>{{ task_instance.unixname }}</td></tr>
  <tr><td>Job ID:</td><td>{{ task_instance.job_id }}</td></tr>
  <tr><td>Queued Date Time:</td><td>{{ task_instance.queued_dttm }}</td></tr>
  <tr><td>Log URL:</td><td><a href="{{ task_instance.log_url }}">{{ task_instance.log_url }}</a></td></tr>
</table>

<h3>Processes Killed</h3>
<ul>
{% for process_killed in task_instance.xcom_pull(
    task_ids='kill_halted_tasks',
    key='kill_halted_tasks.processes_to_kill'
) %}
  <li>Process {{loop.index}}
    <ul>
    {% for key, value in process_killed.iteritems() %}
      <li>{{ key }}: {{ value }}</li>
    {% endfor %}
    </ul>
  </li>
{% endfor %}
</ul>
</body>
</html>
512 | 513 | 514 | """, 515 | dag=dag) 516 | 517 | 518 | kill_halted_tasks_op.set_downstream(email_or_not_branch) 519 | email_or_not_branch.set_downstream(send_processes_killed_email) 520 | -------------------------------------------------------------------------------- /sla-miss-report/airflow-sla-miss-report.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import json 4 | import os 5 | 6 | import airflow 7 | from airflow import settings 8 | from airflow.models import DAG, DagRun, TaskInstance 9 | from airflow.models.serialized_dag import SerializedDagModel 10 | from airflow.operators.python import PythonOperator 11 | from airflow.utils.email import send_email 12 | from datetime import date, datetime, timedelta 13 | 14 | ################################ 15 | # CONFIGURATIONS 16 | ################################ 17 | 18 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 19 | START_DATE = airflow.utils.dates.days_ago(1) 20 | # How often to Run. @daily - Once a day at Midnight 21 | SCHEDULE_INTERVAL = "@daily" 22 | # Who is listed as the owner of this DAG in the Airflow Web Server 23 | DAG_OWNER = "operations" 24 | # List of email address to send the SLA report & the email subject 25 | EMAIL_ADDRESSES = [] 26 | EMAIL_SUBJECT = f'Airflow SLA Report - {date.today().strftime("%b %d, %Y")}' 27 | # Timeframes to calculate the metrics on in days 28 | SHORT_TIMEFRAME_IN_DAYS = 1 29 | MEDIUM_TIMEFRAME_IN_DAYS = 3 30 | LONG_TIMEFRAME_IN_DAYS = 7 31 | 32 | ################################ 33 | # END CONFIGURATIONS 34 | ################################ 35 | 36 | # Setting up a variable to calculate today's date. 37 | dt = date.today() 38 | today = datetime.combine(dt, datetime.min.time()) 39 | 40 | # Calculating duration intervals between the defined timeframes and today 41 | short_timeframe_start_date = today - timedelta(days=SHORT_TIMEFRAME_IN_DAYS) 42 | medium_timeframe_start_date = today - timedelta(days=MEDIUM_TIMEFRAME_IN_DAYS) 43 | long_timeframe_start_date = today - timedelta(days=LONG_TIMEFRAME_IN_DAYS) 44 | 45 | pd.options.display.max_columns = None 46 | 47 | 48 | def retrieve_metadata(): 49 | """Retrieve data from taskinstance, dagrun and serialized dag tables to do some processing to create base tables. 50 | 51 | Returns: 52 | dataframe: Base tables sla_run_detail and serialized_dags_slas for further processing. 
53 | """ 54 | try: 55 | pd.set_option("display.max_colwidth", None) 56 | 57 | session = settings.Session() 58 | taskinstance = session.query( 59 | TaskInstance.task_id, 60 | TaskInstance.dag_id, 61 | TaskInstance.run_id, 62 | TaskInstance.state, 63 | TaskInstance.start_date, 64 | TaskInstance.end_date, 65 | TaskInstance.duration, 66 | TaskInstance.operator, 67 | TaskInstance.queued_dttm, 68 | ).all() 69 | taskinstance_df = pd.DataFrame(taskinstance) 70 | taskinstance_df["run_date"] = pd.to_datetime(taskinstance_df["start_date"]).dt.date 71 | taskinstance_df["run_date_hour"] = pd.to_datetime(taskinstance_df["start_date"]).dt.hour 72 | taskinstance_df["task_queue_time"] = (taskinstance_df["start_date"] - 73 | taskinstance_df["queued_dttm"]).dt.total_seconds() 74 | taskinstance_df = taskinstance_df[taskinstance_df["task_queue_time"] > 0] 75 | 76 | dagrun = session.query(DagRun.dag_id, DagRun.run_id, DagRun.data_interval_end).all() 77 | dagrun_df = pd.DataFrame(dagrun) 78 | dagrun_df = dagrun_df.rename(columns={"data_interval_end": "actual_start_time"}) 79 | 80 | if "_data" in dir(SerializedDagModel): 81 | serializeddag = session.query(SerializedDagModel._data).all() 82 | data_col = "_data" 83 | else: 84 | serializeddag = session.query(SerializedDagModel.data).all() 85 | data_col = "data" 86 | 87 | serializeddag_df = pd.DataFrame(serializeddag) 88 | serializeddag_json_normalize = pd.json_normalize( 89 | pd.DataFrame(serializeddag_df[data_col].apply(json.dumps).apply(json.loads).values.tolist())["dag"], 90 | "tasks", ["_dag_id"]) 91 | serializeddag_filtered = serializeddag_json_normalize[["_dag_id", "task_id", "sla"]] 92 | serializeddag_filtered = serializeddag_filtered.rename(columns={"_dag_id": "dag_id"}) 93 | serialized_dags_slas = serializeddag_filtered[serializeddag_filtered["sla"].notnull()] 94 | 95 | run_detail = pd.merge( 96 | dagrun_df[["dag_id", "run_id", "actual_start_time"]], 97 | taskinstance_df[[ 98 | "task_id", 99 | "dag_id", 100 | "run_id", 101 | "start_date", 102 | "end_date", 103 | "duration", 104 | "task_queue_time", 105 | "state", 106 | ]], 107 | on=["run_id", "dag_id"], 108 | ) 109 | 110 | sla_run = pd.merge(run_detail, serialized_dags_slas, on=["task_id", "dag_id"]) 111 | sla_run_detail = sla_run.loc[sla_run["sla"].isnull() == False] 112 | sla_run_detail["sla_missed"] = np.where(sla_run_detail["duration"] > sla_run_detail["sla"], 1, 0) 113 | sla_run_detail["run_date_hour"] = pd.to_datetime(sla_run_detail["start_date"]).dt.hour 114 | # sla_run_detail["start_dt"] = sla_run_detail["start_date"].dt.date 115 | sla_run_detail["start_dt"] = sla_run_detail["start_date"].dt.strftime("%A, %b %d") 116 | sla_run_detail["start_date"] = pd.to_datetime(sla_run_detail["start_date"]).dt.tz_localize(None) 117 | 118 | return sla_run_detail, serialized_dags_slas 119 | 120 | except: 121 | no_metadata_found() 122 | 123 | 124 | def sla_miss_count_df(input_df, timeframe): 125 | """Group the data based on dagid and taskid and calculate its count and avg duration 126 | 127 | Args: 128 | input_df (dataframe): sla_run_detail base table 129 | timeframe (integer): Timeframes entered by the user according to which KPI's will be calculated 130 | 131 | Returns: 132 | dataframes: Intermediate output dataframes required for further processing of data 133 | """ 134 | df1 = input_df[input_df["duration"] > input_df["sla"]][input_df["start_date"].between(timeframe, today)] 135 | df2 = df1.groupby(["dag_id", "task_id"]).size().to_frame(name="size").reset_index() 136 | df3 = df1.groupby(["dag_id", 
"task_id"])["duration"].mean().reset_index() 137 | return df2, df3 138 | 139 | 140 | def sla_miss_pct(input_df1, input_df2): 141 | """Calculate SLA miss % 142 | 143 | Args: 144 | input_df1 (dataframe): dataframe consisting of filtered records as per duration and SLA misses grouped by DagId and TaskId 145 | input_df2 (dataframe): dataframe consisting of all the records as per duration and SLA misses grouped by DagId and TaskId 146 | 147 | Returns: 148 | String containing the SLA miss % 149 | """ 150 | 151 | sla_pct = (np.nan_to_num( 152 | ((input_df1["size"].sum() * 100) / (input_df2["total_count"].sum())), 153 | 0, 154 | ).round(2)) 155 | return sla_pct 156 | 157 | 158 | def sla_total_counts_df(input_df): 159 | """Group the data based on dagid and taskid and calculate its count 160 | 161 | Args: 162 | input_df (dataframe): base SLA run table 163 | 164 | Returns: 165 | Dataframe containing the total count of SLA grouped by dag_id and task_id 166 | """ 167 | df = (input_df.groupby(["dag_id", 168 | "task_id"]).size().to_frame(name="total_count").sort_values("total_count", 169 | ascending=False).reset_index()) 170 | return df 171 | 172 | 173 | def sla_run_counts_df(input_df, timeframe): 174 | """Filters the sla_run_detail dataframe between the current date and the timeframe mentioned 175 | 176 | Args: 177 | input_df (dataframe): base SLA run table 178 | 179 | Returns: 180 | dataframe: missed SLAs within provided timeframe 181 | """ 182 | tf = input_df[input_df["start_date"].between(timeframe, today)] 183 | return tf 184 | 185 | 186 | def sla_daily_miss(sla_run_detail): 187 | """SLA miss table which gives us details about the date, SLA miss % on that date and top DAG violators for the long timeframe. 188 | 189 | Args: 190 | sla_run_detail (dataframe): Table consiting of details of all the dag runs that happened 191 | 192 | Returns: 193 | dataframe: sla_daily_miss output dataframe 194 | """ 195 | try: 196 | 197 | sla_pastweek_run_count_df = sla_run_detail[sla_run_detail["start_date"].between( 198 | long_timeframe_start_date, today)] 199 | 200 | daily_sla_miss_count = sla_run_detail[sla_run_detail["duration"] > sla_run_detail["sla"]][ 201 | sla_run_detail["start_date"].between(long_timeframe_start_date, today)].sort_values(["start_date"]) 202 | 203 | daily_sla_miss_count_datewise = (daily_sla_miss_count.groupby( 204 | ["start_dt"]).size().to_frame(name="slamiss_count_datewise").reset_index()) 205 | daily_sla_count_df = (daily_sla_miss_count.groupby(["start_dt", "dag_id", 206 | "task_id"]).size().to_frame(name="size").reset_index()) 207 | daily_sla_totalcount_datewise = (sla_pastweek_run_count_df.groupby( 208 | ["start_dt"]).size().to_frame(name="total_count").sort_values("start_dt", ascending=False).reset_index()) 209 | daily_sla_totalcount_datewise_taskwise = (sla_pastweek_run_count_df.groupby( 210 | ["start_dt", "dag_id", 211 | "task_id"]).size().to_frame(name="totalcount").sort_values("start_dt", ascending=False).reset_index()) 212 | daily_sla_miss_pct_df = pd.merge(daily_sla_miss_count_datewise, daily_sla_totalcount_datewise, on=["start_dt"]) 213 | daily_sla_miss_pct_df["sla_miss_percent"] = (daily_sla_miss_pct_df["slamiss_count_datewise"] * 100 / 214 | daily_sla_miss_pct_df["total_count"]).round(2) 215 | daily_sla_miss_pct_df["sla_miss_percent(missed_tasks/total_tasks)"] = daily_sla_miss_pct_df.apply( 216 | lambda x: "%s%s(%s/%s)" % (x["sla_miss_percent"], "% ", x["slamiss_count_datewise"], x["total_count"]), 217 | axis=1, 218 | ) 219 | 220 | daily_sla_miss_percent = 
daily_sla_miss_pct_df.filter( 221 | ["start_dt", "sla_miss_percent(missed_tasks/total_tasks)"], axis=1) 222 | daily_sla_miss_df_pct1 = pd.merge( 223 | daily_sla_count_df, 224 | daily_sla_totalcount_datewise_taskwise, 225 | on=["start_dt", "dag_id", "task_id"], 226 | ) 227 | daily_sla_miss_df_pct1["pct_violator"] = (daily_sla_miss_df_pct1["size"] * 100 / 228 | daily_sla_miss_df_pct1["totalcount"]).round(2) 229 | daily_sla_miss_df_pct_kpi = (daily_sla_miss_df_pct1.sort_values("pct_violator", 230 | ascending=False).groupby("start_dt", 231 | sort=False).head(1)) 232 | 233 | daily_sla_miss_df_pct_kpi["top_pct_violator"] = daily_sla_miss_df_pct_kpi.apply( 234 | lambda x: "%s: %s (%s%s" % (x["dag_id"], x["task_id"], x["pct_violator"], "%)"), 235 | axis=1, 236 | ) 237 | 238 | daily_slamiss_percent_violator = daily_sla_miss_df_pct_kpi.filter(["start_dt", "top_pct_violator"], axis=1) 239 | daily_slamiss_df_absolute_kpi = (daily_sla_miss_df_pct1.sort_values("size", ascending=False).groupby( 240 | "start_dt", sort=False).head(1)) 241 | 242 | daily_slamiss_df_absolute_kpi["top_absolute_violator"] = daily_slamiss_df_absolute_kpi.apply( 243 | lambda x: "%s: %s (%s/%s)" % (x["dag_id"], x["task_id"], x["size"], x["totalcount"]), 244 | axis=1, 245 | ) 246 | 247 | daily_slamiss_absolute_violator = daily_slamiss_df_absolute_kpi.filter(["start_dt", "top_absolute_violator"], 248 | axis=1) 249 | daily_slamiss_pct_last7days = pd.merge( 250 | pd.merge(daily_sla_miss_percent, daily_slamiss_percent_violator, on="start_dt"), 251 | daily_slamiss_absolute_violator, 252 | on="start_dt", 253 | ).sort_values("start_dt", ascending=False) 254 | 255 | daily_slamiss_pct_last7days = daily_slamiss_pct_last7days.rename( 256 | columns={ 257 | "top_pct_violator": "Top Violator (%)", 258 | "top_absolute_violator": "Top Violator (absolute)", 259 | "start_dt": "Date", 260 | "sla_miss_percent(missed_tasks/total_tasks)": "SLA Miss % (Missed/Total Tasks)", 261 | }) 262 | return daily_slamiss_pct_last7days 263 | except: 264 | daily_slamiss_pct_last7days = pd.DataFrame( 265 | columns=["Date", "SLA Miss % (Missed/Total Tasks)", "Top Violator (%)", "Top Violator (absolute)"]) 266 | return daily_slamiss_pct_last7days 267 | 268 | 269 | def sla_hourly_miss(sla_run_detail): 270 | """Generate hourly SLA miss table giving us details about the hour, SLA miss % for that hour, top DAG violators 271 | and the longest running task and avg task queue time for the given short timeframe. 
272 | 273 | Args: 274 | sla_run_detail (dataframe): Base table consiting of details of all the dag runs that happened 275 | 276 | Returns: 277 | datframe, list: observations_hourly_reccomendations list and sla_miss_percent_past_day_hourly dataframe 278 | """ 279 | try: 280 | 281 | sla_miss_count_past_day = sla_run_detail[sla_run_detail["duration"] > sla_run_detail["sla"]][ 282 | sla_run_detail["start_date"].between(short_timeframe_start_date, today)] 283 | 284 | sla_miss_count_hourly = (sla_miss_count_past_day.groupby( 285 | ["run_date_hour"]).size().to_frame(name="slamiss_count_hourwise").reset_index()) 286 | sla_count_df_past_day_hourly = (sla_miss_count_past_day.groupby(["run_date_hour", "dag_id", "task_id" 287 | ]).size().to_frame(name="size").reset_index()) 288 | sla_avg_execution_time_taskwise_hourly = (sla_miss_count_past_day.groupby( 289 | ["run_date_hour", "dag_id", "task_id"])["duration"].mean().reset_index()) 290 | sla_avg_execution_time_hourly = (sla_avg_execution_time_taskwise_hourly.sort_values( 291 | "duration", ascending=False).groupby("run_date_hour", sort=False).head(1)) 292 | 293 | sla_pastday_run_count_df = sla_run_detail[sla_run_detail["start_date"].between( 294 | short_timeframe_start_date, today)] 295 | sla_avg_queue_time_hourly = (sla_pastday_run_count_df.groupby(["run_date_hour" 296 | ])["task_queue_time"].mean().reset_index()) 297 | sla_totalcount_hourly = (sla_pastday_run_count_df.groupby( 298 | ["run_date_hour"]).size().to_frame(name="total_count").sort_values("run_date_hour", 299 | ascending=False).reset_index()) 300 | sla_totalcount_taskwise_hourly = (sla_pastday_run_count_df.groupby( 301 | ["run_date_hour", "dag_id", 302 | "task_id"]).size().to_frame(name="totalcount").sort_values("run_date_hour", ascending=False).reset_index()) 303 | sla_miss_pct_past_day_hourly = pd.merge(sla_miss_count_hourly, sla_totalcount_hourly, on=["run_date_hour"]) 304 | sla_miss_pct_past_day_hourly["sla_miss_percent"] = (sla_miss_pct_past_day_hourly["slamiss_count_hourwise"] * 305 | 100 / sla_miss_pct_past_day_hourly["total_count"]).round(2) 306 | 307 | sla_miss_pct_past_day_hourly["sla_miss_percent(missed_tasks/total_tasks)"] = sla_miss_pct_past_day_hourly.apply( 308 | lambda x: "%s%s(%s/%s)" % ( 309 | x["sla_miss_percent"].astype(int), 310 | "% ", 311 | x["slamiss_count_hourwise"].astype(int), 312 | x["total_count"].astype(int), 313 | ), 314 | axis=1, 315 | ) 316 | 317 | sla_highest_sla_miss_hour = (sla_miss_pct_past_day_hourly[["run_date_hour", "sla_miss_percent" 318 | ]].sort_values("sla_miss_percent", 319 | ascending=False).head(1)) 320 | sla_highest_tasks_hour = (sla_miss_pct_past_day_hourly[["run_date_hour", 321 | "total_count"]].sort_values("total_count", 322 | ascending=False).head(1)) 323 | 324 | sla_miss_percent_past_day = sla_miss_pct_past_day_hourly.filter( 325 | ["run_date_hour", "sla_miss_percent(missed_tasks/total_tasks)"], axis=1) 326 | 327 | sla_miss_temp_df_pct1_past_day = pd.merge( 328 | sla_count_df_past_day_hourly, 329 | sla_totalcount_taskwise_hourly, 330 | on=["run_date_hour", "dag_id", "task_id"], 331 | ) 332 | 333 | sla_miss_temp_df_pct1_past_day["pct_violator"] = (sla_miss_temp_df_pct1_past_day["size"] * 100 / 334 | sla_miss_temp_df_pct1_past_day["totalcount"]).round(2) 335 | sla_miss_pct_past_day_hourly = (sla_miss_temp_df_pct1_past_day.sort_values( 336 | "pct_violator", ascending=False).groupby("run_date_hour", sort=False).head(1)) 337 | 338 | sla_miss_pct_past_day_hourly["top_pct_violator"] = sla_miss_pct_past_day_hourly.apply( 339 | lambda x: "%s: %s 
(%s%s" % (x["dag_id"], x["task_id"], x["pct_violator"], "%)"), 340 | axis=1, 341 | ) 342 | 343 | sla_miss_percent_violator_past_day_hourly = sla_miss_pct_past_day_hourly.filter( 344 | ["run_date_hour", "top_pct_violator"], axis=1) 345 | sla_miss_absolute_kpi_past_day_hourly = (sla_miss_temp_df_pct1_past_day.sort_values( 346 | "size", ascending=False).groupby("run_date_hour", sort=False).head(1)) 347 | sla_miss_absolute_kpi_past_day_hourly["top_absolute_violator"] = sla_miss_absolute_kpi_past_day_hourly.apply( 348 | lambda x: "%s: %s (%s/%s)" % (x["dag_id"], x["task_id"], x["size"], x["totalcount"]), 349 | axis=1, 350 | ) 351 | 352 | sla_miss_absolute_violator_past_day_hourly = sla_miss_absolute_kpi_past_day_hourly.filter( 353 | ["run_date_hour", "top_absolute_violator"], axis=1) 354 | slamiss_pct_exectime = pd.merge( 355 | pd.merge( 356 | sla_miss_percent_past_day, 357 | sla_miss_percent_violator_past_day_hourly, 358 | on="run_date_hour", 359 | ), 360 | sla_miss_absolute_violator_past_day_hourly, 361 | on="run_date_hour", 362 | ).sort_values("run_date_hour", ascending=False) 363 | 364 | sla_avg_execution_time_hourly["duration"] = ( 365 | sla_avg_execution_time_hourly["duration"].round(0).astype(int).astype(str)) 366 | sla_avg_execution_time_hourly["longest_running_task"] = sla_avg_execution_time_hourly.apply( 367 | lambda x: "%s: %s (%ss)" % (x["dag_id"], x["task_id"], x["duration"]), axis=1) 368 | 369 | sla_longest_running_task_hourly = sla_avg_execution_time_hourly.filter( 370 | ["run_date_hour", "longest_running_task"], axis=1) 371 | 372 | sla_miss_pct = pd.merge(slamiss_pct_exectime, sla_longest_running_task_hourly, on=["run_date_hour"]) 373 | sla_miss_percent_past_day_hourly = pd.merge(sla_miss_pct, sla_avg_queue_time_hourly, on=["run_date_hour"]) 374 | sla_miss_percent_past_day_hourly["task_queue_time"] = ( 375 | sla_miss_percent_past_day_hourly["task_queue_time"].round(0).astype(int).apply(str)) 376 | sla_longest_queue_time_hourly = (sla_miss_percent_past_day_hourly[["run_date_hour", "task_queue_time" 377 | ]].sort_values("task_queue_time", 378 | ascending=False).head(1)) 379 | 380 | sla_miss_percent_past_day_hourly.rename( 381 | columns={ 382 | "task_queue_time": "Average Task Queue Time (s)", 383 | "longest_running_task": "Longest Running Task", 384 | "top_pct_violator": "Top Violator (%)", 385 | "top_absolute_violator": "Top Violator (absolute)", 386 | "run_date_hour": "Hour", 387 | "sla_miss_percent(missed_tasks/total_tasks)": "SLA miss % (Missed/Total Tasks)", 388 | }, 389 | inplace=True, 390 | ) 391 | 392 | obs1_hourlytrend = "Hour " + (sla_highest_sla_miss_hour["run_date_hour"].apply(str) + 393 | " had the highest percentage of SLA misses").to_string(index=False) 394 | obs2_hourlytrend = "Hour " + ( 395 | sla_longest_queue_time_hourly["run_date_hour"].apply(str) + " had the longest average queue time (" + 396 | sla_longest_queue_time_hourly["task_queue_time"].apply(str) + " seconds)").to_string(index=False) 397 | obs3_hourlytrend = "Hour " + (sla_highest_tasks_hour["run_date_hour"].apply(str) + 398 | " had the most tasks running").to_string(index=False) 399 | 400 | observations_hourly_reccomendations = [obs1_hourlytrend, obs2_hourlytrend, obs3_hourlytrend] 401 | return observations_hourly_reccomendations, sla_miss_percent_past_day_hourly 402 | except: 403 | sla_miss_percent_past_day_hourly = pd.DataFrame(columns=[ 404 | "SLA Miss % (Missed/Total Tasks)", 405 | "Top Violator (%)", 406 | "Top Violator (absolute)", 407 | "Longest Running Task", 408 | "Hour", 409 | "Average Task 
Queue Time (seconds)", 410 | ]) 411 | observations_hourly_reccomendations = "" 412 | return observations_hourly_reccomendations, sla_miss_percent_past_day_hourly 413 | 414 | 415 | def sla_dag_miss(sla_run_detail, serialized_dags_slas): 416 | """ 417 | Generate SLA dag miss table giving us details about the SLA miss % for the given timeframes along with the average execution time and 418 | reccomendations for weekly observations. 419 | 420 | Args: 421 | sla_run_detail (dataframe): Base table consiting of details of all the dag runs that happened 422 | serialized_dags_slas (dataframe): table consisting of all the dag details 423 | 424 | Returns: 425 | 2 lists consisting of sla_daily_miss and sla_dag_miss reccomendations and 1 dataframe consisting of sla_dag_miss reccomendation 426 | """ 427 | try: 428 | 429 | dag_sla_count_df_weekprior, dag_sla_count_df_weekprior_avgduration = sla_miss_count_df( 430 | sla_run_detail, long_timeframe_start_date) 431 | dag_sla_count_df_threedayprior, dag_sla_count_df_threedayprior_avgduration = sla_miss_count_df( 432 | sla_run_detail, medium_timeframe_start_date) 433 | dag_sla_count_df_onedayprior, dag_sla_count_df_onedayprior_avgduration = sla_miss_count_df( 434 | sla_run_detail, short_timeframe_start_date) 435 | 436 | dag_sla_run_count_week_prior = sla_run_counts_df(sla_run_detail, long_timeframe_start_date) 437 | dag_sla_run_count_three_day_prior = sla_run_counts_df(sla_run_detail, medium_timeframe_start_date) 438 | dag_sla_run_count_one_day_prior = sla_run_counts_df(sla_run_detail, short_timeframe_start_date) 439 | 440 | dag_sla_run_count_week_prior_success = ( 441 | dag_sla_run_count_week_prior[dag_sla_run_count_week_prior["state"] == "success"].groupby( 442 | ["dag_id", "task_id"]).size().to_frame(name="success_count").reset_index()) 443 | dag_sla_run_count_week_prior_failure = ( 444 | dag_sla_run_count_week_prior[dag_sla_run_count_week_prior["state"] == "failed"].groupby( 445 | ["dag_id", "task_id"]).size().to_frame(name="failure_count").reset_index()) 446 | 447 | dag_sla_run_count_week_prior_success_duration_stats = ( 448 | dag_sla_run_count_week_prior[dag_sla_run_count_week_prior["state"] == "success"].groupby( 449 | ["dag_id", "task_id"])["duration"].agg(["mean", "min", "max"]).reset_index()) 450 | dag_sla_run_count_week_prior_failure_duration_stats = ( 451 | dag_sla_run_count_week_prior[dag_sla_run_count_week_prior["state"] == "failed"].groupby( 452 | ["dag_id", "task_id"])["duration"].agg(["mean", "min", "max"]).reset_index()) 453 | 454 | dag_sla_totalcount_week_prior = sla_total_counts_df(dag_sla_run_count_week_prior) 455 | dag_sla_totalcount_three_day_prior = sla_total_counts_df(dag_sla_run_count_three_day_prior) 456 | dag_sla_totalcount_one_day_prior = sla_total_counts_df(dag_sla_run_count_one_day_prior) 457 | 458 | dag_obs5_sladpercent_weekprior = sla_miss_pct(dag_sla_count_df_weekprior, dag_sla_totalcount_week_prior) 459 | dag_obs6_sladpercent_threedayprior = sla_miss_pct(dag_sla_count_df_threedayprior, 460 | dag_sla_totalcount_three_day_prior) 461 | dag_obs7_sladpercent_onedayprior = sla_miss_pct(dag_sla_count_df_onedayprior, dag_sla_totalcount_one_day_prior) 462 | 463 | dag_obs7_sladetailed_week = f'In the past {str(LONG_TIMEFRAME_IN_DAYS)} days, {dag_obs5_sladpercent_weekprior}% of the tasks have missed their SLA' 464 | dag_obs6_sladetailed_threeday = f'In the past {str(MEDIUM_TIMEFRAME_IN_DAYS)} days, {dag_obs6_sladpercent_threedayprior}% of the tasks have missed their SLA' 465 | dag_obs5_sladetailed_oneday = f'In the past 
{str(SHORT_TIMEFRAME_IN_DAYS)} days, {dag_obs7_sladpercent_onedayprior}% of the tasks have missed their SLA' 466 | 467 | dag_sla_miss_pct_df_week_prior = pd.merge( 468 | pd.merge(dag_sla_count_df_weekprior, dag_sla_totalcount_week_prior, on=["dag_id", "task_id"]), 469 | dag_sla_count_df_weekprior_avgduration, 470 | on=["dag_id", "task_id"], 471 | ) 472 | dag_sla_miss_pct_df_threeday_prior = pd.merge( 473 | pd.merge( 474 | dag_sla_count_df_threedayprior, 475 | dag_sla_totalcount_three_day_prior, 476 | on=["dag_id", "task_id"], 477 | ), 478 | dag_sla_count_df_threedayprior_avgduration, 479 | on=["dag_id", "task_id"], 480 | ) 481 | dag_sla_miss_pct_df_oneday_prior = pd.merge( 482 | pd.merge( 483 | dag_sla_count_df_onedayprior, 484 | dag_sla_totalcount_one_day_prior, 485 | on=["dag_id", "task_id"], 486 | ), 487 | dag_sla_count_df_onedayprior_avgduration, 488 | on=["dag_id", "task_id"], 489 | ) 490 | 491 | dag_sla_miss_pct_df_week_prior["sla_miss_percent_week"] = ( 492 | dag_sla_miss_pct_df_week_prior["size"] * 100 / dag_sla_miss_pct_df_week_prior["total_count"]).round(2) 493 | dag_sla_miss_pct_df_threeday_prior["sla_miss_percent_three_day"] = ( 494 | dag_sla_miss_pct_df_threeday_prior["size"] * 100 / 495 | dag_sla_miss_pct_df_threeday_prior["total_count"]).round(2) 496 | dag_sla_miss_pct_df_oneday_prior["sla_miss_percent_one_day"] = ( 497 | dag_sla_miss_pct_df_oneday_prior["size"] * 100 / dag_sla_miss_pct_df_oneday_prior["total_count"]).round(2) 498 | 499 | dag_sla_miss_pct_df1 = dag_sla_miss_pct_df_week_prior.merge(dag_sla_miss_pct_df_threeday_prior, 500 | on=["dag_id", "task_id"], 501 | how="left") 502 | dag_sla_miss_pct_df2 = dag_sla_miss_pct_df1.merge(dag_sla_miss_pct_df_oneday_prior, 503 | on=["dag_id", "task_id"], 504 | how="left") 505 | dag_sla_miss_pct_df3 = dag_sla_miss_pct_df2.merge(serialized_dags_slas, on=["dag_id", "task_id"], how="left") 506 | 507 | dag_sla_miss_pct_detailed = dag_sla_miss_pct_df3.filter( 508 | [ 509 | "dag_id", 510 | "task_id", 511 | "sla", 512 | "sla_miss_percent_week", 513 | "duration_x", 514 | "sla_miss_percent_three_day", 515 | "duration_y", 516 | "sla_miss_percent_one_day", 517 | "duration", 518 | ], 519 | axis=1, 520 | ) 521 | 522 | float_column_names = dag_sla_miss_pct_detailed.select_dtypes(float).columns 523 | dag_sla_miss_pct_detailed[float_column_names] = dag_sla_miss_pct_detailed[float_column_names].fillna(0) 524 | 525 | round_int_column_names = ["duration_x", "duration_y", "duration"] 526 | dag_sla_miss_pct_detailed[round_int_column_names] = dag_sla_miss_pct_detailed[round_int_column_names].round( 527 | 0).astype(int) 528 | dag_sla_miss_pct_detailed["sla"] = dag_sla_miss_pct_detailed["sla"].astype(int) 529 | dag_sla_miss_pct_detailed["Dag: Task"] = (dag_sla_miss_pct_detailed["dag_id"].apply(str) + ": " + 530 | dag_sla_miss_pct_detailed["task_id"].apply(str)) 531 | 532 | short_timeframe_col_name = f'{SHORT_TIMEFRAME_IN_DAYS}-day SLA Miss % (avg execution time)' 533 | medium_timeframe_col_name = f'{MEDIUM_TIMEFRAME_IN_DAYS}-day SLA Miss % (avg execution time)' 534 | long_timeframe_col_name = f'{LONG_TIMEFRAME_IN_DAYS}-day SLA Miss % (avg execution time)' 535 | 536 | dag_sla_miss_pct_detailed[short_timeframe_col_name] = ( 537 | dag_sla_miss_pct_detailed["sla_miss_percent_one_day"].apply(str) + "% (" + 538 | dag_sla_miss_pct_detailed["duration"].apply(str) + "s)") 539 | 540 | dag_sla_miss_pct_detailed[medium_timeframe_col_name] = ( 541 | dag_sla_miss_pct_detailed["sla_miss_percent_three_day"].apply(str) + "% (" + 542 | 
dag_sla_miss_pct_detailed["duration_y"].apply(str) + "s)") 543 | 544 | dag_sla_miss_pct_detailed[long_timeframe_col_name] = ( 545 | dag_sla_miss_pct_detailed["sla_miss_percent_week"].apply(str) + "% (" + 546 | dag_sla_miss_pct_detailed["duration_x"].apply(str) + "s)") 547 | 548 | dag_sla_miss_pct_filtered = dag_sla_miss_pct_detailed.filter( 549 | [ 550 | "Dag: Task", 551 | "sla", 552 | short_timeframe_col_name, 553 | medium_timeframe_col_name, 554 | long_timeframe_col_name, 555 | ], 556 | axis=1, 557 | ).sort_values(by=[long_timeframe_col_name], ascending=False) 558 | 559 | dag_sla_miss_pct_filtered.rename(columns={"sla": "Current SLA (s)"}, inplace=True) 560 | 561 | dag_sla_miss_pct_recc1 = dag_sla_miss_pct_detailed.nlargest(3, ["sla_miss_percent_week"]).fillna(0) 562 | dag_sla_miss_pct_recc2 = dag_sla_miss_pct_recc1.filter( 563 | ["dag_id", "task_id", "sla", "sla_miss_percent_week", "Dag: Task"], axis=1).fillna(0) 564 | dag_sla_miss_pct_df4_recc3 = pd.merge( 565 | pd.merge( 566 | dag_sla_miss_pct_recc2, 567 | dag_sla_run_count_week_prior_success, 568 | on=["dag_id", "task_id"], 569 | ), 570 | dag_sla_run_count_week_prior_failure, 571 | on=["dag_id", "task_id"], 572 | how="left", 573 | ).fillna(0) 574 | dag_sla_miss_pct_df4_recc4 = pd.merge( 575 | pd.merge( 576 | dag_sla_miss_pct_df4_recc3, 577 | dag_sla_run_count_week_prior_success_duration_stats, 578 | on=["dag_id", "task_id"], 579 | how="left", 580 | ), 581 | dag_sla_run_count_week_prior_failure_duration_stats, 582 | on=["dag_id", "task_id"], 583 | how="left", 584 | ).fillna(0) 585 | dag_sla_miss_pct_df4_recc4["Recommendations"] = ( 586 | dag_sla_miss_pct_df4_recc4["Dag: Task"].apply(str) + " - Of the " + 587 | dag_sla_miss_pct_df4_recc4["sla_miss_percent_week"].apply(str) + 588 | "% of the tasks that missed their SLA of " + dag_sla_miss_pct_df4_recc4["sla"].apply(str) + " seconds, " + 589 | dag_sla_miss_pct_df4_recc4["success_count"].astype(int).apply(str) + " succeeded (min: " + 590 | dag_sla_miss_pct_df4_recc4["min_x"].round(0).astype(int).apply(str) + "s, avg: " + 591 | dag_sla_miss_pct_df4_recc4["mean_x"].round(0).astype(int).apply(str) + "s, max: " + 592 | dag_sla_miss_pct_df4_recc4["max_x"].round(0).astype(int).apply(str) + "s) & " + 593 | dag_sla_miss_pct_df4_recc4["failure_count"].astype(int).apply(str) + " failed (min: " + 594 | dag_sla_miss_pct_df4_recc4["min_y"].round(0).astype(int).apply(str) + "s, avg: " + 595 | dag_sla_miss_pct_df4_recc4["mean_y"].round(0).astype(int).apply(str) + "s, max: " + 596 | dag_sla_miss_pct_df4_recc4["max_y"].round(0).fillna(0).astype(int).apply(str) + "s)") 597 | 598 | daily_weeklytrend_observations_loop = [ 599 | dag_obs5_sladetailed_oneday, 600 | dag_obs6_sladetailed_threeday, 601 | dag_obs7_sladetailed_week, 602 | ] 603 | 604 | dag_sla_miss_trend = dag_sla_miss_pct_df4_recc4["Recommendations"].tolist() 605 | 606 | return daily_weeklytrend_observations_loop, dag_sla_miss_trend, dag_sla_miss_pct_filtered 607 | except: 608 | short_timeframe_col_name = f'{SHORT_TIMEFRAME_IN_DAYS}-Day SLA miss % (avg execution time)' 609 | medium_timeframe_col_name = f'{MEDIUM_TIMEFRAME_IN_DAYS}-Day SLA miss % (avg execution time)' 610 | long_timeframe_col_name = f'{LONG_TIMEFRAME_IN_DAYS}-Day SLA miss % (avg execution time)' 611 | daily_weeklytrend_observations_loop = "" 612 | dag_sla_miss_trend = "" 613 | dag_sla_miss_pct_filtered = pd.DataFrame(columns=[ 614 | "Dag: Task", 615 | "Current SLA", 616 | short_timeframe_col_name, 617 | medium_timeframe_col_name, 618 | long_timeframe_col_name, 619 | ]) 620 | return 
daily_weeklytrend_observations_loop, dag_sla_miss_trend, dag_sla_miss_pct_filtered 621 | 622 | 623 | def sla_miss_report(): 624 | """Embed all the resulting output dataframes in HTML format and send the email report to the intended recipients.""" 625 | 626 | sla_run_detail, serialized_dags_slas = retrieve_metadata() 627 | daily_slamiss_pct_last7days = sla_daily_miss(sla_run_detail) 628 | observations_hourly_reccomendations, sla_miss_percent_past_day_hourly = sla_hourly_miss(sla_run_detail) 629 | daily_weeklytrend_observations_loop, dag_sla_miss_trend, dag_sla_miss_pct_filtered = sla_dag_miss( 630 | sla_run_detail, serialized_dags_slas) 631 | 632 | new_line = '\n' 633 | print(f""" 634 | ------------------- START OF REPORT ------------------- 635 | {EMAIL_SUBJECT} 636 | 637 | Daily SLA Misses 638 | {new_line.join(map(str, daily_weeklytrend_observations_loop))} 639 | 640 | {daily_slamiss_pct_last7days.to_markdown(index=False)} 641 | 642 | Hourly SLA Misses 643 | {new_line.join(map(str, observations_hourly_reccomendations))} 644 | 645 | {sla_miss_percent_past_day_hourly.to_markdown(index=False)} 646 | 647 | DAG SLA Misses 648 | {new_line.join(map(str, dag_sla_miss_trend))} 649 | 650 | {dag_sla_miss_pct_filtered.to_markdown(index=False)} 651 | 652 | ------------------- END OF REPORT ------------------- 653 | """) 654 | 655 | daily_weeklytrend_observations_loop = "".join([f"<li>{item}</li>" for item in daily_weeklytrend_observations_loop])  # wrap each observation in an HTML list item for the email body 656 | observations_hourly_reccomendations = "".join([f"<li>{item}</li>" for item in observations_hourly_reccomendations]) 657 | dag_sla_miss_trend = "".join([f"<li>{item}</li>" for item in dag_sla_miss_trend]) 658 | 659 | short_timeframe_print = f'Short: {SHORT_TIMEFRAME_IN_DAYS}d ({short_timeframe_start_date.strftime("%b %d")} - {(today - timedelta(days=1)).strftime("%b %d")})' 660 | medium_timeframe_print = f'Medium: {MEDIUM_TIMEFRAME_IN_DAYS}d ({medium_timeframe_start_date.strftime("%b %d")} - {(today - timedelta(days=1)).strftime("%b %d")})' 661 | long_timeframe_print = f'Long: {LONG_TIMEFRAME_IN_DAYS}d ({long_timeframe_start_date.strftime("%b %d")} - {(today - timedelta(days=1)).strftime("%b %d")})' 662 | timeframe_prints = f'{short_timeframe_print} | {medium_timeframe_print} | {long_timeframe_print}' 663 | 664 | html_content = f"""\ 665 | <html> 666 | <head> 667 | 692 | </head> 693 | <body> 694 | <p>The following timeframes are used to generate this report. To change them, update the [SHORT, MEDIUM, LONG]_TIMEFRAME_IN_DAYS variables in airflow-sla-miss-report.py.</p> 695 | <h4>
696 | {timeframe_prints} 697 | </h4> 698 | <h2>Daily SLA Misses</h2> 699 | <p>Daily breakdown of SLA misses and the worst offenders over the past {LONG_TIMEFRAME_IN_DAYS} day(s).</p> 700 | <ul>{daily_weeklytrend_observations_loop}</ul> 701 | {daily_slamiss_pct_last7days.to_html(index=False)} 702 | 703 | 704 | <h2>Hourly SLA Misses</h2> 705 | <p>Hourly breakdown of tasks missing their SLAs and the worst offenders over the past {SHORT_TIMEFRAME_IN_DAYS} day(s). Useful for identifying scheduling bottlenecks.</p> 706 | <ul>{observations_hourly_reccomendations}</ul> 707 | {sla_miss_percent_past_day_hourly.to_html(index=False)} 708 | 709 | <h2>DAG SLA Misses</h2> 710 | <p>Task level breakdown showcasing the SLA miss percentage & average execution time over the past {SHORT_TIMEFRAME_IN_DAYS}, {MEDIUM_TIMEFRAME_IN_DAYS}, and {LONG_TIMEFRAME_IN_DAYS} day(s). Useful for identifying trends and updating defined SLAs to meet actual execution times.</p>
711 | <ul>{dag_sla_miss_trend}</ul> 712 | {dag_sla_miss_pct_filtered.to_html(index=False)} 713 | 714 | </body> 715 | </html> 716 | """ 717 | if EMAIL_ADDRESSES: 718 | send_email(to=EMAIL_ADDRESSES, subject=EMAIL_SUBJECT, html_content=html_content) 719 | 720 | 721 | def no_metadata_found(): 722 | """Stock HTML email template to send if there is no data present in the base tables.""" 723 | 724 | print("No Data Available. Check data is present in the airflow metadata database.") 725 | 726 | html_content = f"""\ 727 | <html> 728 | <body> 729 | <h2>No Data Available</h2> 730 | <p>Check data is present in the airflow metadata database.</p> 731 | </body> 732 | </html> 733 | """ 734 | if EMAIL_ADDRESSES: 735 | send_email(to=EMAIL_ADDRESSES, subject=EMAIL_SUBJECT, html_content=html_content) 736 | 737 | 738 | default_args = { 739 | 'owner': DAG_OWNER, 740 | 'depends_on_past': False, 741 | 'email': EMAIL_ADDRESSES, 742 | 'email_on_failure': True, 743 | 'email_on_retry': False, 744 | 'start_date': START_DATE, 745 | 'retries': 1, 746 | 'retry_delay': timedelta(minutes=5), 747 | } 748 | 749 | with DAG(DAG_ID, 750 | default_args=default_args, 751 | description="DAG generating the SLA miss report", 752 | schedule_interval=SCHEDULE_INTERVAL, 753 | start_date=START_DATE, 754 | tags=['teamclairvoyant', 'airflow-maintenance-dags']) as dag: 755 | sla_miss_report_task = PythonOperator(task_id="sla_miss_report", python_callable=sla_miss_report, dag=dag) 756 | --------------------------------------------------------------------------------
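
Note: the following is a minimal, self-contained sketch (not part of airflow-sla-miss-report.py) illustrating how the SLA miss percentage used throughout sla_dag_miss() is derived: SLA misses are counted per (dag_id, task_id), divided by the total run count, multiplied by 100, and rounded to two decimals. The DataFrame contents and the sla_missed flag below are hypothetical stand-ins; the real report builds its miss and total counts from the Airflow metadata tables via sla_miss_count_df() and sla_total_counts_df().

    import pandas as pd

    # Hypothetical run history: one row per task instance, flagged if it missed its SLA.
    runs = pd.DataFrame({
        "dag_id": ["etl", "etl", "etl", "etl", "report"],
        "task_id": ["load", "load", "load", "load", "send"],
        "sla_missed": [True, False, True, False, False],
    })

    # SLA misses per (dag_id, task_id), mirroring the .groupby(...).size() aggregations in the DAG.
    misses = (runs[runs["sla_missed"]].groupby(["dag_id", "task_id"]).size()
              .to_frame(name="size").reset_index())

    # Total runs per (dag_id, task_id).
    totals = runs.groupby(["dag_id", "task_id"]).size().to_frame(name="total_count").reset_index()

    # Same formula as the report: size * 100 / total_count, rounded to two decimals.
    pct = pd.merge(misses, totals, on=["dag_id", "task_id"], how="right").fillna(0)
    pct["sla_miss_percent"] = (pct["size"] * 100 / pct["total_count"]).round(2)
    print(pct)  # etl/load -> 50.0, report/send -> 0.0

Dividing miss counts by total run counts (rather than reporting raw miss counts) is what lets the report compare tasks with very different run frequencies on a single percentage scale.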