├── .gitignore
├── README.md
├── dags
│   ├── example_dag.py
│   └── steps_example_dag
│       └── step1_example_dag.py
├── docker-compose.yaml
├── docker_mlflow
│   ├── Dockerfile
│   └── requirements.txt
└── docs
    └── imgs
        ├── airflow_home.png
        └── mlflow_home.png

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/


.DS_Store

.env

logs/

plugins/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# AIRFLOW_MLFLOW_DOCKER
![test](https://user-images.githubusercontent.com/31510474/196541725-ed4c7fca-4521-48d8-95e0-c7beaa6aa662.png)


## Table of Contents
- [Background](#background)
- [Tools Overview](#tools-overview)
- [Getting Started](#getting-started)
  * [Docker Compose Configuration](#docker-compose-configuration)
  * [Airflow](#airflow)
  * [MLflow](#mlflow)
- [Connect Airflow to MLflow](#connect-airflow-to-mlflow)
- [References](#references)


## Background
The goal of this project is to create an ecosystem in which to run **Data Pipelines** and monitor **Machine Learning Experiments**.


## Tools Overview
From the `Airflow` documentation:
```
Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows
```

From the `MLflow` documentation:
```
MLflow is an open source platform for managing the end-to-end machine learning lifecycle
```

From the `Docker` documentation:
```
Docker Compose is a tool for defining and running multi-container Docker applications.
```


## Getting Started
The first step in structuring this project is connecting `Airflow` and `MLflow` together through `docker compose`.


### Docker Compose Configuration
Create `docker-compose.yaml`, which contains the configuration of the Docker containers responsible for running the `Airflow` and `MLflow` services.
Each service runs in its own container:
* airflow-webserver
* airflow-scheduler
* airflow-worker
* airflow-init
* flower
* mlflow

backed by `postgres` (Airflow metadata database and Celery result backend) and `redis` (Celery broker).

To create and start the containers, run the following command from a terminal:
```
docker compose up -d
```
On the first run you may want to run `docker compose up airflow-init` beforehand: the `airflow-init` service defined in the compose file applies the database migrations and creates the default admin user (`airflow`/`airflow`).


### Airflow
To access the `Airflow` webserver, visit `localhost:8080`.

![img](docs/imgs/airflow_home.png)

And take a step into the `Airflow` world!

To start creating DAGs, initialize an empty folder named `dags` and populate it with as many scripts as you need.
```bash
└── dags
    └── example_dag.py
```


### MLflow
To monitor `MLflow` experiments through its tracking server, visit `localhost:600`.

![img](docs/imgs/mlflow_home.png)


## Connect Airflow to MLflow
To establish a connection between `Airflow` and `MLflow`, set the URI of the `MLflow` tracking server in your DAG code:
```
mlflow.set_tracking_uri('http://mlflow:600')
```

After that, create a new connection in `Airflow` (Admin → Connections) that points to the same host and port.
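
If you prefer to create that connection from code rather than from the UI, a minimal sketch along these lines should work when run inside one of the Airflow containers (this script is not part of the repository, and the connection id `mlflow_default` is just an example name):

```python
# create_mlflow_connection.py (hypothetical helper, run inside an Airflow container)
from airflow import settings
from airflow.models import Connection

# Points at the mlflow service name and port defined in docker-compose.yaml
conn = Connection(
    conn_id="mlflow_default",
    conn_type="http",
    host="mlflow",
    port=600,
)

session = settings.Session()
# Only add the connection if it does not already exist
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
```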


## References
* [Airflow Docker](https://airflow.apache.org/docs/apache-airflow/2.0.1/start/docker.html)
* [What is Airflow?](https://airflow.apache.org/docs/apache-airflow/stable/index.html)
* [MLflow](https://mlflow.org/docs/latest/index.html)

--------------------------------------------------------------------------------
/dags/example_dag.py:
--------------------------------------------------------------------------------
"""
This DAG is responsible for running the sequence of steps from steps_example_dag.


Author: Fabio Barbazza
Date: Oct, 2022
"""
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import logging
import mlflow
from numpy import random

from steps_example_dag import step1_example_dag

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)


# Point the MLflow client at the tracking server defined in docker-compose.yaml
mlflow.set_tracking_uri('http://mlflow:600')

experiment = mlflow.set_experiment("Airflow_Example")


def _task1():
    """
    This method is responsible for running the _task1 logic
    """
    try:
        logger.info('task1')

        step1_example_dag.run_workflow_step1()

    except Exception as err:
        logger.exception(err)
        raise err


def _task2():
    """
    This method is responsible for running the _task2 logic
    """
    try:
        logger.info('task2')

    except Exception as err:
        logger.exception(err)
        raise err


def _task3():
    """
    This method is responsible for running the _task3 logic
    """
    try:
        logger.info('task3')

    except Exception as err:
        logger.exception(err)
        raise err


with DAG(dag_id='dag_example', start_date=datetime(2022, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:

    # This run is opened whenever the scheduler parses the DAG file
    with mlflow.start_run():

        # random value logged to MLflow to identify the run
        id_run = random.rand()

        mlflow.log_param("run_id_manual", id_run)

        t1 = PythonOperator(
            task_id='t1',
            python_callable=_task1
        )

        t2 = PythonOperator(
            task_id='t2',
            python_callable=_task2
        )

        t3 = PythonOperator(
            task_id='t3',
            python_callable=_task3
        )

        # t1 runs first, then t2 and t3 can run in parallel
        t1 >> [t2, t3]
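

# --- A sketch, not part of the original DAG ----------------------------------
# Module-level code in a DAG file runs every time the scheduler parses the
# file, so the `mlflow.start_run()` block above opens an MLflow run per parse
# rather than per DAG execution. One possible variation (the function name
# `_task1_tracked` is hypothetical) opens the run inside the task callable,
# so the run is created only when the task actually executes:

def _task1_tracked():
    """Variant of _task1 that opens the MLflow run at task-execution time."""
    mlflow.set_tracking_uri('http://mlflow:600')
    mlflow.set_experiment("Airflow_Example")
    with mlflow.start_run():
        mlflow.log_param("run_id_manual", random.rand())
        step1_example_dag.run_workflow_step1()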

--------------------------------------------------------------------------------
/dags/steps_example_dag/step1_example_dag.py:
--------------------------------------------------------------------------------
"""
This module includes the logic of the first step.

Date: Oct, 2022
Author: Fabio Barbazza
"""
import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)


def run_feat_eng():
    """
    This method is responsible for running feature engineering
    """
    try:
        logger.info('run_feat_eng starting')

    except Exception as err:
        logger.exception(err)
        raise err


def run_feat_enrich():
    """
    This method is responsible for running feature enrichment
    """
    try:
        logger.info('run_feat_enrich starting')

    except Exception as err:
        logger.exception(err)
        raise err


def run_workflow_step1():
    """
    This method is responsible for running the sequence of methods of step 1
    """
    try:
        logger.info('run_workflow_step1 starting')

        run_feat_eng()

        run_feat_enrich()

        logger.info('run_workflow_step1 finished')

    except Exception as err:
        logger.exception(err)
        raise err

--------------------------------------------------------------------------------
/docker-compose.yaml:
--------------------------------------------------------------------------------
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file.
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.
#                                Default: apache/airflow:2.4.1
# AIRFLOW_UID                  - User ID in Airflow containers
#                                Default: 50000
# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode.
#
# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).
#                                Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).
#                                Default: airflow
# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.
#                                Default: ''
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.4.1}
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    _PIP_ADDITIONAL_REQUIREMENTS: 'mlflow==1.29.0'
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
  depends_on:
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy


services:

  mlflow:
    build:
      # build from the repository root so the Dockerfile can COPY docker_mlflow/requirements.txt
      context: .
      dockerfile: docker_mlflow/Dockerfile
    ports:
      - 600:600

  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  redis:
    image: redis:latest
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: always

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    restart: always

  airflow-init:
    <<: *airflow-common
    command: version
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}

  flower:
    <<: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

volumes:
  postgres-db-volume:
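
Once the stack is up, a quick way to verify that the `mlflow` service defined above is reachable on the published port is a short script along these lines (a sketch, not part of the repository; the file name `check_mlflow.py` is only a suggestion, and it assumes the mlflow 1.x client pinned in `docker_mlflow/requirements.txt`):

```python
# check_mlflow.py - smoke test for the MLflow tracking server started by docker compose
import mlflow
from mlflow.tracking import MlflowClient

# From the host the server is published on localhost:600;
# from inside another container use http://mlflow:600 instead.
mlflow.set_tracking_uri("http://localhost:600")

client = MlflowClient()
# list_experiments() is available in the mlflow 1.x client
for experiment in client.list_experiments():
    print(experiment.experiment_id, experiment.name)
```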

--------------------------------------------------------------------------------
/docker_mlflow/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.7

# Create app directory
WORKDIR /app

# Install app dependencies
COPY docker_mlflow/requirements.txt .

RUN pip install -r requirements.txt

# Bundle app source
COPY . .

EXPOSE 600

CMD [ "mlflow", "server", "--host", "0.0.0.0", "--port", "600" ]

--------------------------------------------------------------------------------
/docker_mlflow/requirements.txt:
--------------------------------------------------------------------------------
mlflow==1.29.0

--------------------------------------------------------------------------------
/docs/imgs/airflow_home.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabioba/mlops-architecture/86b8b32aa6ca2c6c781a5af7b331314d0906cda0/docs/imgs/airflow_home.png

--------------------------------------------------------------------------------
/docs/imgs/mlflow_home.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabioba/mlops-architecture/86b8b32aa6ca2c6c781a5af7b331314d0906cda0/docs/imgs/mlflow_home.png

--------------------------------------------------------------------------------