├── .gitignore
├── README.md
├── dags
│   ├── example_dag.py
│   └── steps_example_dag
│       └── step1_example_dag.py
├── docker-compose.yaml
├── docker_mlflow
│   ├── Dockerfile
│   └── requirements.txt
└── docs
    └── imgs
        ├── airflow_home.png
        └── mlflow_home.png
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | target/
76 |
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 |
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 |
84 | # pyenv
85 | .python-version
86 |
87 | # pipenv
88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
91 | # install all needed dependencies.
92 | #Pipfile.lock
93 |
94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
95 | __pypackages__/
96 |
97 | # Celery stuff
98 | celerybeat-schedule
99 | celerybeat.pid
100 |
101 | # SageMath parsed files
102 | *.sage.py
103 |
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 |
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 |
117 | # Rope project settings
118 | .ropeproject
119 |
120 | # mkdocs documentation
121 | /site
122 |
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 |
128 | # Pyre type checker
129 | .pyre/
130 |
131 |
132 | .DS_Store
133 |
134 | .env
135 |
136 | logs/
137 |
138 | plugins/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AIRFLOW_MLFLOW_DOCKER
2 | 
3 |
4 |
5 | ## Table of Contents
6 | - [Background](#background)
7 | - [Tools Overview](#tools-overview)
8 | - [Getting Started](#getting-started)
9 |   * [Docker Compose Configuration](#docker-compose-configuration)
10 |   * [Airflow](#airflow)
11 |   * [MLflow](#mlflow)
12 | - [Connect Airflow to MLflow](#connect-airflow-to-mlflow)
13 | - [References](#references)
14 |
15 |
16 |
17 |
18 | ## Background
19 | The goal of this project is to create an ecosystem in which to run **Data Pipelines** and monitor **Machine Learning Experiments**.
20 |
21 |
22 |
23 | ## Tools Overview
24 | From `Airflow` documentation:
25 | ```
26 | Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows
27 | ```
28 |
29 | From `MLflow` documentation:
30 | ```
31 | MLflow is an open source platform for managing the end-to-end machine learning lifecycle
32 | ```
33 |
34 | From `Docker` documentation:
35 | ```
36 | Docker Compose is a tool for defining and running multi-container Docker applications.
37 | ```
38 |
39 |
40 |
41 | ## Getting Started
42 | The first step in structuring this project is to connect `Airflow` and `MLflow` together, which is done with `docker compose`.
43 |
44 |
45 |
46 |
47 | ### Docker Compose Configuration
48 | Create `docker-compose.yaml`, which contains the configuration of the Docker containers responsible for running the `Airflow` and `MLflow` services.
49 | Each of these services runs in its own container, alongside the supporting `postgres` and `redis` containers:
50 | * airflow-webserver
51 | * airflow-scheduler
52 | * airflow-worker
53 | * flower
54 | * mlflow
55 |
56 | To create and start the containers, run the following command from a terminal:
57 | ```
58 | docker compose up -d
59 | ```
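On the very first start, the Airflow metadata database must be initialized and the default user created. The compose file handles this through the `airflow-init` service (note its `_AIRFLOW_DB_UPGRADE` and `_AIRFLOW_WWW_USER_CREATE` settings), so if the webserver reports an uninitialized database you can run that service on its own first:
```
docker compose up airflow-init
```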
60 |
61 |
62 |
63 | ### Airflow
64 | To access the `Airflow` web server, visit `localhost:8080` and log in with the default credentials (`airflow` / `airflow`):
65 |
66 | 
67 |
68 | And take a step into the `Airflow` world!
69 |
70 | To start creating DAGs, initialize an empty folder named `dags` and populate it with as many scripts as you need:
71 | ```bash
72 | └── dags
73 |     └── example_dag.py
74 | ```
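As a rough sketch of what such a script contains (the DAG id, task id, and callable below are illustrative placeholders, not the actual `example_dag.py` shipped in this repository):
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _hello():
    # placeholder task logic
    print("hello from Airflow")


with DAG(dag_id="my_first_dag", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    hello = PythonOperator(task_id="hello", python_callable=_hello)
```
Any file dropped into `dags/` is picked up by the scheduler, because the folder is mounted into the containers as `/opt/airflow/dags` in `docker-compose.yaml`.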
75 |
76 |
77 |
78 | ### MLflow
79 | To monitor `MLflow` experiments through its tracking server, visit `localhost:600` (the port exposed by the `mlflow` service in `docker-compose.yaml`):
80 |
81 | 
82 |
83 |
84 |
85 | ## Connect Airflow to MLflow
86 | To establish a connection between `Airflow` and `MLflow`, set the tracking URI of the `MLflow` server inside your DAG code:
87 | ```
88 | mlflow.set_tracking_uri('http://mlflow:600')
89 | ```
90 |
91 | The hostname `mlflow` works here because all the Compose services share the same Docker network, so the MLflow container is reachable by its service name. After that, create a new connection in `Airflow` (Admin → Connections) that points to the same host and port.
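As a minimal sketch of how a task can then log to the tracking server (the experiment name matches the one used in `dags/example_dag.py`; the parameter and metric values below are purely illustrative):
```python
import mlflow

# point the client at the MLflow container defined in docker-compose.yaml
mlflow.set_tracking_uri("http://mlflow:600")
mlflow.set_experiment("Airflow_Example")


def _train():
    # open a run and log an illustrative parameter and metric
    with mlflow.start_run():
        mlflow.log_param("n_estimators", 100)
        mlflow.log_metric("accuracy", 0.93)
```
This is the same pattern used in `dags/example_dag.py`, which opens a run and logs a `run_id_manual` parameter.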
92 |
93 |
94 |
95 |
96 |
97 | ## References
98 | * [Airflow Docker](https://airflow.apache.org/docs/apache-airflow/2.0.1/start/docker.html)
99 | * [What is Airflow?](https://airflow.apache.org/docs/apache-airflow/stable/index.html)
100 | * [MLflow](https://mlflow.org/docs/latest/index.html)
101 |
--------------------------------------------------------------------------------
/dags/example_dag.py:
--------------------------------------------------------------------------------
1 | """
2 | This DAG is responsible for running the sequence of steps from steps_example_dag.
3 |
4 |
5 | Author: Fabio Barbazza
6 | Date: Oct, 2022
7 | """
8 | from airflow import DAG
9 | from airflow.operators.python import PythonOperator
10 | from datetime import datetime
11 | import logging
12 | import mlflow
13 | from numpy import random
14 |
15 | from steps_example_dag import step1_example_dag
16 |
17 | logging.basicConfig(level=logging.WARN)
18 | logger = logging.getLogger(__name__)
19 |
20 |
21 |
22 | mlflow.set_tracking_uri('http://mlflow:600')
23 |
24 | experiment = mlflow.set_experiment("Airflow_Example")
25 |
26 |
27 | def _task1():
28 |     """
29 |     This method is responsible for running _task1 logic
30 |     """
31 |     try:
32 |
33 |         logger.info('task1')
34 |
35 |         step1_example_dag.run_workflow_step1()
36 |
37 |     except Exception as err:
38 |         logger.exception(err)
39 |         raise err
40 |
41 | def _task2():
42 |     """
43 |     This method is responsible for running _task2 logic
44 |     """
45 |     try:
46 |
47 |         logger.info('task2')
48 |
49 |     except Exception as err:
50 |         logger.exception(err)
51 |         raise err
52 |
53 | def _task3():
54 |     """
55 |     This method is responsible for running _task3 logic
56 |     """
57 |     try:
58 |
59 |         logger.info('task3')
60 |
61 |     except Exception as err:
62 |         logger.exception(err)
63 |         raise err
64 |
65 | with DAG(dag_id='dag_example', start_date=datetime(2022, 1, 1),
66 |          schedule_interval='@daily', catchup=False) as dag:
67 |
68 |     with mlflow.start_run():
69 |
70 |         id_run = random.rand()
71 |
72 |         mlflow.log_param("run_id_manual", id_run)
73 |
74 |     t1 = PythonOperator(
75 |         task_id='t1',
76 |         op_kwargs=dag.default_args,
77 |         provide_context=True,
78 |         python_callable=_task1
79 |     )
80 |
81 |
82 |     t2 = PythonOperator(
83 |         task_id='t2',
84 |         op_kwargs=dag.default_args,
85 |         provide_context=True,
86 |         python_callable=_task2
87 |     )
88 |
89 |     t3 = PythonOperator(
90 |         task_id='t3',
91 |         op_kwargs=dag.default_args,
92 |         provide_context=True,
93 |         python_callable=_task3
94 |     )
95 |
96 |     t1 >> [t2, t3]
97 |
98 |
99 |
100 |
101 |
102 |
--------------------------------------------------------------------------------
/dags/steps_example_dag/step1_example_dag.py:
--------------------------------------------------------------------------------
1 | """
2 | This module includes the logic of the first step
3 |
4 | Date: Oct, 2022
5 | Author: Fabio Barbazza
6 | """
7 | import logging
8 |
9 | logging.basicConfig(level=logging.WARN)
10 | logger = logging.getLogger(__name__)
11 |
12 |
13 | def run_feat_eng():
14 |     """
15 |     This method is responsible for running feature engineering
16 |     """
17 |     try:
18 |
19 |
20 |         logger.info('run_feat_eng starting')
21 |
22 |     except Exception as err:
23 |         logger.exception(err)
24 |         raise err
25 |
26 |
27 | def run_feat_enrich():
28 |     """
29 |     This method is responsible for running feature enrichment
30 |     """
31 |     try:
32 |
33 |
34 |         logger.info('run_feat_enrich starting')
35 |
36 |     except Exception as err:
37 |         logger.exception(err)
38 |         raise err
39 |
40 |
41 |
42 | def run_workflow_step1():
43 |     """
44 |     This method is responsible for running the sequence of methods of the step1
45 |     """
46 |     try:
47 |
48 |         logger.info('run_workflow_step1 starting')
49 |
50 |         run_feat_eng()
51 |
52 |         run_feat_enrich()
53 |
54 |         logger.info('run_workflow_step1 finish')
55 |
56 |     except Exception as err:
57 |         logger.exception(err)
58 |         raise err
--------------------------------------------------------------------------------
/docker-compose.yaml:
--------------------------------------------------------------------------------
1 | # Licensed to the Apache Software Foundation (ASF) under one
2 | # or more contributor license agreements. See the NOTICE file
3 | # distributed with this work for additional information
4 | # regarding copyright ownership. The ASF licenses this file
5 | # to you under the Apache License, Version 2.0 (the
6 | # "License"); you may not use this file except in compliance
7 | # with the License. You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing,
12 | # software distributed under the License is distributed on an
13 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14 | # KIND, either express or implied. See the License for the
15 | # specific language governing permissions and limitations
16 | # under the License.
17 | #
18 |
19 | # Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
20 | #
21 | # WARNING: This configuration is for local development. Do not use it in a production deployment.
22 | #
23 | # This configuration supports basic configuration using environment variables or an .env file
24 | # The following variables are supported:
25 | #
26 | # AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow.
27 | # Default: apache/airflow:2.4.1
28 | # AIRFLOW_UID - User ID in Airflow containers
29 | # Default: 50000
30 | # Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode
31 | #
32 | # _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account (if requested).
33 | # Default: airflow
34 | # _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account (if requested).
35 | # Default: airflow
36 | # _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.
37 | # Default: ''
38 | #
39 | # Feel free to modify this file to suit your needs.
40 | ---
41 | version: '3'
42 | x-airflow-common:
43 |   &airflow-common
44 |   # In order to add custom dependencies or upgrade provider packages you can use your extended image.
45 |   # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
46 |   # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
47 |
48 |   image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.4.1}
49 |   # build: .
50 |   environment:
51 |     &airflow-common-env
52 |     AIRFLOW__CORE__EXECUTOR: CeleryExecutor
53 |     AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
54 |     AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
55 |     AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
56 |     AIRFLOW__CORE__FERNET_KEY: ''
57 |     AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
58 |     AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
59 |     _PIP_ADDITIONAL_REQUIREMENTS: 'mlflow==1.29.0'
60 |   volumes:
61 |     - ./dags:/opt/airflow/dags
62 |     - ./logs:/opt/airflow/logs
63 |     - ./plugins:/opt/airflow/plugins
64 |   user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
65 |   depends_on:
66 |     redis:
67 |       condition: service_healthy
68 |     postgres:
69 |       condition: service_healthy
70 |
71 |
72 |
73 | services:
74 |
75 |   mlflow:
76 |     build:
77 |       dockerfile: docker_mlflow/Dockerfile
78 |     ports:
79 |       - 600:600
80 |
81 |   postgres:
82 |     image: postgres:13
83 |     environment:
84 |       POSTGRES_USER: airflow
85 |       POSTGRES_PASSWORD: airflow
86 |       POSTGRES_DB: airflow
87 |     volumes:
88 |       - postgres-db-volume:/var/lib/postgresql/data
89 |     healthcheck:
90 |       test: ["CMD", "pg_isready", "-U", "airflow"]
91 |       interval: 5s
92 |       retries: 5
93 |     restart: always
94 |
95 |   redis:
96 |     image: redis:latest
97 |     expose:
98 |       - 6379
99 |     healthcheck:
100 |       test: ["CMD", "redis-cli", "ping"]
101 |       interval: 5s
102 |       timeout: 30s
103 |       retries: 50
104 |     restart: always
105 |
106 |   airflow-webserver:
107 |     <<: *airflow-common
108 |     command: webserver
109 |     ports:
110 |       - 8080:8080
111 |     healthcheck:
112 |       test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
113 |       interval: 10s
114 |       timeout: 10s
115 |       retries: 5
116 |     restart: always
117 |
118 |   airflow-scheduler:
119 |     <<: *airflow-common
120 |     command: scheduler
121 |     restart: always
122 |
123 |   airflow-worker:
124 |     <<: *airflow-common
125 |     command: celery worker
126 |     restart: always
127 |
128 |   airflow-init:
129 |     <<: *airflow-common
130 |     command: version
131 |     environment:
132 |       <<: *airflow-common-env
133 |       _AIRFLOW_DB_UPGRADE: 'true'
134 |       _AIRFLOW_WWW_USER_CREATE: 'true'
135 |       _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
136 |       _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
137 |
138 |   flower:
139 |     <<: *airflow-common
140 |     command: celery flower
141 |     ports:
142 |       - 5555:5555
143 |     healthcheck:
144 |       test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
145 |       interval: 10s
146 |       timeout: 10s
147 |       retries: 5
148 |     restart: always
149 |
150 | volumes:
151 |   postgres-db-volume:
152 |
--------------------------------------------------------------------------------
/docker_mlflow/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.7
2 |
3 | # Create app directory
4 | WORKDIR /app
5 |
6 | # Install app dependencies
7 | COPY docker_mlflow/requirements.txt .
8 |
9 | RUN pip install -r requirements.txt
10 |
11 | # Bundle app source
12 | COPY . .
13 |
14 | EXPOSE 600
15 |
16 | CMD [ "mlflow", "server","--host","0.0.0.0","--port","600"]
--------------------------------------------------------------------------------
/docker_mlflow/requirements.txt:
--------------------------------------------------------------------------------
1 | mlflow==1.29.0
--------------------------------------------------------------------------------
/docs/imgs/airflow_home.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabioba/mlops-architecture/86b8b32aa6ca2c6c781a5af7b331314d0906cda0/docs/imgs/airflow_home.png
--------------------------------------------------------------------------------
/docs/imgs/mlflow_home.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabioba/mlops-architecture/86b8b32aa6ca2c6c781a5af7b331314d0906cda0/docs/imgs/mlflow_home.png
--------------------------------------------------------------------------------