This tutorial was originally developed for PyCon US 2019.

My name is Tania. I live in Manchester, UK, where I work as a Cloud Advocate for Microsoft.

Over the years, I have worked as a data engineer, machine learning engineer, and research software engineer. I love data-intensive environments and I am particularly interested in the tools and workflows to deliver robust, reproducible data insights.

If you have any questions or feedback about this tutorial, please file an issue at https://github.com/trallard/airflow-tutorial/issues/new.

You can also contact me via the following channels:

- E-mail: trallard@bitsandchips.me
- Twitter: @ixek

All attendees of this workshop are expected to adhere to PyCon's Code of Conduct, in brief: be open, considerate, and respectful.

The content in this workshop is licensed under CC-BY-SA 4.0, which means that you can use, remix, and redistribute it so long as attribution to the original author (Tania Allard) is maintained.

The logo used here was designed by Ashley McNamara for the Microsoft Developer Advocates team's use.
--------------------------------------------------------------------------------
/source/about.md:
--------------------------------------------------------------------------------
# About the workshop

We will be taking a look at the basic concepts of data pipelines as well as practical use cases using Python.

## About you:
- Some experience using the command line
- Intermediate Python knowledge / use
- Able to apply what we learn and adapt it to your own use cases
- Interested in data and systems
- Aspiring or current data engineer
- Some knowledge about systems and databases (enough to be dangerous)

## Our focus for the day
- Gain a greater understanding of how to build data pipelines using the Python toolset
- Focus on concepts
- Apply knowledge with each library
- Give you the building blocks

## Keeping on track

You will find 🚦 across the tutorial examples. We will use this to identify how folks are doing over the course of the workshop (if following along in person).
Place the post-it as follows:

🚦 Purple post-it: all good, the task has been completed

🚦 Orange post-it: I need extra time or need help with the task at hand
--------------------------------------------------------------------------------
/source/airflow-intro.md:
--------------------------------------------------------------------------------
# Airflow basics

## What is Airflow?

Airflow is a workflow engine, which means it:

- Manages scheduling and running of jobs and data pipelines
- Ensures jobs are ordered correctly based on their dependencies
- Manages the allocation of scarce resources
- Provides mechanisms for tracking the state of jobs and recovering from failure

It is highly versatile and can be used across many domains.

## Basic Airflow concepts

- **Task**: a defined unit of work (these are called operators in Airflow)
- **Task instance**: an individual run of a single task. Task instances also have an indicative state, which could be "running", "success", "failed", "skipped", "up for retry", etc.
- **DAG**: directed acyclic graph, a set of tasks with explicit execution order, beginning, and end
- **DAG run**: an individual execution/run of a DAG

**Debunking the DAG**

The vertices and edges (the arrows linking the nodes) have an order and direction associated with them.

Each node in a DAG corresponds to a task, which in turn represents some sort of data processing. For example:

Node A could be the code for pulling data from an API, node B could be the code for anonymizing the data, node C could be the code for checking that there are no duplicate records, and so on.

These 'pipelines' are acyclic since they need a point of completion.

**Dependencies**

Each of the edges has a particular direction that shows the relationship between certain nodes. For example, we can only anonymize data once it has been pulled out of the API.

## Idempotency

This is one of the most important characteristics of good ETL architectures.

When we say that something is idempotent it means it will produce the same result regardless of how many times it is run (i.e. the results are reproducible).
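As a rough illustration (this example is an addition, not part of the original tutorial, and the file layout and function names are made up), compare a non-idempotent task that appends to a file with an idempotent one that rewrites the partition for its run date:

```python
import csv
from pathlib import Path


def append_rows(rows, path="events.csv"):
    # NOT idempotent: every re-run appends the same rows again,
    # so the output depends on how many times the task has run.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)


def overwrite_partition(rows, run_date, base_dir="output"):
    # Idempotent: each run (re)writes the partition for its execution date,
    # so running it once or ten times leaves the same data on disk.
    partition = Path(base_dir) / f"date={run_date}"
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / "events.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

Designing tasks in the second style is what makes retries and backfills safe.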
Reproducibility is particularly important in data-intensive environments, as it ensures that the same inputs will always return the same outputs.

## Airflow components

There are 4 main components to Apache Airflow:

### Web server

The GUI. Under the hood this is a Flask app where you can track the status of your jobs and read logs from a remote file store (e.g. [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview/?wt.mc_id=PyCon-github-taallard)).

### Scheduler

This component is responsible for scheduling jobs. It is a multithreaded Python process that uses the DAG object to decide what tasks need to be run, when, and where.

The task state is retrieved and updated from the database accordingly. The web server then uses these saved states to display job information.

### Executor

The mechanism that gets the tasks done.

### Metadata database

- Powers how the other components interact
- Stores the Airflow states
- All processes read and write from here

## Workflow as code

One of the main advantages of using a workflow system like Airflow is that everything is code, which makes your workflows maintainable, versionable, testable, and collaborative.

Thus your workflows become more explicit and maintainable (atomic tasks).

Not only is your code dynamic, but so is your infrastructure.

### Defining tasks

Tasks are defined based on the abstraction of `Operators` (see the Airflow docs [here](https://airflow.apache.org/concepts.html#operators)), which represent a single **idempotent task**.

The best practice is to have atomic operators (i.e. operators that can stand on their own and do not need to share resources with each other).
You can choose among:
- `BashOperator`
- `PythonOperator`
- `EmailOperator`
- `SimpleHttpOperator`
- `MySqlOperator` (and other database operators)

Examples (both snippets assume a `dag` object has already been defined):

```python
t1 = BashOperator(task_id='print_date',
                  bash_command='date',
                  dag=dag)
```

```python
from pprint import pprint


def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'


run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,
    python_callable=print_context,
    dag=dag,
)
```

## Comparing Luigi and Airflow

### Luigi

- Created at Spotify (named after the plumber)
- Open sourced in late 2012
- "GNU make for data"

### Airflow

- Created by the Airbnb data team
- Open-sourced mid-2015
- Apache incubator mid-2016
- ETL pipelines

### Similarities

- Python open source projects for data pipelines
- Integrate with a number of sources (databases, filesystems)
- Tracking of failures, retries, and successes
- Ability to identify the dependencies and execution order

### Differences

- Scheduler support: Airflow has built-in scheduling support
- Scalability: Airflow has had stability issues in the past
- Web interfaces

| Airflow                                          | Luigi                                                                           |
| ------------------------------------------------ | ------------------------------------------------------------------------------- |
| Tasks are defined by a user-named `dag_id`       | Tasks are defined by the task name and parameters                               |
| Task retries based on definitions                | Decides if a task is done via its input/output                                  |
| Task code is sent to the worker                  | Workers are started by the Python file where the tasks are defined              |
| Centralized scheduler (Celery spins up workers)  | Centralized scheduler in charge of deduplicating and sending tasks (Tornado based) |
--------------------------------------------------------------------------------
/source/azure.md:
--------------------------------------------------------------------------------
### Deploying to the cloud

[This Docker image](https://hub.docker.com/r/puckel/docker-airflow/) has been used as the base for many deployments.

Let's try and get Airflow running on Docker:

```
docker pull puckel/docker-airflow
```

Once you have the image you can run it as:

```
docker run -d --rm -p 8080:8080 puckel/docker-airflow webserver
```

To load the examples you can do:

```
docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow
```

Based on this container we can deploy to [Azure](https://azure.microsoft.com/en-us/blog/deploying-apache-airflow-in-azure-to-build-and-run-data-pipelines//?wt.mc_id=PyCon-github-taallard).

[Deploy to Azure](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fsavjani%2Fazure-quickstart-templates%2Fmaster%2F101-webapp-linux-airflow-postgresql%2Fazuredeploy.json/?wt.mc_id=PyCon-github-taallard)

Note that this is a very basic deployment on Azure.
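As a quick sanity check (this step is an addition to the page above, not part of the original), you can confirm locally that the container came up and that the webserver answers on the mapped port before moving on to the cloud deployment:

```
# list running containers and confirm the 8080 port mapping
docker ps

# the Airflow UI should respond on the mapped port
curl -I http://localhost:8080
```

If the UI does not respond, `docker logs` with the container ID shown by `docker ps` usually tells you why.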
--------------------------------------------------------------------------------
/source/conf.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))


# -- Project information -----------------------------------------------------

project = "Airflow tutorial"
copyright = "2019, Tania Allard"
author = "Tania Allard"

# The short X.Y version
version = ""
# The full version, including alpha/beta/rc tags
release = ""


# -- General configuration ---------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.doctest",
    "sphinx.ext.intersphinx",
    "sphinx.ext.mathjax",
    "sphinx.ext.githubpages",
    "recommonmark",
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = [".rst", ".md"]

# The master toctree document.
master_doc = "index"

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = "monokai"


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "alabaster"

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
    "github_banner": False,
    "github_button": True,
    "github_user": "trallard",
    "github_repo": "airflow-tutorial",
    "github_type": "star",
    "font_family": "Nunito, Georgia, sans",
    "head_font_family": "Nunito, Georgia, serif",
    "code_font_family": "'Source Code Pro', 'Consolas', monospace",
    "description": "a.k.a an introduction to all things DAGS and pipelines joy",
    "show_relbars": True,
    "logo": "python.png",
}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# Custom sidebar templates, maps document names to template names.
html_sidebars = {
    "**": [
        "about.html",
        "localtoc.html",
        "searchbox.html",
        "navigation.html",
        "relations.html",
        "sidebarlogo.html",
    ]
}

# -- Options for HTMLHelp output ---------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = "Airflowtutorialdoc"


# -- Options for LaTeX output ------------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',
    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',
    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',
    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (
        master_doc,
        "Airflowtutorial.tex",
        "Airflow tutorial Documentation",
        "Tania Allard",
        "manual",
    )
]


# -- Options for manual page output ------------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, "airflowtutorial", "Airflow tutorial Documentation", [author], 1)
]


# -- Options for Texinfo output ----------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (
        master_doc,
        "Airflowtutorial",
        "Airflow tutorial Documentation",
        author,
        "Airflowtutorial",
        "One line description of project.",
        "Miscellaneous",
    )
]


# -- Options for Epub output -------------------------------------------------

# Bibliographic Dublin Core info.
epub_title = project

# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''

# A unique identification for the text.
#
# epub_uid = ''

# A list of files that should not be packed into the epub file.
epub_exclude_files = ["search.html"]


# -- Extension configuration -------------------------------------------------

# -- Options for intersphinx extension ---------------------------------------

# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {"https://docs.python.org/": None}
--------------------------------------------------------------------------------
/source/first-airflow.md:
--------------------------------------------------------------------------------
# Airflow 101: working locally and familiarising yourself with the tool

### Pre-requisites

The following prerequisites are needed:

- The libraries detailed in the Setting up section (installed either via conda or pipenv)
- MySQL installed
- A text editor
- A command line

## Getting your environment up and running

If you followed the instructions, you should have Airflow installed as well as the rest of the packages we will be using.

So let's get our environment up and running.

If you are using conda, start your environment via:

```
$ source activate airflow-env
```

If using pipenv:

```
$ pipenv shell
```

This will start a shell within a virtual environment; to exit the shell type `exit`, which will also leave the virtual environment.

## Starting Airflow locally

Airflow home lives in `~/airflow` by default, but you can change the location before installing Airflow. You first need to set the `AIRFLOW_HOME` environment variable and then install Airflow. For example, using pip:

```sh
export AIRFLOW_HOME=~/mydir/airflow

# install from PyPI using pip
pip install apache-airflow
```

Once you have completed the installation you should see something like this in the `airflow` directory (wherever it lives for you):

```
drwxr-xr-x    - myuser 18 Apr 14:02 .
.rw-r--r--  26k myuser 18 Apr 14:02 ├── airflow.cfg
drwxr-xr-x    - myuser 18 Apr 14:02 ├── logs
drwxr-xr-x    - myuser 18 Apr 14:02 │  └── scheduler
drwxr-xr-x    - myuser 18 Apr 14:02 │     ├── 2019-04-18
lrwxr-xr-x   46 myuser 18 Apr 14:02 │     └── latest -> /Users/myuser/airflow/logs/scheduler/2019-04-18
.rw-r--r-- 2.5k myuser 18 Apr 14:02 └── unittests.cfg
```

We need to create a local dag folder:

```
mkdir ~/airflow/dags
```

As your project evolves, your directory will look something like this:

```
airflow            # the root directory.
├── dags           # root folder for all dags. files inside folders are not searched for dags.
│   ├── my_dag.py  # my dag (definitions of tasks/operators) including precedence.
│   └── ...
├── logs           # logs for the various tasks that are run
│   └── my_dag     # DAG specific logs
│   │   ├── src1_s3  # folder for task-specific logs (log files are created by date of a run)
│   │   ├── src2_hdfs
│   │   ├── src3_s3
│   │   └── spark_task_etl
├── airflow.db     # SQLite database used by Airflow internally to track the status of each DAG.
├── airflow.cfg    # global configuration for Airflow (this can be overridden by config inside the file.)
└── ...
```

## Prepare your database

As we mentioned before, Airflow uses a database to keep track of the tasks and their statuses, so it is critical to have one set up.

To start the default database we can run `airflow initdb`. This will initialize your database via alembic so that it matches the latest Airflow release.

The default database used is `sqlite`, which means you cannot parallelize tasks using this database. Since we have MySQL and the MySQL client installed, we will set them up so that we can use them with Airflow.

🚦 Create an `airflow` database

From the command line:

```
mysql -u root -p
mysql> CREATE DATABASE airflow CHARACTER SET utf8 COLLATE utf8_unicode_ci;
mysql> GRANT ALL PRIVILEGES ON airflow.* To 'airflow'@'localhost';
mysql> FLUSH PRIVILEGES;
```

and initialize the database:

```
airflow initdb
```

Notice that this will fail with the default `airflow.cfg`.

## Update your local configuration

Open your Airflow configuration file `~/airflow/airflow.cfg` and make the following changes:

```
executor = CeleryExecutor
```

```
# http://docs.celeryproject.org/en/latest/userguide/configuration.html#broker-settings
# needs rabbitmq running
broker_url = amqp://guest:guest@127.0.0.1/


# http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-result-backend-settings
result_backend = db+mysql://airflow:airflow@localhost:3306/airflow

sql_alchemy_conn = mysql://airflow:python2019@localhost:3306/airflow
```

Here we are replacing the default executor (`SequentialExecutor`) with the `CeleryExecutor` so that we can run multiple DAGs in parallel.
We also replace the default `sqlite` database with our newly created `airflow` database.

Now we can initialize the database:

```
airflow initdb
```

Let's now start the web server locally:

```
airflow webserver -p 8080
```

We can head over to [http://localhost:8080](http://localhost:8080) now and you will see that there are a number of example DAGs already there.

🚦 Take some time to familiarise yourself with the UI and get your local instance set up.

Now let's have a look at the connections ([http://localhost:8080/admin/connection/](http://localhost:8080/admin/connection/)): go to `admin > connections`. You should be able to see a number of connections available. For this tutorial, we will use some of these connections, including `mysql`.

### Commands

Let us go over some of the commands. Back on your command line:

```
airflow list_dags
```

We can list the tasks of a DAG in a tree view:

```
airflow list_tasks tutorial --tree
```

We can test the DAGs too, but we need to set a date parameter so that they execute:

```
airflow test tutorial print_date 2019-05-01
```

(note that you cannot use a future date or you will get an error)

```
airflow test tutorial templated 2019-05-01
```

When using the test commands, the runs are not saved in the database.
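Before starting the scheduler, it can save time to confirm that the connection string you put in `airflow.cfg` actually works. This check is an addition to the original tutorial; substitute whatever credentials you granted to the `airflow` user above:

```
# should print "1" if the airflow user can reach the airflow database
mysql -u airflow -p -h 127.0.0.1 airflow -e "SELECT 1;"
```

If this fails, fix the MySQL grants or the `sql_alchemy_conn` value before running `airflow initdb` again.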
Now let's start the scheduler:

```
airflow scheduler
```

Behind the scenes, the scheduler monitors and stays in sync with the DAG folder and all the DAG objects it contains. The Airflow scheduler is designed to run as a service in an Airflow production environment.

Now with the scheduler up and running we can trigger an instance:

```
$ airflow run example_bash_operator runme_0 2015-01-01
```

This run will be stored in the database and you can see the status change straight away.

What would happen, for example, if we wanted to run or trigger the `tutorial` task? 🤔

Let's try from the CLI and see what happens:

```
airflow trigger_dag tutorial
```

## Writing your first DAG

Let's create our first simple DAG.
Inside the dag directory (`~/airflow/dags`) create a `simple_dag.py` file.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def print_hello():
    return "Hello world!"


default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2019, 4, 30),
    "email": ["airflow@example.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

dag = DAG(
    "hello_world",
    description="Simple tutorial DAG",
    schedule_interval="0 12 * * *",
    default_args=default_args,
    catchup=False,
)

t1 = DummyOperator(task_id="dummy_task", retries=3, dag=dag)

t2 = PythonOperator(task_id="hello_task", python_callable=print_hello, dag=dag)

# sets t2 downstream of t1
t1 >> t2

# equivalent:
# t2.set_upstream(t1)
```

If it is properly set up you should be able to see it straight away on your instance.


### Now let's create a DAG from the previous ETL pipeline (kind of)

All hands on: check the solutions.
--------------------------------------------------------------------------------
/source/index.rst:
--------------------------------------------------------------------------------
.. Airflow tutorial documentation master file, created by
   sphinx-quickstart on Mon Apr 15 15:52:00 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Airflow tutorial
============================================
This tutorial was originally developed for PyCon US 2019.

.. toctree::
   :caption: Table of Contents
   :hidden:
   :maxdepth: 2

   setup
   about
   pipelines
   airflow-intro
   first-airflow

.. toctree::
   :maxdepth: 2
   :caption: Contents:

About your facilitator
======================

My name is Tania. I live in Manchester, UK, where I work as a
Cloud Advocate for Microsoft.

Over the years, I have worked as a data engineer, machine learning engineer,
and research software engineer. I love data-intensive
environments and I am particularly interested in the tools and workflows to
deliver robust, reproducible data insights.
If you have any questions or feedback about this tutorial, please
file an issue using the following link: `