├── .gitignore
├── README.md
└── airflow_home
    ├── dags
    │   └── my_first_dag.py
    └── scripts
        ├── A_task.R
        ├── B_task.R
        ├── C_task.R
        ├── D_task.Rmd
        └── run_r.sh

/.gitignore:
--------------------------------------------------------------------------------
 1 | *.pyc
 2 | __pycache__
 3 | *.idea
 4 | .DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Airflow-R-tutorial
 2 | ## Airflow tutorial for running R scripts
 3 | 
 4 | ### How to run
 5 | 
 6 | Create a virtual environment (with Python and R) and activate it. In this case, I created it using Anaconda:
 7 | 
 8 |     cd path/to/Airflow-R-tutorial
 9 |     conda create -n my_airflow_env r-essentials r-base
10 |     conda activate my_airflow_env
11 | 
12 | Now, install apache-airflow:
13 | 
14 |     sudo pip3 install apache-airflow
15 | 
16 | (Note: you may need to run `sudo pip3 install SQLAlchemy==1.3.18` to override a newer, incompatible SQLAlchemy version.)
17 | 
18 | Create the AIRFLOW_HOME directory and set the AIRFLOW_HOME environment variable. To be more explicit, I called this directory airflow_home:
19 | 
20 |     cd path/to/Airflow-R-tutorial
21 |     mkdir airflow_home
22 |     export AIRFLOW_HOME=`pwd`/airflow_home
23 | 
24 | Now check that everything is OK:
25 | 
26 |     airflow version
27 | 
28 | Initialize the database:
29 | 
30 |     airflow db init
31 | 
32 | Create a user:
33 | 
34 |     airflow users create \
35 |         --username admin \
36 |         --firstname FIRST_NAME \
37 |         --lastname LAST_NAME \
38 |         --role Admin \
39 |         --email admin@example.org
40 | 
41 | Start the webserver to access the UI:
42 | 
43 |     airflow webserver
44 | 
45 | In a new terminal, start the scheduler:
46 | 
47 |     cd path/to/Airflow-R-tutorial
48 |     conda activate my_airflow_env
49 |     export AIRFLOW_HOME=`pwd`/airflow_home
50 | 
51 |     airflow scheduler
52 | 
53 | To access the Airflow UI, open localhost:8080.
54 | You will find your DAG there.
Unpause it if it is paused.
55 | Trigger it by clicking the "play" button in the upper right corner.
56 | 
57 | For more detailed instructions, visit [this post](https://lcalcagni.medium.com/running-r-scripts-in-airflow-using-airflow-bashoperators-6d827f5da5b1).
--------------------------------------------------------------------------------
/airflow_home/dags/my_first_dag.py:
--------------------------------------------------------------------------------
 1 | import airflow
 2 | from airflow.models import DAG
 3 | from airflow.operators.bash import BashOperator
 4 | import os
 5 | 
 6 | # Build the absolute path to the scripts directory
 7 | cwd = os.getcwd()
 8 | cwd = cwd + '/airflow_home/scripts/'
 9 | 
10 | # Define the default arguments
11 | args = {
12 |     'owner': 'your_name',
13 |     'start_date': airflow.utils.dates.days_ago(2),
14 | }
15 | 
16 | # Instantiate the DAG, passing the args as default_args
17 | dag = DAG(
18 |     dag_id='my_dag_id',
19 |     default_args=args,
20 |     schedule_interval=None
21 | )
22 | 
23 | # Define the 4 tasks. The trailing space in each bash_command keeps
24 | # Airflow from trying to load the script path as a Jinja template file.
25 | A = BashOperator(
26 |     task_id='A_get_users',
27 |     bash_command=f'{cwd}run_r.sh {cwd}A_task.R ',
28 |     dag=dag,
29 | )
30 | B = BashOperator(
31 |     task_id='B_counts_by_gender',
32 |     bash_command=f'{cwd}run_r.sh {cwd}B_task.R ',
33 |     dag=dag,
34 | )
35 | C = BashOperator(
36 |     task_id='C_counts_by_age',
37 |     bash_command=f'{cwd}run_r.sh {cwd}C_task.R ',
38 |     dag=dag,
39 | )
40 | command_line = f"Rscript -e \"rmarkdown::render('{cwd}D_task.Rmd')\""
41 | D = BashOperator(
42 |     task_id='D_html_report',
43 |     bash_command=f'{command_line} ',
44 |     dag=dag,
45 | )
46 | 
47 | # Define the task dependencies: A fans out to B and C, which both feed D
48 | A >> B
49 | A >> C
50 | [B, C] >> D
--------------------------------------------------------------------------------
/airflow_home/scripts/A_task.R:
--------------------------------------------------------------------------------
 1 | library(httr)
 2 | library(jsonlite)
 3 | 
 4 | res = GET(
 5 |     url = "https://randomuser.me/api/",
 6 |     query = list(
 7 |         results = 200,
 8 |         nat = "ca",
 9 |         inc = "gender,name,dob"
10 |     )
11 | )
12 | 
13 | data = fromJSON(content(res, "text"))
14 | 
15 | # write.csv always uses a comma separator, so no sep argument is needed
16 | write.csv(data$results, "users.csv", row.names = FALSE)
--------------------------------------------------------------------------------
/airflow_home/scripts/B_task.R:
--------------------------------------------------------------------------------
 1 | library(ggplot2)
 2 | 
 3 | # Load data
 4 | data <- read.csv("users.csv", header = TRUE)
 5 | 
 6 | # Barplot of user counts by gender
 7 | p <- ggplot(data, aes(x = as.factor(gender), fill = gender)) +
 8 |     geom_bar(stat = "count", position = "stack") +
 9 |     labs(x = "Gender",
10 |          y = "Count")
11 | 
12 | png("counts_by_gender.png")
13 | print(p)
14 | dev.off()
--------------------------------------------------------------------------------
/airflow_home/scripts/C_task.R:
--------------------------------------------------------------------------------
 1 | library(ggplot2)
 2 | 
 3 | # Load data
 4 | data <- read.csv("users.csv", header = TRUE)
 5 | 
 6 | # Barplot of user counts by age
 7 | p <- ggplot(data, aes(x = as.factor(dob.age))) +
 8 |     geom_bar(stat = "count", position = "stack", fill = "#FF6666") +
 9 |     labs(x = "Age",
10 |          y = "Count")
11 | 
12 | png("counts_by_age.png", width = 800, height = 400)
13 | print(p)
14 | dev.off()
--------------------------------------------------------------------------------
/airflow_home/scripts/D_task.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "Simple Report"
 3 | output: html_document
 4 | ---
 5 | 
 6 | ```{r setup, include=FALSE}
 7 | knitr::opts_chunk$set(echo = TRUE)
 8 | ```
 9 | 
10 | ### Counts by gender
11 | ![](counts_by_gender.png)
12 | 
13 | ### Counts by age
14 | ![](counts_by_age.png)
--------------------------------------------------------------------------------
/airflow_home/scripts/run_r.sh:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env Rscript
 2 | 
 3 | args = commandArgs(trailingOnly = TRUE)
 4 | 
 5 | # Change to the script's directory so relative paths (e.g. users.csv) resolve
 6 | setwd(dirname(args[1]))
 7 | source(args[1])
--------------------------------------------------------------------------------
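
Note on my_first_dag.py: the quoting in the `command_line` assignment is easy to get wrong, since single quotes must end up inside `rmarkdown::render()` while double quotes wrap the whole `-e` expression for the shell. A minimal sketch of building that string with one f-string (the `cwd` value here is an illustrative placeholder; the DAG derives it from `os.getcwd()`):

```python
# Sketch: build the shell command that renders the R Markdown report.
# cwd is a placeholder path; in my_first_dag.py it comes from os.getcwd().
cwd = "/path/to/Airflow-R-tutorial/airflow_home/scripts/"

# Single quotes surround the path inside rmarkdown::render();
# escaped double quotes surround the whole -e expression for bash.
command_line = f"Rscript -e \"rmarkdown::render('{cwd}D_task.Rmd')\""
print(command_line)
# → Rscript -e "rmarkdown::render('/path/to/Airflow-R-tutorial/airflow_home/scripts/D_task.Rmd')"
```

This produces the same string as the concatenated version in the DAG, with the quoting visible at a glance.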