├── .gitignore
├── .secrets
├── Dockerfile
├── README.md
├── script.R
└── why-r-2022-serverless-r-in-the-cloud.Rproj
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.Rproj.user
.Rhistory
.RData
.Ruserdata
.env
.secrets
--------------------------------------------------------------------------------
/.secrets:
--------------------------------------------------------------------------------
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM rocker/tidyverse:latest

# Install the packages script.R needs on top of the tidyverse base image
RUN R -e "install.packages(c('aws.s3', 'box', 'dotenv', 'logger'))"

# Bake the script and the credentials into the image (see "Further steps"
# in the README for why baking in .secrets is not ideal)
COPY script.R /
COPY .secrets /

CMD Rscript /script.R
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Why R? 2022: Serverless R in the Cloud
## Deploying R into Production with AWS and Docker
### By Ismail Tigrek ([linkedin.com/in/ismailtigrek](https://www.linkedin.com/in/ismailtigrek/))
As presented at the Why R? Turkey 2022 conference: https://www.youtube.com/watch?v=dgkm0QkWXag

This tutorial walks through deploying R code, machine learning models, or Shiny applications in the cloud. With this knowledge, you will be able to take any local R-based project you've built on your machine or at your company and deploy it into production on AWS using modern serverless and microservices architectures. To do this, you will learn how to properly containerize R code with Docker, giving you reproducible environments, and how to set up event-based and time-based triggers. We will build out a real example that reads in live data, processes it, and writes it into a data lake, all in the cloud.

This tutorial has been published in the Why R? 2022 Abstract Book: [link](https://whyr.pl/2022/turkey/abstract_book/konu%C5%9Fmalar.html#serverless-r-in-the-cloud---deploying-r-into-production-with-aws-and-docker)

![image](https://user-images.githubusercontent.com/6436162/167397606-17b9e8ad-3eac-478f-bb72-13e4c327f4d2.png)

# AWS Resources

These are the AWS resources and their names as used in this codebase. You will need to change the names in your version of the code to match the names of your resources. You should be able to create all the resources below with the same names, except for the S3 buckets, whose names must be globally unique.

#### S3 Buckets: _whyr2022input_, _whyr2022output_

#### ECR Repository: _whyr2022_

#### ECS Cluster: _ETL_

#### ECS Task: _whyr2022_

#### EventBridge Rule: _whyr2022input_upload_

# Setting Up

## AWS

1. [Create and activate an AWS account](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/)
2. [Retrieve access keys](https://www.msp360.com/resources/blog/how-to-find-your-aws-access-key-id-and-secret-access-key/)
3. Create S3 buckets
   - Create input bucket (_whyr2022input_)
   - Create output bucket (_whyr2022output_)
   - [Enable CloudTrail event logging for S3 buckets and objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-cloudtrail-logging-for-s3.html)
4. Create ECR repository (_whyr2022_)
5. Create ECS cluster (_ETL_)
6. Create ECS task definition (_whyr2022_)
7. Create EventBridge rule (_whyr2022input_upload_)
   - Create event pattern

```
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["whyr2022input"]
    }
  }
}
```
   - Set target to ECS task _whyr2022_ in ECS cluster _ETL_

## Your Computer
1. [Install and set up Docker Desktop](https://docs.docker.com/desktop/windows/install/)
2. [Install and set up the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
3. [Create a named AWS profile called _whyr2022_](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html)
4. Put your access keys into _.secrets_

# Deployment
1. Authenticate your Docker client to the registry (replace _631607388267_ with your own AWS account ID here and below)
```
aws ecr get-login-password --region us-east-1 --profile whyr2022 | docker login --username AWS --password-stdin 631607388267.dkr.ecr.us-east-1.amazonaws.com
```
2. Build the Docker image
```
docker build -t whyr2022 .
```
3. Run the Docker image locally to test it
```
docker run whyr2022
```
4. Tag the Docker image
```
docker tag whyr2022:latest 631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest
```
5. Push the Docker image to AWS ECR
```
docker push 631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest
```

## View logs

You can view the logs of all container runs in AWS CloudWatch under Log Groups.

# Further steps

This was only meant to be a brief tutorial that fits into 30 minutes, so many crucial steps were omitted for brevity. Ideally, you should also do the following:

## Create an IAM user for your production code

We used the access keys of our root AWS account, which is not ideal for security reasons. Use your root account to create an admin user for yourself, then lock the root credentials away and never use them again unless absolutely necessary. Using your new admin account, create another IAM user for your production code, and replace the access keys in your _.secrets_ file with the access keys of that user.

## Remove access keys from your code

Ideally, you don't want to store your access keys inside your code or Docker image at all. Instead, pass the access keys in as environment variables when creating your ECS task definition.

## Replace Docker image with a general image

We baked our script into our Docker image. This is not ideal, since every time you update your code you will need to rebuild the Docker image and re-push it to ECR. Instead, create a general Docker image that reads the name of a script to pull from an S3 bucket (you can pass this name in as an ECS task environment variable). This way, you can have a bucket that contains your R scripts and only need to build your Docker image once; every time your container is deployed, it will pull the latest version of your script from the S3 bucket.
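A minimal sketch of such a generic entrypoint, assuming a hypothetical _whyr2022scripts_ bucket and a `SCRIPT_NAME` task environment variable (neither is created in this tutorial):

```
# Hypothetical generic entrypoint. SCRIPT_NAME and the whyr2022scripts
# bucket are illustrative names, not resources created in this tutorial.
# AWS credentials are assumed to arrive as ECS task environment variables
# (see "Remove access keys from your code" above).
script_name <- Sys.getenv("SCRIPT_NAME")
stopifnot(nzchar(script_name))  # fail fast if the task is misconfigured
aws.s3::save_object(object = script_name, bucket = "whyr2022scripts",
                    file = script_name)
source(script_name)
```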
## Make code more robust

Our script relies on a number of assumptions, such as:
- only one file is uploaded to _whyr2022input_ at a time
- only RDS files are uploaded to _whyr2022input_
- the files uploaded to _whyr2022input_ are dataframes with two numeric columns

Production code should not run on assumptions. Everything should be validated, and possible errors or edge cases should be handled gracefully; a validation sketch follows this section.

Another thing that can be done to enhance the pipeline is to mark "used" input files, for example by appending "-used_" to the file name (see the second sketch below).
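A minimal validation sketch for the assumptions above, mirroring how `script.R` reads the input (the `latest_key` value is illustrative; in practice it would come from the bucket listing, before the ".rds" suffix is stripped):

```
# Sketch: validate the input object instead of assuming it.
bucket_input <- "whyr2022input"
latest_key   <- "example-upload.rds"  # illustrative value
stopifnot(grepl("\\.rds$", latest_key, ignore.case = TRUE))  # only RDS files
df <- aws.s3::s3readRDS(glue::glue("s3://{bucket_input}/{latest_key}"))
stopifnot(
  is.data.frame(df),                       # must be a data frame...
  ncol(df) == 2,                           # ...with exactly two columns...
  all(vapply(df, is.numeric, logical(1)))  # ...both of them numeric
)
```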
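And a sketch of the marking step: S3 has no rename operation, so the object is copied to the marked name and the original deleted (again, the key is an illustrative value):

```
# Sketch: mark a processed input as used by renaming it via copy + delete,
# appending the "-used_" marker suggested above.
bucket_input <- "whyr2022input"
object_key   <- "example-upload.rds"          # illustrative value
used_key     <- paste0(object_key, "-used_")  # "example-upload.rds-used_"
aws.s3::copy_object(object_key, used_key, bucket_input, bucket_input)
aws.s3::delete_object(object_key, bucket = bucket_input)
```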
## Convert all infrastructure to code

We provisioned all our resources through the AWS Console. This is not ideal, since we cannot easily recreate the allocation and configuration of those resources. Ideally, you want to codify this process using an Infrastructure-as-Code solution (e.g., Terraform).
--------------------------------------------------------------------------------
/script.R:
--------------------------------------------------------------------------------
#### ONE ####
# Setup: load the pipe operator, read the AWS keys from .secrets, and name
# the input and output buckets
logger::log_info("Starting script")
box::use(magrittr[`%>%`])
dotenv::load_dot_env(".secrets")
bucket_input <- "whyr2022input"
bucket_output <- "whyr2022output"

#### TWO ####
# Find the key of the most recently modified object in the input bucket
logger::log_info("Pulling latest file from {bucket_input}")
latest_file <- aws.s3::get_bucket(bucket_input) %>%
  tibble::as_tibble() %>%
  dplyr::arrange(dplyr::desc(LastModified)) %>%
  dplyr::slice(1) %>%
  dplyr::pull(Key) %>%
  stringr::str_replace("\\.rds", "")
logger::log_info("Latest file from {bucket_input} pulled: {latest_file}.rds")

#### THREE ####
# Read the RDS file from S3 and build a scatter plot of its two columns
logger::log_info("Creating scatter plot")
s3_uri <- glue::glue("s3://{bucket_input}/{latest_file}.rds")
plot <- aws.s3::s3readRDS(s3_uri) %>%
  setNames(c("x", "y")) %>%
  ggplot2::ggplot(ggplot2::aes(x, y)) +
  ggplot2::geom_point() +
  ggplot2::ggtitle(latest_file)
image_file <- glue::glue("output_{latest_file}.png")

#### FOUR ####
# Save the plot as a local PNG file
logger::log_info("Saving plot as image: {image_file}")
ggplot2::ggsave(image_file, plot) %>%
  suppressWarnings() %>%
  suppressMessages()

#### FIVE ####
# Upload the PNG to the output bucket
logger::log_info("Pushing image to {bucket_output}")
aws.s3::put_object(image_file, image_file, bucket_output) %>%
  invisible()

#### END ####
logger::log_success("{image_file} created from {latest_file}.rds and pushed to {bucket_output}")
--------------------------------------------------------------------------------
/why-r-2022-serverless-r-in-the-cloud.Rproj:
--------------------------------------------------------------------------------
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Yes

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX
--------------------------------------------------------------------------------