├── .gitignore
├── .secrets
├── Dockerfile
├── README.md
├── script.R
└── why-r-2022-serverless-r-in-the-cloud.Rproj
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.Rproj.user
.Rhistory
.RData
.Ruserdata
.env
.secrets
--------------------------------------------------------------------------------
/.secrets:
--------------------------------------------------------------------------------
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM rocker/tidyverse:latest

# Install the packages script.R needs on top of the tidyverse base image
RUN R -e "install.packages(c('aws.s3', 'box', 'dotenv', 'logger'))"

# Bake the script and the credentials into the image (see "Further steps"
# in the README for why baking in .secrets is not ideal)
COPY script.R /
COPY .secrets /

CMD Rscript /script.R
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Why R? 2022: Serverless R in the Cloud
## Deploying R into Production with AWS and Docker
### By Ismail Tigrek ([linkedin.com/in/ismailtigrek](https://www.linkedin.com/in/ismailtigrek/))
As presented at the Why R? Turkey 2022 conference: https://www.youtube.com/watch?v=dgkm0QkWXag

This tutorial walks through deploying R code, machine learning models, or Shiny applications in the cloud. With this knowledge, you will be able to take any local R-based project you've built on your machine or at your company and deploy it into production on AWS using modern serverless and microservices architectures. To do this, you will learn how to properly containerize R code with Docker, giving you reproducible environments, and how to set up event-based and time-based triggers. We will build out a real example that reads in live data, processes it, and writes it into a data lake, all in the cloud.

This tutorial has been published in the Why R? 2022 Abstract Book: [link](https://whyr.pl/2022/turkey/abstract_book/konu%C5%9Fmalar.html#serverless-r-in-the-cloud---deploying-r-into-production-with-aws-and-docker)

![image](https://user-images.githubusercontent.com/6436162/167397606-17b9e8ad-3eac-478f-bb72-13e4c327f4d2.png)

# AWS Resources

These are the AWS resources and their names as used in this codebase. You will need to change the names in your version of the code to match the names of your resources. You should be able to create all the resources below with the same names, except for the S3 buckets, whose names must be globally unique.

#### S3 Buckets: _whyr2022input_, _whyr2022output_

#### ECR Repository: _whyr2022_

#### ECS Cluster: _ETL_

#### ECS Task: _whyr2022_

#### EventBridge Rule: _whyr2022input_upload_

# Setting Up

## AWS

1. [Create and activate an AWS account](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/)
2. [Retrieve access keys](https://www.msp360.com/resources/blog/how-to-find-your-aws-access-key-id-and-secret-access-key/)
3. Create S3 buckets
   - Create input bucket (_whyr2022input_)
   - Create output bucket (_whyr2022output_)
   - [Enable CloudTrail event logging for S3 buckets and objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-cloudtrail-logging-for-s3.html)
4. Create ECR repository (_whyr2022_)
5. Create ECS cluster (_ETL_)
6. Create ECS task definition (_whyr2022_)
7. Create EventBridge rule (_whyr2022input_upload_)
   - Create event pattern

```
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["whyr2022input"]
    }
  }
}
```
   - Set target to ECS task _whyr2022_ in ECS cluster _ETL_

## Your Computer
1. [Install and set up Docker Desktop](https://docs.docker.com/desktop/windows/install/)
2. [Install and set up the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
3. [Create a named AWS profile called _whyr2022_](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html)
4. Put your access keys into _.secrets_

# Deployment
1. Authenticate your Docker client to the registry (replace _631607388267_ with your own AWS account ID here and below)
```
aws ecr get-login-password --region us-east-1 --profile whyr2022 | docker login --username AWS --password-stdin 631607388267.dkr.ecr.us-east-1.amazonaws.com
```
2. Build the Docker image
```
docker build -t whyr2022 .
```
3. Run the Docker image locally to test it
```
docker run whyr2022
```
4. Tag the Docker image
```
docker tag whyr2022:latest 631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest
```
5. Push the Docker image to AWS ECR
```
docker push 631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest
```

## View logs

You can view the logs of all container runs in AWS CloudWatch under Log Groups.

# Further steps

This was only meant to be a brief tutorial that fits into 30 minutes, so many crucial steps were omitted for brevity. Ideally, you should also do the following:

## Create an IAM user for your production code

We used the access keys of our root AWS account, which is not ideal for security reasons. Use your root account to create an admin user for yourself, then lock the root credentials away and never use them again unless absolutely necessary. Using your new admin account, create another IAM user for your production code, and replace the access keys in your _.secrets_ file with the access keys of that user.

## Remove access keys from your code

Ideally, you don't want to store your access keys inside your code or Docker image at all. Instead, pass the access keys in as environment variables when creating your ECS task definition.

## Replace Docker image with a general image

We baked our script into our Docker image. This is not ideal, since every time you update your code you will need to rebuild the Docker image and re-push it to ECR. Instead, create a general Docker image that reads the name of a script to pull from an S3 bucket (you can pass this name in as an ECS task environment variable). This way, you can have a bucket that contains your R scripts and only need to build your Docker image once; every time your container is deployed, it will pull the latest version of your script from the S3 bucket.
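A minimal sketch of such a generic entrypoint, assuming a hypothetical _whyr2022scripts_ bucket and a `SCRIPT_NAME` task environment variable (neither is created in this tutorial):

```
# Hypothetical generic entrypoint. SCRIPT_NAME and the whyr2022scripts
# bucket are illustrative names, not resources created in this tutorial.
# AWS credentials are assumed to arrive as ECS task environment variables
# (see "Remove access keys from your code" above).
script_name <- Sys.getenv("SCRIPT_NAME")
stopifnot(nzchar(script_name))  # fail fast if the task is misconfigured
aws.s3::save_object(object = script_name, bucket = "whyr2022scripts",
                    file = script_name)
source(script_name)
```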
## Make code more robust

Our script relies on a number of assumptions, such as:
- only one file is uploaded to _whyr2022input_ at a time
- only RDS files are uploaded to _whyr2022input_
- the files uploaded to _whyr2022input_ are dataframes with two numeric columns

Production code should not run on assumptions. Everything should be validated, and possible errors or edge cases should be handled gracefully; a validation sketch follows this section.

Another thing that can be done to enhance the pipeline is to mark "used" input files, for example by appending "-used_" to the file name (see the second sketch below).
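A minimal validation sketch for the assumptions above, mirroring how `script.R` reads the input (the `latest_key` value is illustrative; in practice it would come from the bucket listing, before the ".rds" suffix is stripped):

```
# Sketch: validate the input object instead of assuming it.
bucket_input <- "whyr2022input"
latest_key   <- "example-upload.rds"  # illustrative value
stopifnot(grepl("\\.rds$", latest_key, ignore.case = TRUE))  # only RDS files
df <- aws.s3::s3readRDS(glue::glue("s3://{bucket_input}/{latest_key}"))
stopifnot(
  is.data.frame(df),                       # must be a data frame...
  ncol(df) == 2,                           # ...with exactly two columns...
  all(vapply(df, is.numeric, logical(1)))  # ...both of them numeric
)
```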
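And a sketch of the marking step: S3 has no rename operation, so the object is copied to the marked name and the original deleted (again, the key is an illustrative value):

```
# Sketch: mark a processed input as used by renaming it via copy + delete,
# appending the "-used_" marker suggested above.
bucket_input <- "whyr2022input"
object_key   <- "example-upload.rds"          # illustrative value
used_key     <- paste0(object_key, "-used_")  # "example-upload.rds-used_"
aws.s3::copy_object(object_key, used_key, bucket_input, bucket_input)
aws.s3::delete_object(object_key, bucket = bucket_input)
```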
## Convert all infrastructure to code

We provisioned all our resources through the AWS Console. This is not ideal, since we cannot easily recreate the allocation and configuration of those resources. Ideally, you want to codify this process using an Infrastructure-as-Code solution (e.g., Terraform).
--------------------------------------------------------------------------------
/script.R:
--------------------------------------------------------------------------------
#### ONE ####
# Setup: load the pipe operator, read the AWS keys from .secrets, and name
# the input and output buckets
logger::log_info("Starting script")
box::use(magrittr[`%>%`])
dotenv::load_dot_env(".secrets")
bucket_input <- "whyr2022input"
bucket_output <- "whyr2022output"

#### TWO ####
# Find the key of the most recently modified object in the input bucket
logger::log_info("Pulling latest file from {bucket_input}")
latest_file <- aws.s3::get_bucket(bucket_input) %>%
  tibble::as_tibble() %>%
  dplyr::arrange(dplyr::desc(LastModified)) %>%
  dplyr::slice(1) %>%
  dplyr::pull(Key) %>%
  stringr::str_replace("\\.rds", "")
logger::log_info("Latest file from {bucket_input} pulled: {latest_file}.rds")

#### THREE ####
# Read the RDS file from S3 and build a scatter plot of its two columns
logger::log_info("Creating scatter plot")
s3_uri <- glue::glue("s3://{bucket_input}/{latest_file}.rds")
plot <- aws.s3::s3readRDS(s3_uri) %>%
  setNames(c("x", "y")) %>%
  ggplot2::ggplot(ggplot2::aes(x, y)) +
  ggplot2::geom_point() +
  ggplot2::ggtitle(latest_file)
image_file <- glue::glue("output_{latest_file}.png")

#### FOUR ####
# Save the plot as a local PNG file
logger::log_info("Saving plot as image: {image_file}")
ggplot2::ggsave(image_file, plot) %>%
  suppressWarnings() %>%
  suppressMessages()

#### FIVE ####
# Upload the PNG to the output bucket
logger::log_info("Pushing image to {bucket_output}")
aws.s3::put_object(image_file, image_file, bucket_output) %>%
  invisible()

#### END ####
logger::log_success("{image_file} created from {latest_file}.rds and pushed to {bucket_output}")
--------------------------------------------------------------------------------
/why-r-2022-serverless-r-in-the-cloud.Rproj:
--------------------------------------------------------------------------------
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Yes

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX
--------------------------------------------------------------------------------