1 | Here you can find the materials for the "[Data Engineering 3: Batch Jobs and APIs](https://ceu.studyguide.timeedit.net/modules/ECBS5211?type=CORE)" course, part of the [MSc in Business Analytics](https://courses.ceu.edu/programs/ms/master-science-business-analytics) at CEU. For the previous editions, see [2017/2018](https://github.com/daroczig/CEU-R-prod/tree/2017-2018), [2018/2019](https://github.com/daroczig/CEU-R-prod/tree/2018-2019), [2019/2020](https://github.com/daroczig/CEU-R-prod/tree/2019-2020), [2020/2021](https://github.com/daroczig/CEU-R-prod/tree/2020-2021), [2021/2022](https://github.com/daroczig/CEU-R-prod/tree/2021-2022), [2022/2023](https://github.com/daroczig/CEU-R-prod/tree/2022-2023), and [2023/2024](https://github.com/daroczig/CEU-R-prod/tree/2023-2024).
2 |
3 | ## Table of Contents
4 |
5 | * [Table of Contents](#table-of-contents)
6 | * [Schedule](#schedule)
7 | * [Location](#location)
8 | * [Syllabus](#syllabus)
9 | * [Technical Prerequisites](#technical-prerequisites)
10 | * [Class Schedule](#class-schedule)
11 |
12 | * [Home assignment](#home-assignment)
13 | * [Getting help](#getting-help)
14 |
15 | ## Schedule
16 |
17 | 3 x 2 x 100 mins on April 28, May 5 and 12:
18 |
19 | * 13:30 - 15:10 session 1
20 | * 15:10 - 15:40 break
21 | * 15:40 - 17:20 session 2
22 |
23 | ## Location
24 |
25 | In-person at the Vienna campus (QS B-421).
26 |
27 | ## Syllabus
28 |
29 | Please find it in the `syllabus` folder of this repository.
30 |
31 | ## Technical Prerequisites
32 |
33 | 1.
You need a laptop with any operating system and a stable Internet connection.
34 | 2. Please make sure that Internet/network firewall rules are not limiting your access to unusual ports (e.g. 22, 8787, 8080, 8000), as we will heavily use these in the class (this can be a problem on a company network). CEU WiFi should have the related firewall rules applied for the class.
35 | 3. Join the Teams channel dedicated to the class at `ba-de4-2024` with the `symvo7f` team code.
36 | 4. When joining remotely, it's highly recommended to use a second monitor where you can follow the online stream, and keep your main monitor for your own work. The second monitor could be an external screen attached to your laptop, e.g. a TV, monitor, or projector, but if you don't have access to one, you may also use a tablet or phone to dial in to the Zoom call.
37 |
38 | ## Class Schedule
39 |
40 | ## Week 1
41 |
42 | **Goal**: learn how to run and schedule R jobs in the cloud.
43 |
44 | ### Background: Example use cases, and why use R in the cloud?
45 |
46 | Excerpts from https://daroczig.github.io/talks
47 |
48 | * "A Decade of Using R in Production" (Real Data Science USA - R meetup)
49 | * "Getting Things Logged" (RStudio::conf 2020)
50 | * "Analytics databases in a startup environment: beyond MySQL and Spark" (Budapest Data Forum 2018)
51 |
52 | ### Welcome to AWS!
53 |
54 | 1. Use the following sign-in URL to access the class AWS account: https://657609838022.signin.aws.amazon.com/console
55 | 2. Secure your access key(s), other credentials and any login information ...
56 |
57 |
... because a truly wise person learns from the mistakes of others!
58 |
59 | > "When I woke up the next morning, I had four emails and a missed phone call from Amazon AWS - something about 140 servers running on my AWS account, mining Bitcoin"
60 | -- [Hoffman said](https://www.theregister.co.uk/2015/01/06/dev_blunder_shows_github_crawling_with_keyslurping_bots)
61 |
62 | > "Nevertheless, now I know that Bitcoin can be mined with SQL, which is priceless ;-)"
63 | -- [Uri Shaked](https://medium.com/@urish/thank-you-google-how-to-mine-bitcoin-on-googles-bigquery-1c8e17b04e62)
64 |
65 | So set up 2FA (go to IAM / Users / username / Security credentials / Assigned MFA device): https://console.aws.amazon.com/iam
66 |
67 | PS: you probably do not need to store any access keys at all; instead, rely on IAM roles (and the Key Management Service, the Secrets Manager, and so on).
68 |
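Before pushing anything to a public repository, it can be worth grepping for strings that look like AWS access key IDs; this is a hypothetical local check, not part of the class setup, but it targets exactly the mistake the quotes above describe. Key IDs follow the well-known `AKIA` + 16 uppercase alphanumeric characters pattern, and the key below is AWS's documented example key:

```shell
# Hypothetical pre-push check: scan a directory for anything that looks
# like an AWS access key ID. The demo directory and file are made up here
# just so the snippet is self-contained.
mkdir -p /tmp/keyscan-demo
echo 'aws_access_key_id = AKIAIOSFODNN7EXAMPLE' > /tmp/keyscan-demo/config
grep -rEl 'AKIA[0-9A-Z]{16}' /tmp/keyscan-demo
# prints: /tmp/keyscan-demo/config
```

Hooking such a check into a pre-commit hook is a cheap way to learn from the mistakes of others.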
69 |
70 | 3. Let's use the `eu-west-1` Ireland region
71 |
72 | ### Getting access to EC2 boxes
73 |
74 | **Note**: we follow the instructions on Windows in the Computer Lab, but please find below how to access the boxes from Mac or Linux as well when working with the instances remotely.
75 |
76 | 1. Create (or import) an SSH key in AWS (EC2 / Key Pairs): https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#KeyPairs:sort=keyName -- including the Owner tag!
77 | 2. Get an SSH client:
78 |
79 |     * Windows -- Download and install PuTTY: https://www.putty.org
80 |     * Mac -- Install PuTTY for Mac using homebrew or macports (note that Homebrew must not be run via `sudo`):
81 |
82 | ```sh
83 | brew install putty
84 | sudo port install putty
85 | ```
86 |
87 |     * Linux -- the OpenSSH client is probably already installed, but to use the same tools on all operating systems, please install and use PuTTY on Linux too, eg on Ubuntu:
88 |
89 | ```sh
90 | sudo apt install putty
91 | ```
92 |
93 | 3. ~~Convert the generated pem key to PuTTY format~~ No need to do this anymore, as AWS can provide the key in PPK format now.
94 |
95 |     * GUI: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html#putty-private-key
96 |     * CLI:
97 |
98 | ```sh
99 | puttygen key.pem -O private -o key.ppk
100 | ```
101 |
102 | 4. Make sure the key is readable only by your Windows/Linux/Mac user, eg
103 |
104 | ```sh
105 | chmod 0400 key.ppk
106 | ```
107 |
108 | ### Create and connect to an EC2 box
109 |
110 | 1. Create an EC2 instance
111 |
112 |     0. Optional: create an Elastic IP for your box
113 |     1. Go to the Instances overview at https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Instances:sort=instanceId
114 |     2. Click "Launch Instance"
115 |     3. Provide a name for your server (e.g. `daroczig-de3-week1`) and some additional tags for resource tracking, including tagging downstream services, such as Instance and Volumes:
116 |         * Class: `DE3`
117 |         * Owner: `daroczig`
118 |     4.
Pick the `Ubuntu Server 24.04 LTS (HVM), SSD Volume Type` AMI
119 |     5. Pick the `t3a.small` instance type (2 GiB of RAM should be enough for most tasks; see more [instance types](https://aws.amazon.com/ec2/instance-types))
120 |     6. Select your AWS key created above and launch
121 |     7. Pick a unique name for the security group after clicking "Edit" on the "Network settings"
122 |     8. Click "Launch instance"
123 |     9. Note and click on the instance id
124 |
125 | 2. Connect to the box
126 |
127 |     1. Specify the hostname or IP address
128 |
129 | ![](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/images/putty-session-config.png)
130 |
131 |     2. Specify the "Private key file for authentication" in the Connection category's SSH/Auth/Credentials pane
132 |     3. Set the username to `ubuntu` on the Connection/Data tab
133 |     4. Save the Session profile
134 |     5. Click the "Open" button
135 |     6. Accept & cache the server's host key
136 |
137 | Alternatively, you can connect via a standard SSH client on Mac or Linux, something like:
138 |
139 | ```sh
140 | chmod 0400 /path/to/your/pem
141 | ssh -i /path/to/your/pem ubuntu@ip-address-of-your-machine
142 | ```
143 |
144 | As a last resort, use "EC2 Instance Connect" from the EC2 dashboard by clicking "Connect" in the context menu of the instance (triggered by right click in the table).
145 |
146 | ### Install RStudio Server on EC2
147 |
148 | 1. Look at the docs: https://www.rstudio.com/products/rstudio/download-server
149 | 2. First, we will upgrade the already installed system packages to their most recent versions. Note: read up on the concept of a package manager!
150 |
151 | Download the Ubuntu `apt` package list:
152 |
153 | ```sh
154 | sudo apt update
155 | ```
156 |
157 | Optionally upgrade the system:
158 |
159 | ```sh
160 | sudo apt upgrade
161 | ```
162 |
163 | And optionally also reboot so that kernel upgrades can take effect.
164 |
165 | 3.
Install R 166 | 167 | ```sh 168 | sudo apt install r-base 169 | ``` 170 | 171 | To avoid manually answering "Yes" to the question to confirm installation, you can specify the `-y` flag: 172 | 173 | ```sh 174 | sudo apt install -y r-base 175 | ``` 176 | 177 | 178 | 4. Try R 179 | 180 | ```sh 181 | R 182 | ``` 183 | 184 | For example: 185 | 186 | ```r 187 | 1 + 4 188 | hist(mtcars$hp) 189 | # duh, where is the plot?! 190 | ``` 191 | 192 | Exit: 193 | 194 | ```r 195 | q() 196 | ``` 197 | 198 | Look at the files: 199 | 200 | ```sh 201 | ls 202 | ls -latr 203 | ``` 204 | 205 | 206 | 5. Install RStudio Server 207 | 208 | ```sh 209 | wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2024.12.1-563-amd64.deb 210 | sudo apt install -y gdebi-core 211 | sudo gdebi rstudio-server-2024.12.1-563-amd64.deb 212 | ``` 213 | 214 | 6. Check process and open ports 215 | 216 | ```sh 217 | rstudio-server status 218 | sudo rstudio-server status 219 | sudo systemctl status rstudio-server 220 | sudo ps aux | grep rstudio 221 | 222 | sudo apt -y install net-tools 223 | sudo netstat -tapen | grep LIST 224 | sudo netstat -tapen 225 | ``` 226 | 227 | 7. Look at the docs: http://docs.rstudio.com/ide/server-pro/ 228 | 229 | ### Connect to the RStudio Server 230 | 231 | 1. Confirm that the service is up and running and the port is open 232 | 233 | ```console 234 | ubuntu@ip-172-31-12-150:~$ sudo netstat -tapen | grep LIST 235 | tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 0 49065 23587/rserver 236 | tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 0 15671 1305/sshd 237 | tcp6 0 0 :::22 :::* LISTEN 0 15673 1305/sshd 238 | ``` 239 | 240 | 2. Try to connect to the host from a browser on port 8787, eg http://foobar.eu-west-1.compute.amazonaws.com:8787 241 | 3. Realize it's not working 242 | 4. 
Open up port 8787 in the security group by selecting your security group and clicking "Edit inbound rules":
243 |
244 | ![](https://user-images.githubusercontent.com/495736/222724348-869d0703-05f2-4ef3-bd80-574362e73265.png)
245 |
246 | 5. Authentication: http://docs.rstudio.com/ide/server-pro/authenticating-users.html
247 | 6. Create a new user:
248 |
249 |         sudo adduser ceu
250 |
251 | 7. Login & quick demo:
252 |
253 | ```r
254 | 1+2
255 | plot(mtcars)
256 | install.packages('fortunes')
257 | library(fortunes)
258 | fortune()
259 | fortune(200)
260 | system('whoami')
261 | ```
262 |
263 | 8. Reload the webpage (F5) and realize that we continue where we left the browser :)
264 | 9. Demo the terminal:
265 |
266 | ```console
267 | $ whoami
268 | ceu
269 | $ sudo whoami
270 | ceu is not in the sudoers file. This incident will be reported.
271 | ```
272 |
273 | 10. Grant sudo access to the new user by going back to SSH with `root` access:
274 |
275 | ```sh
276 | sudo apt install -y mc
277 | sudo mc
278 | sudo mcedit /etc/sudoers
279 | sudo adduser ceu admin
280 | man adduser
281 | man deluser
282 | ```
283 |
284 | Note 1: you might need to re-login / restart RStudio / reload R / reload the page to force a new shell login so that the updated group setting is applied.
285 | Note 2: you might want to add `NOPASSWD` to the `sudoers` file:
286 |
287 | ```sh
288 | ceu ALL=(ALL) NOPASSWD:ALL
289 | ```
290 |
291 | Note 3: be aware of the related security risks, though.
292 |
293 | 11. Custom login page: http://docs.rstudio.com/ide/server-pro/authenticating-users.html#customizing-the-sign-in-page
294 | 12. Custom port (e.g. 80): http://docs.rstudio.com/ide/server-pro/access-and-security.html#network-port-and-address
295 |
296 | ```sh
297 | echo "www-port=80" | sudo tee -a /etc/rstudio/rserver.conf
298 | sudo rstudio-server restart
299 | ```
300 |
301 | ### Update R
302 |
303 | Note the pretty outdated R version ...
so let's update R by using the apt repo managed by the CRAN team:
303 |
304 | ```sh
305 | wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
306 | sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
307 |
308 | ## fetch the most recent list of packages and versions, then auto-upgrade
309 | sudo apt-get update && sudo apt-get -y upgrade
310 | ```
311 |
312 | ## Week 2
313 |
314 | Quiz: https://forms.office.com/e/SgME7mHPM9 (5 mins)
315 |
316 | Recap on what we covered last week:
317 |
318 | 1. 2FA/MFA in AWS
319 | 2. Creating EC2 nodes
320 | 3. Connecting to EC2 nodes via SSH/PuTTY (note the difference between the PPK and PEM key formats)
321 | 4. Updating security groups
322 | 5. Installing RStudio Server
323 | 6. The difference between the R console and the shell
324 | 7. The use of `sudo` and how to grant `root` (system administrator) privileges
325 | 8. Adding new Linux users, setting passwords, adding users to groups
326 | 9. Messing up package updates and making RStudio Server unusable
327 |
328 | Note that you do NOT need to follow the instructions below marked with the 💪 emoji -- those have already been done for you, and the related steps are only included to document what has been done and demonstrated in the class.
329 |
330 | ### Amazon Machine Images
331 |
332 | 💪 Instead of starting from scratch, let's create an Amazon Machine Image (AMI) from the EC2 node we used last week, so that we can use it as the basis of all the next steps:
333 |
334 | * Find the EC2 node in the EC2 console
335 | * Right click, then "Image and templates" / "Create image"
336 | * Name the AMI and click "Create image"
337 | * It might take a few minutes to finish
338 |
339 | Then you can use the newly created `de3-week2` AMI to spin up a new instance:
340 |
341 | 1.
Go to the Instances overview at https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Instances:sort=instanceId
342 | 2. Click "Launch Instance"
343 | 3. Provide a name for your server (e.g. `daroczig-de3-week2`) and some additional tags for resource tracking, including tagging downstream services, such as Instance and Volumes:
344 |     * Class: `DE3`
345 |     * Owner: `daroczig`
346 | 4. Pick the `de3-week2` AMI
347 | 5. Pick the `t3a.small` instance type (2 GiB of RAM should be enough for most tasks; see more [instance types](https://aws.amazon.com/ec2/instance-types))
348 | 6. Select your AWS key created above and launch
349 | 7. Select the `de3` security group (granting access to ports 22, 8000, 8080, and 8787)
350 | 8. Click "Advanced details" and select the `ceudataserver` IAM instance profile
351 | 9. Click "Launch instance"
352 | 10. Note and click on the instance id
353 |
354 | ### 💪 Create a user for every member of the team
355 |
356 | We'll export the list of IAM users from AWS and create a system user for everyone.
357 |
358 | 1. Attach a newly created IAM EC2 Role (let's call it `ceudataserver`) to the EC2 box and assign 'Read-only IAM access' (`IAMReadOnlyAccess`):
359 |
360 | ![](https://raw.githubusercontent.com/daroczig/CEU-R-prod/master/images/ec2-new-role.png)
361 |
362 | ![](https://raw.githubusercontent.com/daroczig/CEU-R-prod/master/images/ec2-new-role-type.png)
363 |
364 | ![](https://raw.githubusercontent.com/daroczig/CEU-R-prod/master/images/ec2-new-role-rights.png)
365 |
366 | 2. Install the AWS CLI tool (note that we are using the snap package manager, as the AWS CLI was removed from the apt repos):
367 |
368 | ```
369 | sudo snap install aws-cli --classic
370 | ```
371 |
372 | 3. List all the IAM users: https://docs.aws.amazon.com/cli/latest/reference/iam/list-users.html
373 |
374 | ```
375 | aws iam list-users
376 | ```
377 |
378 | 4.
Install R packages for JSON parsing and logging (used in the next steps) from the apt repo instead of CRAN sources, as per https://github.com/eddelbuettel/r2u
379 |
380 | ```sh
381 | wget -q -O- https://eddelbuettel.github.io/r2u/assets/dirk_eddelbuettel_key.asc | sudo tee -a /etc/apt/trusted.gpg.d/cranapt_key.asc
382 | sudo add-apt-repository "deb [arch=amd64] https://r2u.stat.illinois.edu/ubuntu noble main"
383 | sudo apt update
384 |
385 | sudo apt-get install -y r-cran-jsonlite r-cran-logger r-cran-glue
386 | ```
387 |
388 | Note that all dependencies (be it an R package or a system/Ubuntu package) have been automatically resolved and installed.
389 |
390 | Don't forget to click on the brush icon to clean up your terminal output if needed.
391 |
392 | Optionally [enable `bspm`](https://github.com/eddelbuettel/r2u#step-5-use-bspm-optional) to allow binary package installations via the traditional `install.packages` R function.
393 |
394 | 5. Export the list of users from R:
395 |
396 | ```
397 | library(jsonlite)
398 | users <- fromJSON(system('aws iam list-users', intern = TRUE))
399 | str(users)
400 | users[[1]]$UserName
401 | ```
402 |
403 | 6.
Create a new system user on the box (for RStudio Server access) for every IAM user, set a password, and add the user to a group:
404 |
405 | ```
406 | library(logger)
407 | library(glue)
408 | for (user in users[[1]]$UserName) {
409 |
410 |   ## remove invalid characters
411 |   user <- sub('@.*', '', user)
412 |   user <- sub('.', '_', user, fixed = TRUE)
413 |
414 |   log_info('Creating {user}')
415 |   system(glue("sudo adduser --disabled-password --quiet --gecos '' {user}"))
416 |
417 |   log_info('Setting password for {user}')
418 |   system(glue("echo '{user}:secretpass' | sudo chpasswd")) # note the single quotes + placement of sudo
419 |
420 |   log_info('Adding {user} to sudo group')
421 |   system(glue('sudo adduser {user} sudo'))
422 |
423 | }
424 | ```
425 |
426 | Note: you may have to temporarily enable passwordless `sudo` for your user (if you have not done so already) :/
427 |
428 | ```
429 | ceu ALL=(ALL) NOPASSWD:ALL
430 | ```
431 |
432 | Check users:
433 |
434 | ```
435 | readLines('/etc/passwd')
436 | ```
437 |
438 | ### 💪 Install Jenkins to schedule R commands
439 |
440 | ![](https://wiki.jenkins-ci.org/download/attachments/2916393/fire-jenkins.svg)
441 |
442 | 1. Install Jenkins from the RStudio/Terminal: https://www.jenkins.io/doc/book/installing/linux/#debianubuntu
443 |
444 | ```sh
445 | sudo apt install -y fontconfig openjdk-17-jre
446 |
447 | sudo wget -O /usr/share/keyrings/jenkins-keyring.asc \
448 |   https://pkg.jenkins.io/debian-stable/jenkins.io-2023.key
449 | echo deb [signed-by=/usr/share/keyrings/jenkins-keyring.asc] \
450 |   https://pkg.jenkins.io/debian-stable binary/ | sudo tee \
451 |   /etc/apt/sources.list.d/jenkins.list > /dev/null
452 | sudo apt-get update
453 | sudo apt-get install -y jenkins
454 |
455 | # check which port is opened by java (jenkins)
456 | sudo ss -tapen | grep java
457 | ```
458 |
459 | 2. Open up port 8080 in the related security group
460 | 3. Access Jenkins from your browser and finish the installation
461 |
462 | 1.
Read the initial admin password from the RStudio/Terminal via
463 |
464 | ```sh
465 | sudo cat /var/lib/jenkins/secrets/initialAdminPassword
466 | ```
467 |
468 | 2. Proceed with installing the suggested plugins
469 | 3. Create your first user (eg `ceu`)
470 |
471 | Note that if loading Jenkins takes a lot of time after the node gets a new IP, it might be because
472 | it cannot load `theme.css`, as it still looks for it on the previous IP (as per the
473 | Jenkins URL setting). To overcome this, wait 2 mins for the `theme.css` timeout, log in, and disable
474 | the dark theme: https://github.com/jenkinsci/dark-theme-plugin/issues/458
475 |
476 | ### 💪 Update Jenkins for shared usage
477 |
478 | Update the security backend to use real Unix users for shared access (if the users have already been created):
479 |
480 | ![](https://user-images.githubusercontent.com/495736/224517493-652ac34e-f44d-4ac9-8d04-d661dcfc4c4b.png)
481 |
482 | And allow `jenkins` to authenticate UNIX users and restart:
483 |
484 | ```sh
485 | sudo adduser jenkins shadow
486 | sudo systemctl restart jenkins
487 | ```
488 |
489 | Then make sure to test new user access in an incognito window to avoid locking yourself out :)
490 |
491 | ### 💪 Set up an easy-to-remember IP address
492 |
493 | Optionally you can associate a fixed IP address with your box:
494 |
495 | 1. Allocate a new Elastic IP address at https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Addresses:
496 | 2. Name this resource by assigning a "Name" tag
497 | 3. Associate this Elastic IP with your stopped box, then start it
498 |
499 | ### 💪 Set up an easy-to-remember domain name
500 |
501 | Optionally you can associate a subdomain with your node, using the Elastic IP address created above:
502 |
503 | 1. Go to Route 53: https://console.aws.amazon.com/route53/home
504 | 2. Go to Hosted Zones and click on `ceudata.net`
505 | 3.
Create a new Record, where
506 |
507 | - fill in the desired `Name` (subdomain), eg `de3.ceudata.net`
508 | - paste the public IP address or hostname of your server in the `Value` field
509 | - click `Create`
510 |
511 | 4. Now you will be able to access your box using this custom (sub)domain, no need to remember IP addresses.
512 |
513 | ### 💪 Configuring for standard ports
514 |
515 | To avoid using ports like `8787` and `8080` (and getting blocked by the firewall installed on the CEU WiFi), let's configure our services to listen on the standard 80 (HTTP) and potentially on the 443 (HTTPS) port as well, and serve RStudio on the `/rstudio`, and Jenkins on the `/jenkins` path.
516 |
517 | To this end, we will use Nginx as a reverse proxy, so let's install it first:
518 |
519 | ```shell
520 | sudo apt install -y nginx
521 | ```
522 |
523 | First, we need to edit the Nginx config to enable websockets (for Shiny apps etc.) in `/etc/nginx/nginx.conf` under the `http` section:
524 |
525 | ```
526 | map $http_upgrade $connection_upgrade {
527 |     default upgrade;
528 |     '' close;
529 | }
530 | ```
531 |
532 | Then we need to edit the main site's configuration at `/etc/nginx/sites-enabled/default` to act as a proxy, which also does some transformations, eg rewriting the URL (removing the `/rstudio` path) before hitting RStudio Server:
533 |
534 | ```
535 | server {
536 |     listen 80;
537 |     rewrite ^/rstudio$ $scheme://$http_host/rstudio/ permanent;
538 |     location /rstudio/ {
539 |         rewrite ^/rstudio/(.*)$ /$1 break;
540 |         proxy_pass http://localhost:8787;
541 |         proxy_redirect http://localhost:8787/ $scheme://$http_host/rstudio/;
542 |         proxy_http_version 1.1;
543 |         proxy_set_header Upgrade $http_upgrade;
544 |         proxy_set_header Connection $connection_upgrade;
545 |         proxy_read_timeout 20d;
546 |     }
547 | }
548 | ```
549 |
550 | And restart Nginx:
551 |
552 | ```shell
553 | sudo systemctl restart nginx
554 | ```
555 |
556 | Find more information at
https://support.rstudio.com/hc/en-us/articles/200552326-Running-RStudio-Server-with-a-Proxy.
557 |
558 | Let's see if the port is open on the machine:
559 |
560 | ```shell
561 | sudo ss -tapen | grep LIST
562 | ```
563 |
564 | Let's see if we can access RStudio Server on the new path:
565 |
566 | ```shell
567 | curl localhost/rstudio
568 | ```
569 |
570 | Now let's check from the outside world ... and realize that we need to open up port 80!
571 |
572 | Now we need to tweak the config to support Jenkins as well, but the above Nginx rewrite hack will not work (see https://www.jenkins.io/doc/book/system-administration/reverse-proxy-configuration-troubleshooting/ for more details), so we will just make it a standard reverse proxy, eg:
573 |
574 | ```
575 | server {
576 |     listen 80;
577 |     rewrite ^/rstudio$ $scheme://$http_host/rstudio/ permanent;
578 |     location / {
579 |
580 |     }
581 |     location /rstudio/ {
582 |         rewrite ^/rstudio/(.*)$ /$1 break;
583 |         proxy_pass http://localhost:8787;
584 |         proxy_redirect http://localhost:8787/ $scheme://$http_host/rstudio/;
585 |         proxy_http_version 1.1;
586 |         proxy_set_header Upgrade $http_upgrade;
587 |         proxy_set_header Connection $connection_upgrade;
588 |         proxy_read_timeout 20d;
589 |     }
590 |     location ^~ /jenkins/ {
591 |         proxy_pass http://127.0.0.1:8080/jenkins/;
592 |         proxy_set_header X-Real-IP $remote_addr;
593 |         proxy_set_header X-Forwarded-For $remote_addr;
594 |         proxy_set_header Host $host;
595 |     }
596 | }
597 | ```
598 |
599 | We also need to let Jenkins know about the custom path, so uncomment `Environment="JENKINS_PREFIX=/jenkins"` in `/lib/systemd/system/jenkins.service`, then reload the Systemd configs and restart Jenkins:
600 |
601 | ```shell
602 | sudo systemctl daemon-reload
603 | sudo systemctl restart jenkins
604 | ```
605 |
606 | See more details at the [Jenkins reverse proxy
guide](https://www.jenkins.io/doc/book/system-administration/reverse-proxy-configuration-with-jenkins/reverse-proxy-configuration-nginx/).
607 |
608 | Optionally, replace the default, system-wide `index.html` for folks visiting the root domain without either the `rstudio` or `jenkins` path (note that instead of editing the file, which might be overwritten by package updates, it would be better to create a new HTML file and reference that from the Nginx configuration, but we will keep it simple and dirty for now):
609 |
610 | ```shell
611 | echo "Welcome to DE3! Are you looking for /rstudio or /jenkins?" | sudo tee /usr/share/nginx/html/index.html
612 | ```
613 |
614 | Then restart Jenkins, and you are good to go!
615 |
616 | It might be useful to also proxy port 8000 for future use by updating the Nginx config to:
617 |
618 | ```
619 | server {
620 |     listen 80;
621 |     rewrite ^/rstudio$ $scheme://$http_host/rstudio/ permanent;
622 |     location / {
623 |
624 |     }
625 |     location /rstudio/ {
626 |         rewrite ^/rstudio/(.*)$ /$1 break;
627 |         proxy_pass http://localhost:8787;
628 |         proxy_redirect http://localhost:8787/ $scheme://$http_host/rstudio/;
629 |         proxy_http_version 1.1;
630 |         proxy_set_header Upgrade $http_upgrade;
631 |         proxy_set_header Connection $connection_upgrade;
632 |         proxy_read_timeout 20d;
633 |     }
634 |     location ^~ /jenkins/ {
635 |         proxy_pass http://127.0.0.1:8080/jenkins/;
636 |         proxy_set_header X-Real-IP $remote_addr;
637 |         proxy_set_header X-Forwarded-For $remote_addr;
638 |         proxy_set_header Host $host;
639 |     }
640 |     location ^~ /8000/ {
641 |         rewrite ^/8000/(.*)$ /$1 break;
642 |         proxy_pass http://127.0.0.1:8000;
643 |         proxy_set_header X-Real-IP $remote_addr;
644 |         proxy_set_header X-Forwarded-For $remote_addr;
645 |         proxy_set_header Host $host;
646 |     }
647 | }
648 | ```
649 |
650 | This way you can access the above services via the below URLs:
651 |
652 | RStudio Server:
653 |
654 | * http://your.ip.address:8787
655 | * http://your.ip.address/rstudio
656 |
657 | Jenkins:
658 |
659 | * http://your.ip.address:8080/jenkins
660 | * http://your.ip.address/jenkins
661 |
662 | Port 8000:
663 |
664 | * http://your.ip.address:8000
665 | * http://your.ip.address/8000
666 |
667 | If you cannot access RStudio Server on port 80, you might need to restart `nginx` as per above.
668 |
669 | Next, set up SSL either with Nginx or by placing an AWS Load Balancer in front of the EC2 node.
670 |
671 | ### Warmup exercises
672 |
673 | Install `ggplot2` (or your preferred Python packages) to replicate the below steps, which we will automate later:
674 |
675 | 1. Install the `devtools` R package and a few others (binary distribution) in the RStudio/Terminal:
676 |
677 | ```sh
678 | sudo apt-get install -y r-cran-devtools r-cran-data.table r-cran-httr r-cran-jsonlite r-cran-stringi r-cran-stringr r-cran-glue r-cran-logger r-cran-snakecase
679 | ```
680 |
681 | 2. Switch back to the R console and install the `binancer` R package from GitHub to interact with crypto exchanges (note the extra dependency to be installed from CRAN, no need to update any already installed package):
682 |
683 | ```r
684 | devtools::install_github('daroczig/binancer', upgrade = FALSE)
685 | ```
686 |
687 | 3. First steps with live data: load the `binancer` package and then use the `binance_klines` function to get the last 3 hours of Bitcoin price changes (in USD) with 1-minute granularity -- resulting in an object like:
688 |
689 | ```r
690 | > str(klines)
691 | Classes ‘data.table’ and 'data.frame': 180 obs. of 12 variables:
692 | $ open_time : POSIXct, format: "2020-03-08 20:09:00" "2020-03-08 20:10:00" "2020-03-08 20:11:00" "2020-03-08 20:12:00" ...
693 | $ open : num 8292 8298 8298 8299 8298 ...
694 | $ high : num 8299 8299 8299 8299 8299 ...
695 | $ low : num 8292 8297 8297 8298 8296 ...
696 | $ close : num 8298 8298 8299 8298 8299 ...
697 | $ volume : num 25.65 9.57 20.21 9.65 24.69 ...
698 | $ close_time : POSIXct, format: "2020-03-08 20:09:59" "2020-03-08 20:10:59" "2020-03-08 20:11:59" "2020-03-08 20:12:59" ...
699 | $ quote_asset_volume : num 212759 79431 167677 80099 204883 ...
700 | $ trades : int 371 202 274 186 352 271 374 202 143 306 ...
701 | $ taker_buy_base_asset_volume : num 13.43 5.84 11.74 7.12 15.24 ...
702 | $ taker_buy_quote_asset_volume: num 111430 48448 97416 59071 126493 ...
703 | $ symbol : chr "BTCUSDT" "BTCUSDT" "BTCUSDT" "BTCUSDT" ...
704 | - attr(*, ".internal.selfref")=<externalptr>
705 | ```
706 |
The code generating the above:
708 |
709 | ```r
710 | library(binancer)
711 | klines <- binance_klines('BTCUSDT', interval = '1m', limit = 60*3)
712 | str(klines)
713 | summary(klines$close)
714 | ```
715 |
716 | 717 | 4. Visualize the data, eg on a simple line chart: 718 | 719 | ![](https://raw.githubusercontent.com/daroczig/CEU-R-prod/2019-2020/images/binancer-plot-1.png) 720 | 721 |
The code generating the above:
722 |
723 | ```r
724 | library(ggplot2)
725 | ggplot(klines, aes(close_time, close)) + geom_line()
726 | ```
727 |
728 | 729 | 5. Now create a candle chart, something like: 730 | 731 | ![](https://raw.githubusercontent.com/daroczig/CEU-R-prod/2019-2020/images/binancer-plot-2.png) 732 | 733 |
The code generating the above:
734 |
735 | ```r
736 | library(scales)
737 | ggplot(klines, aes(open_time)) +
738 |   geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
739 |   geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
740 |   theme_bw() + theme('legend.position' = 'none') + xlab('') +
741 |   ggtitle(paste('Last Updated:', Sys.time())) +
742 |   scale_y_continuous(labels = dollar) +
743 |   scale_color_manual(values = c('#1a9850', '#d73027')) # RdYlGn
744 | ```
745 |
746 |
747 | 6. Compare prices of 4 currencies (eg BTC, ETH, BNB and XRP) in the past 24 hours at 15-minute intervals:
748 |
749 | ![](https://raw.githubusercontent.com/daroczig/CEU-R-prod/2019-2020/images/binancer-plot-3.png)
750 |
751 |
The code generating the above:
752 |
753 | ```r
754 | library(data.table)
755 | klines <- rbindlist(lapply(
756 |   c('BTCUSDT', 'ETHUSDT', 'BNBUSDT', 'XRPUSDT'),
757 |   binance_klines,
758 |   interval = '15m', limit = 4*24))
759 | ggplot(klines, aes(open_time)) +
760 |   geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
761 |   geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
762 |   theme_bw() + theme('legend.position' = 'none') + xlab('') +
763 |   ggtitle(paste('Last Updated:', Sys.time())) +
764 |   scale_color_manual(values = c('#1a9850', '#d73027')) +
765 |   facet_wrap(~symbol, scales = 'free', nrow = 2)
766 | ```
767 |
768 |
769 |
770 |
771 | 7. Some further useful functions:
772 |
773 |     - `binance_ticker_all_prices()`
774 |     - `binance_coins_prices()`
775 |     - `binance_credentials` and `binance_balances`
776 |
777 | 8. Create an R script that reports and/or plots on some cryptocurrencies; a few ideas:
778 |
779 |     - compute the (relative) change in prices of cryptocurrencies in the past 24 / 168 hours
780 |     - go back in time 1 / 12 / 24 months and "invest" $1K in BTC and see the value today
781 |     - write a bot buying and selling crypto on a virtual exchange
782 |
783 | ### Schedule R commands
784 |
785 | Let's schedule a Jenkins job to check on the Bitcoin prices every hour!
786 |
787 | 1. Visit Jenkins using the `/jenkins` URL path of your instance's public IP address
788 | 2. Use your UNIX username and password to log in
789 |
790 | If logging in takes a long time, it might be due to the Dark Theme plugin trying to load a CSS file from an outdated location based on the Jenkins URL configured at `/jenkins/manage/configure`. To overcome this issue, wait it out at the above URL and update the IP address to your new dynamic IP address, or disable the plugin at `/jenkins/manage/pluginManager/installed` and then restart Jenkins at the bottom of the page.
791 |
792 | 3. Create a "New Item" (job):
793 |
794 |     1. Enter the name of the job: `get current Bitcoin price`
795 |     2. Pick "Freestyle project"
796 |     3. Click "OK"
797 |     4. Add a new "Execute shell" build step
798 |     5. Enter the below command to look up the most recent BTC price
799 |
800 | ```sh
801 | R -e "library(binancer);binance_coins_prices()[symbol == 'BTC', usd]"
802 | ```
803 |
804 |     6. Run the job
805 |
806 | ![](https://raw.githubusercontent.com/daroczig/CEU-R-prod/2019-2020/images/jenkins-errors.png)
807 |
808 | 4. Debug & figure out what the problem is ...
809 | 5.
Install R packages system-wide from RStudio/Terminal (more on this later):

    Either start R in the terminal as the root user (via `sudo R`) and run the previous `devtools::install_github` command there, or with a one-liner:

    ```sh
    sudo Rscript -e "library(devtools);withr::with_libpaths(new = '/usr/local/lib/R/site-library', install_github('daroczig/binancer', upgrade = FALSE))"
    ```

6. Rerun the job

    ![](https://raw.githubusercontent.com/daroczig/CEU-R-prod/2018-2019/images/jenkins-success.png)

### Schedule R scripts

1. Create an R script with the below content and save it on the server, eg as `/home/ceu/bitcoin-price.R`:

    ```r
    library(binancer)
    prices <- binance_coins_prices()
    paste('The current Bitcoin price is', prices[symbol == 'BTC', usd])
    ```

2. Follow the steps from the [Schedule R commands](#schedule-r-commands) section to create a new Jenkins job, but instead of calling `R -e "..."` in the shell step, reference the above R script using `Rscript`:

    ```shell
    Rscript /home/ceu/bitcoin-price.R
    ```

    Alternatively, you could also install `littler` for this purpose:

    ```shell
    sudo apt install -y r-cran-littler
    r /home/ceu/bitcoin-price.R
    ```

    Note the permission error, so let's add the `jenkins` user to the `ceu` group:

    ```shell
    sudo adduser jenkins ceu
    ```

    Then restart Jenkins from the RStudio Server terminal:

    ```shell
    sudo systemctl restart jenkins
    ```

    A better solution, covered later, is to commit our R script into a git repo and make updating from the repo part of the job.

3. Create an R script that generates a candlestick chart on the BTC prices from the past hour, saves it as `btc.png` in the workspace, and updates every 5 minutes!
Example solution for the above ... 861 | 862 | ```r 863 | library(binancer) 864 | library(ggplot2) 865 | library(scales) 866 | klines <- binance_klines('BTCUSDT', interval = '1m', limit = 60) 867 | g <- ggplot(klines, aes(open_time)) + 868 | geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) + 869 | geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) + 870 | theme_bw() + theme('legend.position' = 'none') + xlab('') + 871 | ggtitle(paste('Last Updated:', Sys.time())) + 872 | scale_y_continuous(labels = dollar) + 873 | scale_color_manual(values = c('#1a9850', '#d73027')) 874 | ggsave('btc.png', plot = g, width = 10, height = 5) 875 | ``` 876 |
1. Enter the name of the job: `Update BTC candlestick chart`
2. Pick "Freestyle project"
3. Click "OK"
4. Add a new "Execute shell" build step
5. Enter the below command to generate and save the chart

    ```sh
    Rscript /home/ceu/plot.R
    ```

6. Run the job
7. Look at the workspace, which can be accessed from the sidebar menu of the job.

### ScheduleR improvements

1. Use a git repository to store the R scripts and fetch the most recent version on job start:

    1. Configure the Jenkins job to use "Git" in the "Source Code Management" section, use e.g. https://gist.github.com/daroczig/e5d3ee3664549932bb7f23ce8e93e472 as the repo URL, and specify the branch (`main`).
    2. Update the Execute task section to refer to the `btcprice.R` file of the repo instead of the hardcoded local path.
    3. Make edits to the repo, e.g. update the lookback to 3 hours, and check a future job output.

2. Set up e-mail notifications via eg https://app.mailjet.com/signin

    1. 💪 Sign up, confirm your e-mail address and domain
    2. 💪 Take note of the SMTP settings, eg

        * SMTP server: in-v3.mailjet.com
        * Port: 465
        * SSL: Yes
        * Username: ***
        * Password: ***

    3. 💪 Configure Jenkins at http://de3.ceudata.net/jenkins/configure

        1. Set up the default FROM e-mail address at "System Admin e-mail address": jenkins@ceudata.net
        2. Search for "Extended E-mail Notification" and configure

            * SMTP Server
            * Click "Advanced"
            * Check "Use SMTP Authentication"
            * Enter User Name from the above steps
            * Enter Password from the above steps
            * Check "Use SSL"
            * SMTP port: 465

    4. Set up "Post-build Actions" in Jenkins: Editable Email Notification - read the manual and info popups, configure to get an e-mail on job failures and fixes
    5.
Configure the job to include the whole build log as the default body template for all outgoing e-mails:

    ```shell
    ${BUILD_LOG, maxLines=1000}
    ```

3. Look at other Jenkins plugins, eg the Slack Notifier: https://plugins.jenkins.io/slack

### Intro to redis

We need persistent storage for our Jenkins jobs ... let's give a key-value database a try:

1. 💪 Install the server

    ```
    sudo apt install redis-server
    ss -tapen | grep LIST
    ```

    Test using the CLI tool:

    ```
    redis-cli get foo
    redis-cli set foo 42
    redis-cli get foo
    redis-cli del foo
    ```

2. 💪 Install an R client

    Although we could use `RcppRedis` (available on CRAN, and thus also in our already-added apt repo), `rredis` provides some convenient helpers that we plan to use, so we are going to install the latter from a custom R package repository to also demonstrate how `drat` works:

    ```
    sudo apt install -y r-cran-rcppredis
    sudo Rscript -e "withr::with_libpaths(new = '/usr/local/lib/R/site-library', install.packages('rredis', repos=c('https://ghrr.github.io/drat', 'https://cloud.r-project.org')))"
    ```

3.
Interact from R

    ```r
    ## set up and initialize the connection to the local redis server
    library(rredis)
    redisConnect()

    ## set/get values
    redisSet('foo', 'bar')
    redisGet('foo')

    ## increment and decrease counters
    redisIncr('counter')
    redisIncr('counter')
    redisIncr('counter')
    redisGet('counter')
    redisDecr('counter')
    redisDecr('counter2')

    ## get multiple values at once
    redisMGet(c('counter', 'counter2'))

    ## list all keys
    redisKeys()

    ## drop all keys
    redisDelete(redisKeys())
    ```

    For more examples and ideas, see the [`rredis` package vignette](https://rdrr.io/cran/rredis/f/inst/doc/rredis.pdf) or try the interactive, general (not R-specific) [redis tutorial](https://try.redis.io).

4. Exercises

    - Create a Jenkins job running every minute to cache the most recent Bitcoin and Ethereum prices in Redis
    - Write an R script in RStudio that can read the Bitcoin and Ethereum prices from the Redis cache
Example solution ... 997 | 998 | ```r 999 | library(binancer) 1000 | library(data.table) 1001 | prices <- binance_coins_prices() 1002 | 1003 | library(rredis) 1004 | redisConnect() 1005 | 1006 | redisSet('username:price:BTC', prices[symbol == 'BTC', usd]) 1007 | redisSet('username:price:ETH', prices[symbol == 'ETH', usd]) 1008 | 1009 | redisGet('username:price:BTC') 1010 | redisGet('username:price:ETH') 1011 | 1012 | redisMGet(c('username:price:BTC', 'username:price:ETH')) 1013 | ``` 1014 |
1015 | 1016 |
Example solution using a helper function doing some logging ... 1017 | 1018 | ```r 1019 | library(binancer) 1020 | library(logger) 1021 | library(rredis) 1022 | redisConnect() 1023 | 1024 | store <- function(s) { 1025 | ## TODO use the checkmate pkg to assert the type of symbol 1026 | log_info('Looking up and storing {s}') 1027 | value <- binance_coins_prices()[symbol == s, usd] 1028 | key <- paste('username', 'price', s, sep = ':') 1029 | redisSet(key, value) 1030 | log_info('The price of {s} is {value}') 1031 | } 1032 | 1033 | store('BTC') 1034 | store('ETH') 1035 | 1036 | ## list all keys with the "price" prefix and lookup the actual values 1037 | redisMGet(redisKeys('username:price:*')) 1038 | ``` 1039 |
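The `TODO` in the helper above can be addressed with an input assertion. Below is a minimal sketch of a key-building helper guarded with the `checkmate` package (the `price_key` function name and its defaults are mine, for illustration); it needs no Redis connection to try out:

```r
library(checkmate)

## build a namespaced Redis key for a coin symbol, after asserting that
## the input is a single, non-empty, all-uppercase string
price_key <- function(s, username = 'username') {
  assert_string(s, min.chars = 1, pattern = '^[A-Z]+$')
  paste(username, 'price', s, sep = ':')
}

price_key('BTC')
# "username:price:BTC"
price_key('btc')
# throws an assertion error
```

The same guard could be reused in the Jenkins job before hitting the Binance API.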
More on databases at the "Mastering R" class in the Spring semester ;)

### Interacting with MS Teams

1. Join the #bots-bots-bots channel in the DE3 course's MS Teams
2. Click on "Connectors" in the channel's config and add an incoming webhook with your username and optional logo, and store the URL for later use
3. 💪 Install the `teamr` package from CRAN

    ```shell
    sudo apt install -y r-cran-teamr
    ```

4. Use the webhook URL to send messages to the channel from R:

    ```r
    library(teamr)
    webhook_url <- "https://ceuedu.webhook.office.com/webhookb2/..."
    # create a new connector card
    cc <- connector_card$new(hookurl = webhook_url)
    # set the title and text of the message
    cc$title("Hi from R!")
    cc$text("This is a test message sent using the `teamr` package.")
    # send the message
    cc$send()
    ```

## Week 3

Quiz: https://forms.office.com/e/wRAxGqirdV (5 mins)

Recap on what we covered last week:

- Amazon Machine Image
- Shared RStudio Server
- Setting up Jenkins
- Reverse proxy for more convenient access to services
- Scheduling R commands in Jenkins
- Scheduling R scripts in Jenkins
- Further Jenkins features, such as e-mail notifications and git integration
- Introduction to Redis
- Interacting with MS Teams

New server for this week:

1. Go to the Instances overview at https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Instances:sort=instanceId
2. Click "Launch Instance"
3. Provide a name for your server (e.g. `daroczig-de3-week3`) and some additional tags for resource tracking, including tagging downstream services, such as Instances and Volumes:
    * Class: `DE3`
    * Owner: `daroczig`
4. Pick the `de3` AMI
5.
Pick the `t3a.small` instance type (2 GiB of RAM should be enough for most tasks; see more [instance types](https://aws.amazon.com/ec2/instance-types))
6. Select your AWS key created above and launch
7. Select the `de3` security group (granting access to ports 22, 8000, 8080, and 8787)
8. Click "Advanced details" and select the `ceudataserver` IAM instance profile
9. Click "Launch instance"
10. Note and click on the instance id

#### Storing the secret webhook URL

1. Do NOT store the webhook URL in plain-text (e.g. in your R script)!
2. Let's use Amazon's Key Management Service: https://github.com/daroczig/CEU-R-prod/raw/2017-2018/AWR.Kinesis/AWR.Kinesis-talk.pdf (slides 73-75)
3. 💪 Instead of using the Java SDK referenced in the above talk, let's install the `boto3` Python module and use it via `reticulate`:

    ```shell
    sudo apt install -y python3-boto3
    sudo apt install -y r-cran-reticulate r-cran-botor
    ```

    Let's also let R know that we want to use the globally installed Python interpreter and its packages, instead of setting up local virtual environments, by adding the following to your `/etc/R/Renviron.site` file:

    ```shell
    RETICULATE_PYTHON=/usr/bin/python3
    ```

4. 💪 Create a key in the Key Management Service (KMS): `alias/de3`
5. 💪 Grant access to that KMS key by creating an EC2 IAM role at https://console.aws.amazon.com/iam/home?region=eu-west-1#/roles with the `AWSKeyManagementServicePowerUser` policy and explicitly granting access to the key in the KMS console
6. 💪 Attach the newly created IAM role if not yet done
7.
Use this KMS key to encrypt the webhook URL:

    ```r
    library(botor)
    botor(region = 'eu-west-1')
    kms_encrypt(webhook_url, key = 'alias/de3')
    ```

    Note: if R asks you to install Miniconda, say NO, as Python3 and the required packages have already been installed system-wide.

8. Store the ciphertext and use `kms_decrypt` to decrypt it later, see eg

    ```r
    kms_decrypt("AQICAHgzIk6iRoD8yYhFk//xayHj0G7uYfdCxrW6ncfAZob2MwHI8Q6jK8f6xby87I/+4BXBAAABYjCCAV4GCSqGSIb3DQEHBqCCAU8wggFLAgEAMIIBRAYJKoZIhvcNAQcBMB4GCWCGSAFlAwQBLjARBAymjX9tB9jzUXVfrt8CARCAggEVpmQubNTH72mH3J1/54gNIbOUJ2bZ9VMRqg0zKkdnw7ke6lYhCODJtysKx+sgK8r7zzeSWLXrvX0nSP572boxVfQWFWWNg3f+ib17rkaDlBSbF0DM8nPoaHAQMK38HeOs6STmhRXmyiGY0OuxAWFWwxPoh2t72Yc7JJO5SLUDK6ddSkUQr3S3gjtMYc1L+QMEg9vkYEKICDGAprgZc21br5+eQRsowGkOSPw8mx+U0WAiMpwDGrzza+/hnVGmRvG4HDLaXbaRgouWyhtbEhuP5CKyGjOjYzoCY0WMcZOmowwG773ABijB+zr2SUVn2yJI5tMfn7b7aRUnQybLuPCEdmUk17lQvdJJw/0a+TpmMkMSYM9wfg==")
    ```

9. 💪 Alternatively, use the AWS Parameter Store or Secrets Manager, see eg https://eu-west-1.console.aws.amazon.com/systems-manager/parameters/?region=eu-west-1&tab=Table and grant the `AmazonSSMReadOnlyAccess` policy to your IAM role or user.

    ```r
    ssm_get_parameter('/teams/daroczig')
    ```

10. Store your own webhook in the Parameter Store and use it in your R script.

### Job Scheduler exercises

* Create a Jenkins job to alert if the Bitcoin price is below $80K or higher than $100K.
* Limit this job to alert only once per hour, but the alert should kick off as soon as possible (so don't schedule it to run hourly; instead, use a state to track when the last alert was sent).
* Create a Jenkins job to alert if the Bitcoin price changed more than $200 in the past hour.
* Create a Jenkins job to alert if the Bitcoin price changed more than 5% in the past day.
1148 | * Create a Jenkins job running hourly to generate a candlestick chart on the price of BTC and ETH. 1149 | 1150 |
Example solution for the first exercise ...

```r
## get data right from the Binance API
library(binancer)
btc <- binance_klines('BTCUSDT', interval = '1m', limit = 1)$close

## or from the local cache (updated every minute from Jenkins as per above)
library(rredis)
redisConnect()
btc <- redisGet('username:price:BTC')

## log whatever was retrieved
library(logger)
log_info('The current price of a Bitcoin is ${btc}')

## get the last alert time, falling back to the epoch if no alert was sent yet
last_alert <- redisGet('username:alert:last')
if (is.null(last_alert)) {
  last_alert <- as.POSIXct(0, origin = '1970-01-01')
}
since_last_alert <- as.numeric(difftime(Sys.time(), last_alert, units = 'secs'))

## send alert at most once per hour
if (since_last_alert >= 3600 && (btc < 80000 || btc > 100000)) {
  library(botor)
  botor(region = 'eu-west-1')
  webhook_url <- ssm_get_parameter('/teams/username')
  library(teamr)
  cc <- connector_card$new(hookurl = webhook_url)
  cc$title('Bitcoin price alert!')
  cc$text(paste('The current price of a Bitcoin is:', btc))
  cc$send()
  ## record the alert time so the next run can respect the hourly limit
  redisSet('username:alert:last', Sys.time())
}
```
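The rate-limiting part of the job above is the easiest piece to get wrong, so it can help to factor it out into a pure function that is testable without Redis or the Binance API. A minimal sketch (the `should_alert` name, arguments and thresholds are mine, mirroring the exercise):

```r
## decide whether an alert may fire: the price must be out of band AND
## at least `gap` seconds must have passed since the last alert
## (last_alert is NULL when no alert was ever sent)
should_alert <- function(price, last_alert, gap = 3600,
                         lower = 80000, upper = 100000) {
  out_of_band <- price < lower || price > upper
  if (is.null(last_alert)) return(out_of_band)
  since <- as.numeric(difftime(Sys.time(), last_alert, units = 'secs'))
  out_of_band && since >= gap
}

should_alert(105000, NULL)              # TRUE: out of band, never alerted
should_alert(105000, Sys.time() - 120)  # FALSE: alerted 2 minutes ago
should_alert(90000, Sys.time() - 7200)  # FALSE: price within band
```

With this shape, the Jenkins job only has to read/write the last-alert key in Redis and call the function.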
### Make API endpoints

1. 💪 Install plumber: [rplumber.io](https://www.rplumber.io)

    ```sh
    sudo apt install -y r-cran-plumber
    ```

2. Create an API endpoint to show the min, max and mean price of BTC in the past hour!

    Create `~/plumber.R` with the below content:

    ```r
    library(binancer)

    #* BTC stats
    #* @get /btc
    function() {
      klines <- binance_klines('BTCUSDT', interval = '1m', limit = 60L)
      klines[, .(min = min(close), mean = mean(close), max = max(close))]
    }
    ```

    Start the plumber application either via clicking on the "Run API" button or the below commands:

    ```r
    library(plumber)
    pr("plumber.R") %>% pr_run(host = '0.0.0.0', port = 8000)
    ```

3. Add a new API endpoint to generate the candlestick chart with dynamic symbol (default to BTC), interval and limit! Note that you might need a new `@serializer`, function arguments, and type conversions as well.
Example solution for the above ... 1223 | 1224 | ```r 1225 | library(binancer) 1226 | library(ggplot2) 1227 | library(scales) 1228 | 1229 | #* Generate plot 1230 | #* @param symbol coin pair 1231 | #* @param interval:str enum 1232 | #* @param limit integer 1233 | #* @get /klines 1234 | #* @serializer png 1235 | function(symbol = 'BTCUSDT', interval = '1m', limit = 60L) { 1236 | klines <- binance_klines(symbol, interval = interval, limit = as.integer(limit)) # NOTE int conversion 1237 | library(scales) 1238 | p <- ggplot(klines, aes(open_time)) + 1239 | geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) + 1240 | geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) + 1241 | theme_bw() + theme('legend.position' = 'none') + xlab('') + 1242 | ggtitle(paste('Last Updated:', Sys.time())) + 1243 | scale_y_continuous(labels = dollar) + 1244 | scale_color_manual(values = c('#1a9850', '#d73027')) # RdYlGn 1245 | print(p) 1246 | } 1247 | ``` 1248 |
4. Add a new API endpoint to generate an HTML report including both the above!
Example solution for the above ... 1253 | 1254 | 💪 Update the `markdown` package: 1255 | 1256 | ```shell 1257 | sudo apt install -y r-cran-markdown 1258 | ``` 1259 | 1260 | Create an R markdown for the reporting: 1261 | 1262 | `````md 1263 | --- 1264 | title: "report" 1265 | output: html_document 1266 | date: "`r Sys.Date()`" 1267 | --- 1268 | 1269 | ```{r setup, include=FALSE} 1270 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE) 1271 | library(binancer) 1272 | library(ggplot2) 1273 | library(scales) 1274 | library(knitr) 1275 | 1276 | klines <- function() { 1277 | binance_klines('BTCUSDT', interval = '1m', limit = 60L) 1278 | } 1279 | ``` 1280 | 1281 | Bitcoin stats: 1282 | 1283 | ```{r stats} 1284 | kable(klines()[, .(min = min(close), mean = mean(close), max = max(close))]) 1285 | ``` 1286 | 1287 | On a nice plot: 1288 | 1289 | ```{r plot} 1290 | ggplot(klines(), aes(open_time, )) + 1291 | geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) + 1292 | geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) + 1293 | theme_bw() + theme('legend.position' = 'none') + xlab('') + 1294 | ggtitle(paste('Last Updated:', Sys.time())) + 1295 | scale_y_continuous(labels = dollar) + 1296 | scale_color_manual(values = c('#1a9850', '#d73027')) 1297 | ``` 1298 | ````` 1299 | 1300 | And the plumber file: 1301 | 1302 | ```r 1303 | library(binancer) 1304 | library(ggplot2) 1305 | library(scales) 1306 | library(rmarkdown) 1307 | library(plumber) 1308 | 1309 | #' Gets BTC data from the past hour 1310 | #' @return data.table 1311 | klines <- function() { 1312 | binance_klines('BTCUSDT', interval = '1m', limit = 60L) 1313 | } 1314 | 1315 | #* BTC stats 1316 | #* @get /stats 1317 | function() { 1318 | klines()[, .(min = min(close), mean = mean(close), max = max(close))] 1319 | } 1320 | 1321 | #* Generate plot 1322 | #* @get /plot 1323 | #* @serializer png 1324 | function() { 1325 | p <- ggplot(klines(), aes(open_time, )) + 1326 | geom_linerange(aes(ymin = 
open, ymax = close, color = close < open), size = 2) + 1327 | geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) + 1328 | theme_bw() + theme('legend.position' = 'none') + xlab('') + 1329 | ggtitle(paste('Last Updated:', Sys.time())) + 1330 | scale_y_continuous(labels = dollar) + 1331 | scale_color_manual(values = c('#1a9850', '#d73027')) # RdYlGn 1332 | print(p) 1333 | } 1334 | 1335 | #* Generate HTML 1336 | #* @get /report 1337 | #* @serializer html 1338 | function(res) { 1339 | filename <- tempfile(fileext = '.html') 1340 | on.exit(unlink(filename)) 1341 | render('report.Rmd', output_file = filename) 1342 | include_file(filename, res) 1343 | } 1344 | ``` 1345 | 1346 | Run via: 1347 | 1348 | ```r 1349 | library(plumber) 1350 | pr('plumber.R') %>% pr_run(port = 8000) 1351 | ``` 1352 |
Try to DRY (don't repeat yourself!) this up as much as possible.

### R API containers

Why API? Why an R-based API? Examples:

* adtech
* healthtech

1. Write an R script that provides 3 API endpoints (look up examples from the past week!):

    * `/stats` reports on the min/mean/max BTC price from the past 3 hours
    * `/plot` generates a candlestick chart on the price of BTC from the past 3 hours
    * `/report` generates an HTML report including both the above
Example solution for the above ... 1370 | 1371 | 💪 Update the `markdown` package: 1372 | 1373 | ```shell 1374 | sudo apt install -y r-cran-markdown 1375 | ``` 1376 | 1377 | Create an R markdown for the reporting: 1378 | 1379 | `````md 1380 | --- 1381 | title: "report" 1382 | output: html_document 1383 | date: "`r Sys.Date()`" 1384 | --- 1385 | 1386 | ```{r setup, include=FALSE} 1387 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE) 1388 | library(binancer) 1389 | library(ggplot2) 1390 | library(scales) 1391 | library(knitr) 1392 | 1393 | klines <- function() { 1394 | binance_klines('BTCUSDT', interval = '1m', limit = 60L) 1395 | } 1396 | ``` 1397 | 1398 | Bitcoin stats: 1399 | 1400 | ```{r stats} 1401 | kable(klines()[, .(min = min(close), mean = mean(close), max = max(close))]) 1402 | ``` 1403 | 1404 | On a nice plot: 1405 | 1406 | ```{r plot} 1407 | ggplot(klines(), aes(open_time, )) + 1408 | geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) + 1409 | geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) + 1410 | theme_bw() + theme('legend.position' = 'none') + xlab('') + 1411 | ggtitle(paste('Last Updated:', Sys.time())) + 1412 | scale_y_continuous(labels = dollar) + 1413 | scale_color_manual(values = c('#1a9850', '#d73027')) 1414 | ``` 1415 | ````` 1416 | 1417 | And the plumber file: 1418 | 1419 | ```r 1420 | library(binancer) 1421 | library(ggplot2) 1422 | library(scales) 1423 | library(rmarkdown) 1424 | library(plumber) 1425 | 1426 | #' Gets BTC data from the past hour 1427 | #' @return data.table 1428 | klines <- function() { 1429 | binance_klines('BTCUSDT', interval = '1m', limit = 60L) 1430 | } 1431 | 1432 | #* BTC stats 1433 | #* @get /stats 1434 | function() { 1435 | klines()[, .(min = min(close), mean = mean(close), max = max(close))] 1436 | } 1437 | 1438 | #* Generate plot 1439 | #* @get /plot 1440 | #* @serializer png 1441 | function() { 1442 | p <- ggplot(klines(), aes(open_time, )) + 1443 | geom_linerange(aes(ymin = 
open, ymax = close, color = close < open), size = 2) + 1444 | geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) + 1445 | theme_bw() + theme('legend.position' = 'none') + xlab('') + 1446 | ggtitle(paste('Last Updated:', Sys.time())) + 1447 | scale_y_continuous(labels = dollar) + 1448 | scale_color_manual(values = c('#1a9850', '#d73027')) # RdYlGn 1449 | print(p) 1450 | } 1451 | 1452 | #* Generate HTML 1453 | #* @get /report 1454 | #* @serializer html 1455 | function(res) { 1456 | filename <- tempfile(fileext = '.html') 1457 | on.exit(unlink(filename)) 1458 | render('report.Rmd', output_file = filename) 1459 | include_file(filename, res) 1460 | } 1461 | ``` 1462 | 1463 | Run via: 1464 | 1465 | ```r 1466 | library(plumber) 1467 | pr('plumber.R') %>% pr_run(port = 8000) 1468 | ``` 1469 |
2. Try to DRY (don't repeat yourself!) this up as much as possible.
3. Bundle all the scripts into a single Docker image:

    a. 💪 Install Docker:

    ```shell
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
      https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    sudo apt-get update
    sudo apt-get install -y docker-ce
    ```

    b. Create a new file named `Dockerfile` (File/New file/Text file to avoid auto-adding the `R` file extension) with the below content to add the required files and set the default working directory to the same folder:

    ```
    FROM rstudio/plumber

    RUN apt-get update && apt-get install -y pandoc && apt-get clean && rm -rf /var/lib/apt/lists/
    RUN install2.r ggplot2
    RUN installGithub.r daroczig/binancer
    ADD report.Rmd /app/report.Rmd
    ADD plumber.R /app/plumber.R
    EXPOSE 8000
    WORKDIR /app
    ```

    Note the step installing the required R packages!

    c. Build the Docker image:

    ```sh
    sudo docker build -t btc-report-api .
    ```

    d. Run a container based on the above image:

    ```sh
    sudo docker run -p 8000:8000 -ti btc-report-api plumber.R
    ```

    e. Realize that it's not working, as we need to install the `rmarkdown` package as well, so update the above `Dockerfile` on line 4, rebuild the image and run again.

    f. Test by visiting the `8000` port or the Nginx proxy at http://de3.ceudata.net/8000, e.g. the Swagger docs at http://de3.ceudata.net/8000/__docs__/#/default/get_report or an endpoint directly at http://de3.ceudata.net/8000/report.
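Before rebuilding the image, the aggregation behind the `/stats` endpoint can be sanity-checked offline. A minimal sketch with a synthetic `klines` table standing in for the `binance_klines()` output (the column names follow the snippets above; the values are made up):

```r
library(data.table)

## one hour of fake 1-minute closes in the shape of binance_klines() output
klines <- data.table(
  open_time = seq(Sys.time() - 3540, Sys.time(), by = '1 min'),
  close = seq(100, 159, length.out = 60))

## the same aggregation the /stats endpoint serves
klines[, .(min = min(close), mean = mean(close), max = max(close))]
# min = 100, mean = 129.5, max = 159
```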
### Docker registry

Now let's make the above created and tested Docker image available outside of the RStudio Server by uploading it to the Elastic Container Registry (ECR):

1. 💪 Create a new private repository at https://eu-west-1.console.aws.amazon.com/ecr/home?region=eu-west-1, call it `de3-example-api`
2. 💪 Assign the `EC2InstanceProfileForImageBuilderECRContainerBuilds` policy to the `ceudataserver` IAM role so that we get RW access to the ECR repositories. Tighten this role up in prod!
3. 💪 Log in to ECR on the RStudio Server so that we can upload the Docker image:

    ```sh
    aws ecr get-login-password --region eu-west-1 | sudo docker login --username AWS --password-stdin 657609838022.dkr.ecr.eu-west-1.amazonaws.com
    ```

4. 💪 Tag the already built Docker image for upload:

    ```sh
    sudo docker tag btc-report-api:latest 657609838022.dkr.ecr.eu-west-1.amazonaws.com/de3-example-api:latest
    ```

5. 💪 Push the Docker image:

    ```sh
    sudo docker push 657609838022.dkr.ecr.eu-west-1.amazonaws.com/de3-example-api:latest
    ```

6. Check the Docker repository at https://eu-west-1.console.aws.amazon.com/ecr/repositories/private/657609838022/de3-example-api?region=eu-west-1

### Docker service

1. Go to the Elastic Container Service (ECS) dashboard at https://eu-west-1.console.aws.amazon.com/ecs/home?region=eu-west-1#/
2. Create a task definition for the Docker run:

    1. Task name: `btc-api`
    2. Container name: `api`
    3. Image URI: `657609838022.dkr.ecr.eu-west-1.amazonaws.com/de3-example-api`
    4. Container port: 8000
    5. Command in the Docker configuration: `plumber.R`
    6. Review the Task size, but the default values should be fine for this simple task

3. Create a new cluster, call it `BTC_API`, using Fargate.
Don't forget to add the `Class` tag!
4. Create a Service in the newly created Cluster at https://eu-west-1.console.aws.amazon.com/ecs/v2/clusters/btc-api/services?region=eu-west-1

    1. The Compute option can be "Launch type" for now
    2. Specify the Task Family as `btc-api`
    3. Use the same name for the service
    4. Use the `de3` security group
    5. Create a load balancer listening on port 80 (we would need to create an SSL cert for HTTPS), and specify `/stats` as the healthcheck path, with a 10-second grace period
    6. Test the deployed service behind the load balancer, e.g. https://btc-api-1417435399.eu-west-1.elb.amazonaws.com/report

## Home assignment

The goal of this assignment is to confirm that you have a general understanding of how to build data pipelines using Amazon Web Services and R, and can actually implement a stream processing application (running either in almost real-time or in a batched/scheduled way) or an R-based API in practice.
### Tech setup

To minimize the system administration and some of the already-covered engineering tasks for the students, the below pre-configured tools are provided as free options, but students can decide to build their own environment (on the top of or independently from these) and feel free to use any other tools:

* `de3` Amazon Machine Image that you can use to spin up an EC2 node with RStudio Server, Shiny Server, Jenkins, Redis and Docker installed & pre-configured (use your AWS username and the password shared on Slack previously), along with the most often used R packages (including the ones we used for stream processing, eg `botor`, `AWR.Kinesis` and the `binancer` package)
* `de3` EC2 IAM role with full access to Kinesis, DynamoDB, CloudWatch and the `slack` token in the Parameter Store
* `de3` security group with open ports for RStudio Server and Jenkins
* lecture and seminar notes at https://github.com/daroczig/CEU-R-prod

### Required output

Make sure to clean up your EC2 nodes, security groups, keys etc. created in the past weeks, as left-over AWS resources will contribute negative points to your final grade! E.g. the EC2 node you created in the second week should be terminated.

* Minimal project (for grade up to "B"): schedule a Jenkins job that runs every hour, getting the past hour's 1-minute interval klines data on ETH prices (in USD). The job should be configured to pull the R script at the start of the job either from a private or public git repo or gist. Then:

    * Find the min and max price of ETH in the past hour, and post these stats in the `#bots-bots-bots` MS Teams channel. Make sure to set your username for the message, and use a custom emoji as the icon.
    * Set up email notification for the job when it fails.
* Recommended project (for grade up to "A"): Deploy an R-based API in ECS (like we did last week) for analyzing recent Binance (or any other real-time) data. The API should include at least 4 endpoints using different serializers, and these endpoints should be other than the ones we covered in the class. **At least one endpoint should have at least a few parameters.** Build a Docker image, push it to ECR, and deploy it as a service in ECS. Document the steps required to set up ECR/ECS with screenshots, then delete all services after confirming that everything works correctly.

Regarding feedback: by default, I add super short feedback on Moodle as a comment to your submission (e.g. "good job" or "excellent" for grade A, or short details on why it was not A). If you want to receive more detailed feedback, please send me an email to schedule a quick call. If you want early feedback (before grading), send me an email at least a week before the submission deadline!

### Delivery method

* Create a PDF document that describes your solution and all the main steps involved in low-level detail: attach screenshots (including the URL nav bar and the date/time widget of your OS, so full-screen and not area-picked screenshots) of your browser showing what you are doing in RStudio Server, Jenkins, or the AWS dashboards, or example messages posted in MS Teams, and make sure that the code you wrote is either visible on the screenshots or included in the PDF.

* STOP the EC2 Instance you worked on, but don’t terminate it, so I can start it and check how it works. Note that your instance will be terminated by me after the end of the class.
* Include the `instance_id` on the first page of the PDF, along with your name or student id.
* Upload the PDF to Moodle.

### Submission deadline

Midnight (CET) on May 31, 2025.
## Extra: Stream processing using R and AWS

An introduction to stream processing with AWS Kinesis and R: https://github.com/daroczig/CEU-R-prod/raw/2017-2018/AWR.Kinesis/AWR.Kinesis-talk.pdf (presented at the Big Data Day Los Angeles 2016, EARL 2016 London and useR! 2017 Brussels)

This section describes how to set up a Kinesis stream with a few on-demand shards, fed by the live Binance transactions read from its websocket -- running in a Docker container, then feeding the JSON lines to Kinesis via the Amazon Kinesis Agent.

### 💪 Setting up a demo stream

1. Start a `t3a.micro` instance running the "Amazon Linux 2 AMI" (where it's easier to install the Kinesis Agent compared to e.g. Ubuntu) with a known key. Make sure to **set a name** and enable termination protection (in the instance details)! Connect via SSH, PuTTY, or e.g. the browser-based SSH connection. Note that the default username is `ec2-user` instead of `ubuntu`.

2. Install Docker (note that we are not on Ubuntu today, but using Red Hat's `yum` package manager):

    ```
    sudo yum install docker
    sudo service docker start
    sudo service docker status
    ```

3. Let's use a small Python app relying on the Binance API to fetch live transactions and store them in a local file:

    * sources: https://github.com/daroczig/ceu-de3-docker-binance-streamer
    * docker: https://cloud.docker.com/repository/registry-1.docker.io/daroczig/ceu-de3-docker-binance-streamer

    Usage:

    ```
    screen -RRd streamer
    sudo docker run -ti --rm --log-opt max-size=50m daroczig/ceu-de3-docker-binance-streamer >> /tmp/transactions.json
    ## "C-a c" to create a new screen, then you can switch with C-a "
    ls -latr /tmp
    tail -f /tmp/transactions.json
    ```

4. 
Install the Kinesis Agent, as per https://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html#download-install:

    ```
    sudo yum install -y aws-kinesis-agent
    ```

5. Create a new Kinesis Stream (called `crypto`) at https://eu-west-1.console.aws.amazon.com/kinesis. Don't forget to tag it (Class, Owner)!

6. Configure the Kinesis Agent:

    ```
    sudo yum install mc
    sudo mcedit /etc/aws-kinesis/agent.json
    ```

    Running the above commands, edit the config file to update the Kinesis endpoint, the name of the stream, and the local file path:

    ```
    {
      "cloudwatch.emitMetrics": true,
      "kinesis.endpoint": "https://kinesis.eu-west-1.amazonaws.com",
      "firehose.endpoint": "",

      "flows": [
        {
          "filePattern": "/tmp/transactions.json*",
          "kinesisStream": "crypto",
          "partitionKeyOption": "RANDOM"
        }
      ]
    }
    ```

    Note the extra star at the end of the `filePattern`: it handles potential issues when the file is copied/truncated (e.g. by logrotate).

7. Restart the Agent:

    ```
    sudo service aws-kinesis-agent start
    ```

8. Check the status and logs:

    ```
    sudo service aws-kinesis-agent status
    sudo journalctl -xe
    ls -latr /var/log/aws-kinesis-agent/aws-kinesis-agent.log
    tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log
    ```

9. Make sure that the IAM role (e.g. `kinesis-admin`) can write to Kinesis and CloudWatch, e.g. by attaching the `AmazonKinesisFullAccess` policy, then restart the agent:

    ```
    sudo service aws-kinesis-agent restart
    ```

10. Check the AWS console's monitoring tab to see if all looks good there as well.
11. Note that the agent also needs the `cloudwatch:PutMetricData` permission (see the example `cloudwatch-putmetrics` policy).
12. Optionally set up a cronjob to truncate the file from time to time:

    ```sh
    5 * * * * /usr/bin/truncate -s 0 /tmp/transactions.json
    ```

13. Set up an alert in CloudWatch for when the streaming stops.

### A simple stream consumer app in R

As the `botor` package was already installed, we can rely on the power of `boto3` to interact with the Kinesis stream. The IAM role attached to the node already has the `AmazonKinesisFullAccess` policy attached, so we have permission to read from the stream.

First we need to create a shard iterator, then, using that, we can read the actual records from the shard:

```r
library(botor)
botor(region = 'eu-west-1')
shard_iterator <- kinesis_get_shard_iterator('crypto', '0')
records <- kinesis_get_records(shard_iterator$ShardIterator)
str(records)
```

Let's parse these records:

```r
records$Records[[1]]
records$Records[[1]]$Data

library(jsonlite)
fromJSON(as.character(records$Records[[1]]$Data))
```

### Parsing and structuring records read from the stream

Exercises:

* parse the loaded 25 records into a `data.table` object with proper column types. Get some help on the data format from the [Binance API docs](https://github.com/binance/binance-spot-api-docs/blob/master/web-socket-streams.md#trade-streams)!
* count the overall number of coins exchanged
* compute the overall value of the transactions in USD (hint: `binance_ticker_all_prices()` and `binance_coins_prices()`)
* visualize the distribution of symbol pairs
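Before peeking at the solution below, it can help to inspect the shape of a single trade event. Here is a short Python sketch parsing one message in the format described by the linked Binance docs; the field values are made up:

```python
import json
from datetime import datetime, timezone

# A made-up trade event in the format documented for the Binance trade stream:
# 'e' event type, 'E' event time (epoch ms), 's' symbol, 't' trade ID,
# 'p' price, 'q' quantity, 'b'/'a' buyer/seller order IDs, 'T' trade time (epoch ms)
raw = ('{"e":"trade","E":1600000000000,"s":"ETHUSDT","t":12345,'
       '"p":"380.10","q":"0.5","b":88,"a":50,"T":1600000000000,"m":true,"M":true}')

trade = json.loads(raw)

# the numeric price and quantity arrive as JSON strings, so cast them explicitly
price = float(trade['p'])
quantity = float(trade['q'])

# timestamps are in milliseconds since the Unix epoch
traded_at = datetime.fromtimestamp(trade['T'] / 1000, tz=timezone.utc)

print(trade['s'], price, quantity, traded_at.isoformat())
```

The string-typed numbers and millisecond timestamps are exactly why the exercise asks for "proper column types".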
A potential solution that you should not look at before thinking ...

```r
library(data.table)
dt <- rbindlist(lapply(records$Records, function(record) {
  fromJSON(as.character(record$Data))
}))

str(dt)

setnames(dt, 'a', 'seller_id')
setnames(dt, 'b', 'buyer_id')
setnames(dt, 'E', 'event_timestamp')
## Unix timestamp / Epoch (number of seconds since Jan 1, 1970): https://www.epochconverter.com
dt[, event_timestamp := as.POSIXct(event_timestamp / 1000, origin = '1970-01-01')]
setnames(dt, 'q', 'quantity')
setnames(dt, 'p', 'price')
setnames(dt, 's', 'symbol')
setnames(dt, 't', 'trade_id')
setnames(dt, 'T', 'trade_timestamp')
dt[, trade_timestamp := as.POSIXct(trade_timestamp / 1000, origin = '1970-01-01')]
str(dt)

for (id in grep('_id', names(dt), value = TRUE)) {
  dt[, (id) := as.character(get(id))]
}
str(dt)

for (v in c('quantity', 'price')) {
  dt[, (v) := as.numeric(get(v))]
}

library(binancer)
binance_coins_prices()

dt[, .N, by = symbol]
dt[symbol == 'ETHUSDT']
dt[, from := substr(symbol, 1, 3)]
dt <- merge(dt, binance_coins_prices(), by.x = 'from', by.y = 'symbol', all.x = TRUE, all.y = FALSE)
dt[, value := quantity * usd]
dt[, sum(value)]
```
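The closing merge-and-sum in the snippet above (joining each trade's coin to its USD price, then summing `quantity * usd`) can be mirrored in a few lines of plain Python; the trades and prices below are made up, standing in for the live `binance_coins_prices()` lookup:

```python
# Made-up parsed trades (coin -> quantity) and USD prices per coin,
# mirroring the merge + value computation of the data.table solution above.
trades = [
    {'from': 'ETH', 'quantity': 0.5},
    {'from': 'BTC', 'quantity': 0.1},
    {'from': 'ETH', 'quantity': 1.2},
]
usd_prices = {'ETH': 380.0, 'BTC': 10500.0}  # made-up, like binance_coins_prices()

# left-join each trade to its coin's USD price, then sum quantity * price;
# unknown coins fall back to 0, like the NAs of an all.x = TRUE merge
total_usd = sum(t['quantity'] * usd_prices.get(t['from'], 0.0) for t in trades)
print(round(total_usd, 2))
```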
### Actual stream processing instead of analyzing batch data

Let's write an R function to increment counters on the number of transactions per symbol:

1. Get sample raw data as per above (you might need to get a new shard iterator if the previous one expired):

    ```r
    records <- kinesis_get_records(shard_iterator$ShardIterator)$Records
    ```

2. Write a function to parse and process a record:

    ```r
    txprocessor <- function(record) {
      symbol <- fromJSON(as.character(record$Data))$s
      log_info(paste('Found 1 transaction on', symbol))
      redisIncr(paste('USERNAME', 'tx', symbol, sep = ':'))
    }
    ```

3. Iterate on all records:

    ```r
    library(logger)
    library(rredis)
    redisConnect()
    for (record in records) {
      txprocessor(record)
    }
    ```

4. Check the counters:

    ```r
    symbols <- redisMGet(redisKeys('USERNAME:tx:*'))
    symbols

    symbols <- data.frame(
      symbol = sub('^USERNAME:tx:', '', names(symbols)),
      N = as.numeric(symbols))
    symbols
    ```

5. Visualize:

    ```r
    library(ggplot2)
    ggplot(symbols, aes(symbol, N)) + geom_bar(stat = 'identity')
    ```

6. Rerun steps (1) and (3) to do the data processing, then (4) and (5) for the updated data visualization.

7. 🤦

8. 
Let's make use of the next shard iterator:

    ```r
    ## reset counters
    redisDelete(redisKeys('USERNAME:tx:*'))

    ## get the first shard iterator
    shard_iterator <- kinesis_get_shard_iterator('crypto', '0')$ShardIterator

    while (TRUE) {

      response <- kinesis_get_records(shard_iterator)

      ## get the next iterator
      shard_iterator <- response$NextShardIterator

      ## extract records
      records <- response$Records
      for (record in records) {
        txprocessor(record)
      }

      ## summarize
      symbols <- redisMGet(redisKeys('USERNAME:tx:*'))
      symbols <- data.frame(
        symbol = sub('^USERNAME:tx:', '', names(symbols)),
        N = as.numeric(symbols))

      ## visualize
      print(ggplot(symbols, aes(symbol, N)) + geom_bar(stat = 'identity') + ggtitle(sum(symbols$N)))
    }
    ```

### Stream processor daemon

0. So far, we have used the `boto3` Python module from R via `botor` to interact with AWS, but this time we will integrate Java -- first calling the AWS Java SDK to interact with our Kinesis stream, then later on running a Java daemon to manage our stream processing application.

1. 💪 First, let's install Java and the `rJava` R package:

    ```shell
    sudo apt install r-cran-rjava
    ```

2. 💪 Then install the R package wrapping the AWS Java SDK and the Kinesis client, and update to the most recent development version right away:

    ```shell
    sudo apt install r-cran-awr.kinesis
    sudo R -e "withr::with_libpaths(new = '/usr/local/lib/R/site-library', install.packages('AWR', repos = 'https://daroczig.gitlab.io/AWR'))"
    sudo R -e "withr::with_libpaths(new = '/usr/local/lib/R/site-library', devtools::install_github('daroczig/AWR.Kinesis', upgrade = FALSE))"
    ```

3. 
💪 Note: after installing Java, you might need to run `sudo R CMD javareconf` and/or restart R or the RStudio Server via `sudo rstudio-server restart` :/ The telltale error looks like this:

    ```shell
    Error : .onLoad failed in loadNamespace() for 'rJava', details:
      call: dyn.load(file, DLLpath = DLLpath, ...)
      error: unable to load shared object '/usr/lib/R/site-library/rJava/libs/rJava.so':
      libjvm.so: cannot open shared object file: No such file or directory
    ```

4. After all that, a couple of lines of R code can get some data from the stream via the Java SDK (just like we did above with the Python backend):

    ```r
    library(rJava)
    library(AWR.Kinesis)
    records <- kinesis_get_records('crypto', 'eu-west-1')
    str(records)
    records[1]

    library(jsonlite)
    fromJSON(records[1])
    ```

Now let's build the actual stream processor application:

1. Create a new folder for the Kinesis consumer files: `streamer`

2. Create an `app.properties` file within that subfolder:

    ```
    executableName = ./app.R
    regionName = eu-west-1
    streamName = crypto
    applicationName = my_demo_app_sadsadsa
    AWSCredentialsProvider = DefaultAWSCredentialsProviderChain
    ```

3. 
Create the `app.R` file:

    ```r
    #!/usr/bin/Rscript
    library(logger)
    log_appender(appender_file('app.log'))
    library(AWR.Kinesis)
    library(methods)
    library(jsonlite)

    kinesis_consumer(

      initialize = function() {
        log_info('Hello')
        library(rredis)
        redisConnect(nodelay = FALSE)
        log_info('Connected to Redis')
      },

      processRecords = function(records) {
        log_info(paste('Received', nrow(records), 'records from Kinesis'))
        for (record in records$data) {
          symbol <- fromJSON(record)$s
          log_info(paste('Found 1 transaction on', symbol))
          redisIncr(paste('symbol', symbol, sep = ':'))
        }
      },

      updater = list(
        list(1/6, function() {
          log_info('Checking overall counters')
          symbols <- redisMGet(redisKeys('symbol:*'))
          log_info(paste(sum(as.numeric(symbols)), 'records processed so far'))
        })),

      shutdown = function()
        log_info('Bye'),

      checkpointing = 1,
      logfile = 'app.log')
    ```

4. 💪 Allow writing checkpointing data to DynamoDB and CloudWatch in IAM.

5. Convert the above R script into an executable using the terminal:

    ```shell
    cd streamer
    chmod +x app.R
    ```

6. Run the app in the terminal:

    ```
    /usr/bin/java -cp /usr/local/lib/R/site-library/AWR/java/*:/usr/local/lib/R/site-library/AWR.Kinesis/java/*:./ \
      com.amazonaws.services.kinesis.multilang.MultiLangDaemon \
      ./app.properties
    ```

7. Check on `app.log`

### Shiny app showing the progress

1. Reset the counters:

    ```r
    library(rredis)
    redisConnect()
    keys <- redisKeys('symbol*')
    redisDelete(keys)
    ```

2. 
💪 Install the `treemap` package and its dependencies:

    ```
    sudo apt install r-cran-httpuv r-cran-shiny r-cran-xtable r-cran-htmltools r-cran-igraph r-cran-lubridate r-cran-tidyr r-cran-quantmod r-cran-broom r-cran-zoo r-cran-htmlwidgets r-cran-tidyselect r-cran-rlist r-cran-rlang r-cran-xml r-cran-treemap r-cran-highcharter
    ```

3. Run the below Shiny app:

    ```r
    ## packages for plotting
    library(treemap)
    library(highcharter)

    ## connect to Redis
    library(rredis)
    redisConnect()

    library(shiny)
    library(data.table)
    ui <- shinyUI(highchartOutput('treemap', height = '800px'))
    server <- shinyServer(function(input, output, session) {

      symbols <- reactive({

        ## auto-update every 2 seconds
        reactiveTimer(2000)()

        ## get frequencies
        symbols <- redisMGet(redisKeys('symbol:*'))
        symbols <- data.table(
          symbol = sub('^symbol:', '', names(symbols)),
          N = as.numeric(symbols))

        ## color the top 3
        symbols[, color := 1]
        symbols[symbol %in% symbols[order(-N)][1:3, symbol], color := 2]

        ## return
        symbols

      })

      output$treemap <- renderHighchart({
        tm <- treemap(symbols(), index = c('symbol'),
                      vSize = 'N', vColor = 'color',
                      type = 'value', draw = FALSE)
        N <- sum(symbols()$N)
        hc_title(hctreemap(tm, animation = FALSE),
                 text = sprintf('Transactions (N=%s)', N))
      })

    })
    shinyApp(ui = ui, server = server, options = list(port = 3838))
    ```

We will learn more about Shiny in the upcoming Data Visualization 4 class :)

Will be updated from week to week.

## Getting help

File a [GitHub ticket](https://github.com/daroczig/CEU-R-prod/issues).