├── .gitignore ├── LICENSE ├── README.md ├── common_issues ├── README.md ├── help_github_n_Rstudio.md └── help_rmarkdown.md ├── lecture00-intro ├── README.md ├── getting_started.Rmd ├── getting_started.md ├── getting_started_files │ └── figure-gfm │ │ └── unnamed-chunk-2-1.png ├── intro_to_R.R └── test_Rmarkdown.Rmd ├── lecture01-coding-basics ├── README.md ├── assignment_1.R ├── coding_basics.Rmd ├── coding_basics.md └── first_script.R ├── lecture02-data-imp-n-exp ├── README.md ├── complete_codes │ └── dataset_handling_fin.R ├── data │ └── hotels_vienna │ │ ├── clean │ │ ├── README.md │ │ ├── VARIABLES.xlsx │ │ └── hotels-vienna.csv │ │ ├── export │ │ └── export_here │ │ └── raw │ │ ├── hotelbookingdata.csv │ │ └── show_folder └── raw_codes │ └── dataset_handling.R ├── lecture03-tibbles ├── README.md ├── complete_codes │ └── intro_to_tibbles_fin.R ├── data │ ├── games.csv │ └── points.csv └── raw_codes │ └── intro_to_tibbles.R ├── lecture04-data-munging ├── README.md ├── complete_codes │ └── data_munging_fin.R └── raw_codes │ └── data_munging.R ├── lecture05-data-exploration ├── README.md ├── complete_codes │ └── data_exploration_fin.R └── raw_codes │ └── data_exploration.R ├── lecture06-rmarkdown101 ├── README.md ├── complete_codes │ ├── report_bpp_fin.Rmd │ ├── report_bpp_fin.html │ └── report_bpp_fin.pdf └── raw_codes │ └── report_bpp.Rmd ├── lecture07-ggplot-indepth ├── README.md ├── complete_codes │ └── ggplot_indepth_fin.R └── raw_codes │ ├── ggplot_indepth.R │ ├── homework_ggpplot_runfile.R │ ├── theme_RENAMEME.R │ └── theme_bluewhite.R ├── lecture08-conditionals ├── README.md ├── conditionals.R ├── conditionals.Rmd └── conditionals.md ├── lecture09-loops ├── README.md ├── loops.R ├── loops.Rmd └── loops.md ├── lecture10-random-numbers ├── README.md ├── random_numbers.R ├── random_numbers.Rmd ├── random_numbers.md └── random_numbers_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-3-1.png │ ├── unnamed-chunk-4-.gif │ ├── unnamed-chunk-4-1.gif │ ├── unnamed-chunk-6-1.png │ ├── unnamed-chunk-7-1.png │ └── unnamed-chunk-9-1.png ├── lecture11-functions ├── README.md ├── functions.R ├── functions.Rmd ├── functions.md └── functions_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-10-2.png │ ├── unnamed-chunk-10-3.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-11-2.png │ ├── unnamed-chunk-11-3.png │ ├── unnamed-chunk-9-1.png │ ├── unnamed-chunk-9-2.png │ └── unnamed-chunk-9-3.png ├── lecture12-intro-to-regression ├── README.md ├── complete_codes │ ├── hotels_intro_to_regression_fin.R │ └── hotels_vienna_regression_fin_w_logs.R └── raw_codes │ └── hotels_intro_to_regression.R ├── lecture13-feature-engineering ├── README.md ├── complete_codes │ ├── feature_engineering_part_II_fin.R │ └── feature_engineering_part_I_fin.R └── raw_codes │ ├── feature_engineering_part_I.R │ └── feature_engineering_part_II.R ├── lecture14-simple-regression ├── README.md ├── complete_codes │ ├── life_exp_analysis_fin.R │ ├── life_exp_clean.R │ └── life_exp_getdata_fin.R ├── data │ ├── clean │ │ └── WDI_lifeexp_clean.csv │ └── raw │ │ └── WDI_lifeexp_raw.csv └── raw_codes │ ├── life_exp_analysis.R │ └── life_exp_getdata.R ├── lecture15-advanced-linear-regression ├── README.md ├── complete_codes │ └── hotels_advanced_regression_fin.R └── raw_codes │ └── hotels_advanced_regression.R ├── lecture16-binary-models ├── README.md ├── complete_codes │ └── binary_models_fin.R └── raw_codes │ └── binary_models.R ├── lecture17-dates-n-times ├── README.md ├── complete_codes │ 
└── date_time_manipulations_fin.R └── raw_codes │ └── date_time_manipulations.R ├── lecture18-timeseries-regression ├── README.md ├── complete_codes │ └── intro_time_series_fin.R └── raw_codes │ ├── ggplotacorr.R │ └── intro_time_series.R ├── lecture19-advaced-rmarkdown ├── README.md ├── complete_codes │ ├── advanced_rmarkdown_fin.Rmd │ ├── advanced_rmarkdown_fin.log │ ├── advanced_rmarkdown_fin.pdf │ └── advanced_rmarkdown_fin_files │ │ └── figure-latex │ │ ├── create figure wi label-1.pdf │ │ ├── plot pred graph-1.pdf │ │ ├── setup-1.pdf │ │ └── show two graphs-1.pdf ├── extra │ ├── maschools_prep.R │ ├── maschools_report.Rmd │ └── maschools_report.pdf ├── hotels_analysis.pdf └── raw_codes │ ├── advanced_rmarkdown.Rmd │ └── advanced_rmarkdown_prep.R ├── lecture20-basic-spatial-vizz ├── README.md ├── complete_codes │ └── visualize_spatial_fin.R ├── data_map │ ├── BEZIRKSGRENZEOGDPolygon.dbf │ ├── BEZIRKSGRENZEOGDPolygon.shp │ ├── BEZIRKSGRENZEOGDPolygon.shx │ ├── London_Borough_Excluding_MHW.dbf │ ├── London_Borough_Excluding_MHW.shp │ └── London_Borough_Excluding_MHW.shx ├── output │ ├── heu_prices.png │ └── lifeexpectancy_world.png └── raw_codes │ └── visualize_spatial.R ├── lecture21-cross-validation ├── README.md └── crossvalidation_usedcars.R ├── lecture22-lasso ├── README.md ├── codes │ ├── ch14_aux_fncs.R │ └── lasso_aribnb.R └── data │ └── airbnb_hackney_workfile_adj_book1.csv ├── lecture23-regression-tree ├── README.md └── cart_usedcars.R ├── lecture24-random-forest ├── README.md ├── codes │ ├── airbnb_prepare.R │ └── randomforest_airbnb.R └── data │ ├── airbnb_london_workfile_adj_book.csv │ ├── gbm_model.RData │ ├── rf_model_1.RData │ └── rf_model_2.RData ├── lecture25-classification-wML ├── README.md ├── codes │ ├── auxfuncs_binarywML.R │ └── classification_wML.R └── data │ └── bisnode_firms_clean.RData ├── lecture26-long-term-time-series-wML ├── README.md └── long_term_swimming.R └── lecture27-short-term-time-series-ARIMA-VAR ├── README.md └── short_term_priceindex.R /.gitignore: -------------------------------------------------------------------------------- 1 | # History files 2 | .Rhistory 3 | .Rapp.history 4 | 5 | # Session Data files 6 | .RData 7 | 8 | # User-specific files 9 | .Ruserdata 10 | 11 | # Example code in package build process 12 | *-Ex.R 13 | 14 | # Output files from R CMD build 15 | /*.tar.gz 16 | 17 | # Output files from R CMD check 18 | /*.Rcheck/ 19 | 20 | # RStudio files 21 | .Rproj.user/ 22 | 23 | # produced vignettes 24 | vignettes/*.html 25 | vignettes/*.pdf 26 | 27 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 28 | .httr-oauth 29 | 30 | # knitr and R markdown default cache directories 31 | *_cache/ 32 | /cache/ 33 | 34 | # Temporary files created by R markdown 35 | *.utf8.md 36 | *.knit.md 37 | 38 | # R Environment Variables 39 | .Renviron 40 | .DS_Store 41 | lecture02-data-imp_n_exp/data/hotels_vienna/raw/hotelbookingdata.csv 42 | lecture02-data-imp_n_exp/data/hotels_vienna/export/my_csvfile.csv 43 | lecture02-data-imp_n_exp/data/hotels_vienna/export/my_csvfile.xlsx 44 | lecture02-data-imp_n_exp/data/hotels_vienna/export/hotelbookingdata.xlsx 45 | lecture02-data-imp_n_exp/data/hotels_vienna/export/hotelbookingdata.RData 46 | lecture02-data-imp_n_exp/data/hotels_vienna/export/my_rfile.RData 47 | lecture00-intro/test_Rmarkdown.html 48 | lecture00-intro/test_Rmarkdown.pdf 49 | TO DOs 50 | lecture03-data-imp_n_exp/data/hotels_vienna/export/my_csvfile.csv 51 | lecture03-data-imp_n_exp/data/hotels_vienna/export/my_csvfile.xlsx 52 | 
lecture03-data-imp_n_exp/data/hotels_vienna/export/my_rfile.RData
53 | lecture12_intro_to_regression/complete_codes/hotels_vienna_regression_fin_w_logs.R
54 | lecture19-advaced_rmarkdown/complete_codes/advanced_rmarkdown_fin.log
55 | lecture20-basic-spatial-vizz/visualize_spatial_old.R
56 | partIII-case-studies/seminar04-random-forest-airbnb/data/rf_model_2auto.RData
57 |
-------------------------------------------------------------------------------- /LICENSE: --------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Gabors Data Analysis
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
-------------------------------------------------------------------------------- /common_issues/README.md: --------------------------------------------------------------------------------
1 | # Common issues to help you navigate this course
2 |
3 | In this course, we focus on R and its IDE: RStudio. This means you won’t learn anything about Python, Julia, or any other programming language useful for data science. They’re also excellent choices, and in practice, most data science teams use a mix of languages, often at least R and Python.
4 |
5 | Here we collect some of the common issues that we have experienced over the years of teaching coding with R. These issues are not specific to any of the lectures but relate to techniques (such as Git or GitHub) or troubleshooting (e.g. RMarkdown) that tend to be specific to individual students. We give some general advice on how to start tackling these topics.
6 |
7 | This folder is meant to be dynamic in the sense that it adapts to new problems and shows general guidance on how to solve them.
8 |
9 | ## Current issues
10 |
11 | - Troubles with knitting an RMarkdown document: [help_rmarkdown.md](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues/help_rmarkdown.md).
12 | - Help for Git and GitHub: [help_github_n_Rstudio.md](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues/help_github_n_Rstudio.md).
13 | - In general, we have found the [cheatsheets by RStudio](https://www.rstudio.com/resources/cheatsheets/) useful for several topics.
14 |
15 | ## Archived - not relevant anymore
16 |
17 | No archived issue at the moment.
18 |
-------------------------------------------------------------------------------- /common_issues/help_github_n_Rstudio.md: --------------------------------------------------------------------------------
1 | # Help to connect RStudio to your GitHub account
2 |
3 | In this course, we not only learn how to use R and RStudio but also want to build good programming habits.
4 | A key part of this is actively using **version control**, and Git and GitHub are a great example. RStudio has a built-in direct connection to GitHub.
5 |
6 | In the following, we give help on how to establish this direct connection.
7 |
8 | ## 1) Create a **Personal Access Token** at *GitHub*
9 | To give RStudio access to your GitHub account, you need to create a token.
10 |
11 | 1. Sign in to GitHub
12 | 2. Click on your personal icon button and select [**Settings**](https://github.com/settings/profile) (Note: not the Settings button for one of your repos)
13 | 3. In the left options panel, select [**Developer Settings**](https://github.com/settings/apps)
14 | 4. Select [**Personal Access Tokens**](https://github.com/settings/tokens)
15 | 5. Select **Generate New Token**. This will ask for your GitHub password.
16 |     - You should add a note on what this token is for, such as *RStudio access* or something similar
17 |     - Set the **Expiration** to **No Expiration**.
18 |     - You can safely check all **Scopes** boxes, but at a minimum you need to check: *repo, workflow, write:package, delete:package, notification, write:discussion*
19 | 6. Click on **Generate Token** at the bottom of the page.
20 | 7. You get your **key**, which you **NEED TO SAVE to a temporary file or note**.
21 |     - In case you have not saved the key, you can regenerate it by clicking on your already existing token, but then you will need to update all the apps using this key
22 |
23 | Some of these steps are nicely summarized and shown in [Ginny Fahs's blog](https://ginnyfahs.medium.com/github-error-authentication-failed-from-command-line-3a545bfd0ca8).
24 |
25 | ## 2) Create a new repo on your GitHub
26 |
27 | It is a good idea to create a new repo for this course. You can use this repo throughout the course.
28 |
29 | ## 3) Creating a version-controlled project in RStudio
30 |
31 | 1. Open RStudio.
32 | 2. Create a new project (File/New project)
33 | 3. Select **Version Control**
34 | 4. Select **Git**
35 | 5. Add your **repo's URL** so RStudio can clone it, and select the **path** on your computer where your repo will live. Click **create project**.
36 | 6. A window will pop up and ask for your **GitHub account** and then for the **token key**.
37 | 7. You have your first GitHub-controlled RStudio project!
38 |
39 | ## 4) Working with a version-controlled project in RStudio
40 |
41 | 1. Open and work on your created version-controlled project.
42 | 2. You can commit your work by:
43 |     - Going to Tools/Version Control/Commit; a window comes up:
44 |       - In the upper-left panel you can select files to commit.
45 |       - In the upper-right panel you must specify your commit message.
46 |       - The lower panel shows the changes in your files.
47 |     - You can Push/Pull with the arrows in the upper-right section of the window.
48 |     - You can follow your history by switching from *Changes* to *History* at the top.
49 |     - You can also switch branches from *master* at the top.
50 | 3. You can Push/Pull within the commit window or by clicking Tools/Version Control/Push or Pull
51 | 4. Now your work has been updated.
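
If you prefer the console to clicking, the same commit workflow can be scripted from R. Below is a minimal sketch, assuming the `gert` package is installed (`install.packages('gert')`) and your working directory is the project folder; the file name and message are only examples:

```r
library(gert)

git_add('README.md')                  # stage a file (upper-left panel)
git_commit('update lecture notes')    # commit with a message (upper-right panel)
git_push()                            # push to GitHub (up arrow)
```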
52 |
53 | ## 5) Additional comments
54 |
55 | - Try to make version control a habit: do not only save your work to your computer but also push it to your repo.
56 | - It is a good practice to save, commit and push your project after each working session or even more frequently.
57 | - Pay attention to your folder structure.
58 | - Always add `Readme.md` files to your folders, you can add them by *File/New File/Text File* and save them as `Readme.md`.
59 | - You do not need to use RStudio for version control, you can use the *Shell/Terminal* or specific programs such as *GitHub Desktop*. RStudio does the same as these tools, just in a compact way.
60 | - If you are interested in more on this topic, I highly recommend checking out https://happygitwithr.com/.
61 |
-------------------------------------------------------------------------------- /common_issues/help_rmarkdown.md: --------------------------------------------------------------------------------
1 | # Help to knit your first document in *RMarkdown*
2 |
3 | In many cases knitting a document, especially to pdf, results in an error. Here, I provide some possible solutions for these problems.
4 | You may iterate through these possible solutions and check whether you get a proper output after each fix.
5 |
6 | 1. Your working directory contains invalid characters
7 |     - In general, you should avoid paths that contain non-machine-readable characters, such as non-English characters *á,ë,ö* or characters that have their own purpose in coding, such as *.,;\*\\\[\]\(\)\#* or *space*.
8 |     - You should use **'_'** or **'-'** to separate words in your folder/file names instead, which is machine-readable.
9 |     - **Solution:** rename the folders in the path that contains the *.Rmd* file, and the *.Rmd* file itself.
10 |
11 | 2. Try to re-install or update your RStudio and R
12 |
13 |    2.1. You can update both RStudio and R internally
14 |     - RStudio update: click on Help/Check for Updates
15 |     - R update:
16 |       * Windows users: use the `installr` package.
17 |
18 | ```r
19 | install.packages("installr")
20 | library(installr)
21 | updateR()
22 | ```
23 |
24 |       * Mac update: substitute your password at the last line.
25 |
26 | ```r
27 | install.packages('devtools') # assuming it is not already installed
28 | library(devtools)
29 | install_github('andreacirilloac/updateR')
30 | library(updateR)
31 | updateR(admin_password = 'Admin user password')
32 | ```
33 |
34 |
35 |    2.2. You can download both of them from their websites; see links in the [Readme.md](https://github.com/regulyagoston/BA21_Coding/blob/main/README.md)
36 |
37 | 3. Error encountered while knitting a **pdf** file.
38 |
39 |    3.1 You do not have a **tex/latex** engine installed.
40 |     - Install tinytex from R:
41 | ```r
42 | install.packages('tinytex')
tinytex::install_tinytex() # the package alone is not enough; this installs the TinyTeX distribution itself
43 | ```
44 |
45 |    3.2. Your **tex/latex** engine is out-of-date.
46 |     - You have to update/re-install the latex engine.
47 |
48 | ```r
49 | tinytex::reinstall_tinytex()
50 | ```
51 |
52 |     - Restart your RStudio and try to knit your document.
53 |
54 |    3.3. Install another **tex/latex** engine if *tinytex* does not work...
55 |     - An alternative **tex/latex** engine is *MiKTeX*.
56 |       * Follow the steps written in [Søren L Kristiansen's blog](https://medium.com/@sorenlind/create-pdf-reports-using-r-r-markdown-latex-and-knitr-on-windows-10-952b0c48bfa9)
57 |       * Alternatively you can watch this old [video](https://www.youtube.com/watch?v=k-xSGZ-RLBU&ab_channel=OutLieer) on how to install it in RStudio.
58 |     - Try to stick with *tinytex* as much as possible; these alternatives are not as stable. However, sometimes I have found it is the only solution...
59 |
-------------------------------------------------------------------------------- /lecture00-intro/README.md: --------------------------------------------------------------------------------
1 | # Lecture 00: Introduction to R and RStudio
2 |
3 | ## Motivation
4 |
5 | In this course, we focus on R and its IDE: RStudio. This means you won’t learn anything about Python, Julia, or any other programming language useful for data science. They’re also excellent choices, and in practice, most data science teams use a mix of languages, often at least R and Python.
6 |
7 | We also believe that R is a great place to start your data science career as it is an environment designed from the ground up to support data science. R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammar for specific parts of the data science process. These mini languages help you think about problems as a data scientist while supporting fluent interaction between your brain and the computer. ([Hadley Wickham and Garrett Grolemund R for Data Science](https://r4ds.had.co.nz/introduction.html))
8 |
9 | ## This lecture
10 |
11 | This is the starting lecture, which introduces students to R and RStudio (download and install), runs a pre-written script, asks them to knit a pdf/HTML document, and highlights the importance of version control.
12 |
13 | The aim of this class is not to teach coding, but to make sure that everybody has R and RStudio on their laptop, installs the `tidyverse` package, and (tries to) knit an RMarkdown document. The main aim of these steps is to reveal possible OS mismatches or other problems with R and RStudio.
14 | The material and steps are detailed in [`getting_started.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/getting_started.md).
15 |
16 |
17 | ## Learning outcomes
18 | After successfully teaching the material (see: [`getting_started.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/getting_started.md)), students will have
19 |
20 | - R and RStudio on their laptops/computers
21 |
22 | and understand,
23 |
24 | - What RStudio looks like and which window is which.
25 | - How to run a command via the console.
26 | - What libraries (packages) are, and how to install and load them.
27 | - Why version control is important and what the main possibilities are with Git and GitHub.
28 |
29 | Furthermore, students will,
30 |
31 | - knit an RMarkdown document in both *pdf* and *HTML*, without any deeper knowledge of the details.
32 |
33 | These steps are extremely important, as fixing installation and knitting problems may take days to weeks.
34 |
35 | ## Datasets used
36 | * No dataset is used in this lecture
37 |
38 | ## Lecture Time
39 |
40 | Ideal overall time: **20-30 mins**.
41 |
42 | It can differ substantially from this if the teacher decides to do a live coding session with students and fix the emerging problems during the class (up to ~90 mins).
43 |
44 | ## Homework
45 |
46 | No homework, apart from fixing possible issues with R, RStudio, and compiling a '.Rmd' in HTML and pdf.
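
A related tip: if the **Knit** button itself misbehaves, knitting from the console can help isolate the problem, since the full error message shows up there. A minimal sketch, assuming `test_Rmarkdown.Rmd` is in the working directory:

```r
# the Knit button calls rmarkdown::render() under the hood
rmarkdown::render('test_Rmarkdown.Rmd', output_format = 'pdf_document')
```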
47 |
48 | ## Further material
49 |
50 | - Hadley Wickham and Garrett Grolemund R for Data Science [Chapter 1](https://r4ds.had.co.nz/introduction.html) on introduction, [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html) on libraries, [Chapter 6](https://r4ds.had.co.nz/workflow-scripts.html) on windows and workflow.
51 | - Kieran H. (2019): Data Visualization [Chapter 2.2](https://socviz.co/gettingstarted.html#use-r-with-rstudio) introduces the window structure in RStudio pretty well.
52 | - Andrew Heiss: Data Visualization with R, [Lesson 1](https://datavizs21.classes.andrewheiss.com/lesson/01-lesson/) provides some great videos and an introduction to R and Rmarkdown.
53 | - Git references:
54 |   - [Technical foundations of informatics book](https://info201.github.io/git-basics.html)
55 |   - [Software carpentry course](https://swcarpentry.github.io/git-novice/) (Strongly recommended)
56 |   - [Github Learning Lab](https://lab.github.com/)
57 |   - [If you are really committed](https://git-scm.com/book/en/v2) (pun intended)
58 |
59 |
60 | ## File structure
61 |
62 | - [`getting_started.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/getting_started.md) provides material on the installation of R, RStudio, and tidyverse, and shows some cool stuff with R.
63 | - [`getting_started.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/getting_started.Rmd) is the generating Rmarkdown file for `getting_started.md`
64 | - [`intro_to_R.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/intro_to_R.R) includes code to introduce scripts, install `tidyverse`, and show how cool R is.
65 | - [`test_Rmarkdown.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/test_Rmarkdown.Rmd) is a test file to reveal possible issues with knitting an RMarkdown document. During the course, students need to be able to compile their work into pdf and/or HTML. This test is super important to do as quickly as possible, since some fixes take a while...
66 |
67 | ## Help with RMarkdown and RStudio with Git
68 |
69 | In case you have trouble with knitting an RMarkdown document or connecting your GitHub to RStudio, I have collected the major solutions, which may help, in the [**common_issues**](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues) folder.
70 |
71 | - For RMarkdown: [help_rmarkdown.md](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues/help_rmarkdown.md) file.
72 | - For GitHub: [help_github_n_Rstudio.md](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues/help_github_n_Rstudio.md) file.
73 |
-------------------------------------------------------------------------------- /lecture00-intro/getting_started_files/figure-gfm/unnamed-chunk-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture00-intro/getting_started_files/figure-gfm/unnamed-chunk-2-1.png -------------------------------------------------------------------------------- /lecture00-intro/intro_to_R.R: --------------------------------------------------------------------------------
1 | #########################
2 | ##                     ##
3 | ## WELCOME TO R-STUDIO ##
4 | ##                     ##
5 | ##  THIS IS A SCRIPT   ##
6 | ##                     ##
7 | ##     Lecture 00      ##
8 | ##                     ##
9 | #########################
10 |
11 | # Cleaning the environment
12 | # just to be sure we are in the same setup
13 | rm(list = ls())
14 |
15 | # Install a package
16 | install.packages('tidyverse')
17 | # load a package for the work
18 | library(tidyverse)
19 |
20 |
21 | # There are built-in data:
22 | # mpg is a dataset for cars with different characteristics:
23 | mpg
24 |
25 |
26 | # It is easy to create a plot to compare:
27 | # engine size (displ) vs. fuel efficiency (hwy)
28 | ggplot(data = mpg) +
29 |   geom_point(mapping = aes(x = displ, y = hwy)) +
30 |   labs(y = 'fuel efficiency', x = 'engine size')
31 |
32 |
33 | # You may say that there are specific groups
34 | # which are not highlighted by this simple graph
35 | # it is easy to plot some further patterns...
36 | ggplot(data = mpg) +
37 |   geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
38 |   labs(y = 'fuel efficiency', x = 'engine size')
39 |
40 |
41 | # You may want to quantify these relations
42 | # First, start with the overall pattern
43 | ggplot(data = mpg) +
44 |   geom_point(mapping = aes(x = displ, y = hwy)) +
45 |   geom_smooth(mapping = aes(x = displ, y = hwy)) +
46 |   labs(y = 'fuel efficiency', x = 'engine size')
47 |
48 | # But it is as easy to refine the graph for more complex patterns:
49 | ggplot(data = mpg) +
50 |   geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
51 |   geom_smooth(mapping = aes(x = displ, y = hwy, color = class)) +
52 |   labs(y = 'fuel efficiency', x = 'engine size')
53 |
54 | # But it may be overcrowded...
55 | # No worries, one can easily make multiple graphs as well!
56 | ggplot(data = mpg) +
57 |   geom_point(mapping = aes(x = displ, y = hwy)) +
58 |   geom_smooth(mapping = aes(x = displ, y = hwy)) +
59 |   facet_wrap(~ class, nrow = 3) +
60 |   labs(y = 'fuel efficiency', x = 'engine size')
61 |
62 |
63 | ##
64 | # With R, we can get maps as well:
65 | install.packages('maps')
66 | ggplot(map_data('world'), aes(long, lat, group = group)) +
67 |   geom_polygon(fill = 'white', colour = 'black') +
68 |   coord_quickmap()
69 |
70 | ##
71 | # Or other pretty cool stuff that we are going to learn through the course!
72 |
73 |
74 |
-------------------------------------------------------------------------------- /lecture00-intro/test_Rmarkdown.Rmd: --------------------------------------------------------------------------------
1 | ---
2 | title: "Tryout"
3 | output: pdf_document
4 | ---
5 |
6 | ```{r setup, include=FALSE}
7 | knitr::opts_chunk$set(echo = TRUE)
8 | ```
9 |
10 | ## R Markdown
11 |
12 | This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
13 |
14 | When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
15 |
16 | ```{r cars}
17 | summary(cars)
18 | ```
19 |
20 | ## Including Plots
21 |
22 | You can also embed plots, for example:
23 |
24 | ```{r pressure, echo=FALSE}
25 | plot(pressure)
26 | ```
27 |
28 | Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
29 |
-------------------------------------------------------------------------------- /lecture01-coding-basics/README.md: --------------------------------------------------------------------------------
1 | # Lecture 01: Coding basics
2 |
3 | ## Motivation
4 |
5 | Coding has its charm: you can do anything if you know the basics. We take the traditional programming approach and first introduce the building blocks of coding. This may seem cumbersome at first sight (e.g. in contrast to Hadley Wickham and Garrett Grolemund's R for Data Science), but it leads to understanding the basic principles of coding. It is also a great help when searching for solutions on the web, as most of the solutions you find are built from these basic blocks.
6 |
7 | ## This lecture
8 |
9 | This is the first coding lecture, which introduces students to coding in R.
10 | It is a **live coding class**, where the teaching material is detailed in [`coding_basics.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.md).
11 |
12 |
13 | ## Learning outcomes
14 | After successfully live-coding the material ([`coding_basics.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.md)), students will have knowledge of
15 |
16 | - What RStudio is.
17 | - How to run a command via the console.
18 | - What a script is and how to create one.
19 | - The difference between a script and a project.
20 | - The basics of coding:
21 |   - how to name your variables, how to use comments, and how to format your code
22 |   - what an R-object is and what operations one can do with it (numeric, logical, characters, and factors)
23 |   - what a built-in function is and how it works
24 |   - how to determine the type of an R-object via functions
25 |   - how to create a vector, what the vector operations are, and how to get its elements via indexing.
26 |   - what the special variables/values/issues are (empty variable, NA, Inf, precision)
27 |   - what the different variable types are
28 |   - how to create different vectors
29 |   - how to create lists and index into them
30 |
31 | ## Datasets used
32 |
33 | - No dataset is used in this lecture
34 |
35 | ## Lecture Time
36 |
37 | Ideal overall time: **100 mins**.
38 |
39 | This lecture time is one of the hardest to predict, as it solely depends on the background of the students. This is an introductory class showing how coding works in general. Always aim for the students with the least knowledge. Note, however, that there are extra 'good-to-know' parts that can be skipped.
40 |
41 | This lecture assumes that R and RStudio already work on their laptops.
42 |
43 | ## Homework
44 |
45 | *Type*: quick practice, approx 15 mins, 7+1 lines of code. See [`assignment_1.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/assignment_1.R).
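
For a quick taste of the building blocks listed above, here is a minimal sketch (see [`first_script.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/first_script.R) for the full session):

```r
v <- c(2, 5, 10)                 # a numeric vector
v[2:3]                           # indexing: elements 2 and 3
my_list <- list('a', v, 2 == 3)  # a list can mix types
my_list[[2]][1]                  # [[ ]] unwraps an element, [ ] indexes into it
```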
46 |
47 | ## Further material
48 |
49 | - Hadley Wickham and Garrett Grolemund R for Data Science [Chapter 4](https://r4ds.had.co.nz/workflow-basics.html) provides some basic principles and console usage. [Chapter 6](https://r4ds.had.co.nz/workflow-scripts.html) deals with scripts and some error handling. [Chapter 8](https://r4ds.had.co.nz/workflow-projects.html) shows how to work with projects, along with useful setup options and working directory settings. [Chapter 20](https://r4ds.had.co.nz/vectors.html) provides a similar but more detailed discussion.
50 | - Kieran H. (2019): Data Visualization [Chapter 2.2-2.3](https://socviz.co/gettingstarted.html#use-r-with-rstudio) introduces the window structure in RStudio pretty well (Chapter 2.2) and basic syntax, objects, libraries (Chapter 2.3)
51 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 02](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/02_code_style.Rmd) provides useful guidelines on how to write and format code. [Lecture 03](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/03_1d_data.Rmd) provides some additional exercises and insights into R-objects, variable types, and some basic functions.
52 | - <https://style.tidyverse.org/> provides a great overview of good coding style with the `tidyverse` approach.
53 |
54 |
55 | ## File structure
56 |
57 | - [`coding_basics.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.md) provides material for the live coding session with explanations.
58 | - [`coding_basics.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.Rmd) is the generating Rmarkdown file for [`coding_basics.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.md)
59 | - [**`first_script.R`**](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/first_script.R) is a possible realization of the live coding session
60 | - [`assignment_1.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/assignment_1.R) is the assignment after the first lecture.
61 |
-------------------------------------------------------------------------------- /lecture01-coding-basics/assignment_1.R: --------------------------------------------------------------------------------
1 | #########################
2 | ##                     ##
3 | ##    Assignment 1     ##
4 | ##                     ##
5 | ##   Coding basics     ##
6 | ##                     ##
7 | ##    Deadline:        ##
8 | ##                     ##
9 | ##                     ##
10 | #########################
11 |
12 | ##
13 | # Task: fill out the blank parts with the needed commands
14 |
15 | # 1) Add the command which clears the environment
16 |
17 |
18 | # 2) Create a string variable, which states: 'This is my first assignment in R!'
19 | str_var <-
20 |
21 | # 3) Decide with the proper command whether this is truly a string variable or not:
22 |
23 |
24 | # 4) Create vector 'v', which contains the values of: 3, 363, 777, 2021, -987 and Inf
25 | v <-
26 |
27 | # 5) Multiply this vector by 10 and name it v_10
28 |
29 |
30 | # 6) Create a list, which contains the 'str_var' variable and the 'v' vector
31 | mL <-
32 |
33 | # 7) Get the value of 'Inf' out of this 'mL' variable with indexing
34 |
35 |
36 | # +1) decide whether the previously extracted value is infinite or not.
37 | # The result should be a logical value.
38 |
39 |
40 |
41 |
-------------------------------------------------------------------------------- /lecture01-coding-basics/first_script.R: --------------------------------------------------------------------------------
1 | ##################################
2 | #                                #
3 | #          Lecture 01            #
4 | #                                #
5 | #    Introduction to coding      #
6 | #                                #
7 | #    - R-objects                 #
8 | #    - Variables                 #
9 | #    - Built in functions        #
10 | #    - Vectors                   #
11 | #    - Indexing                  #
12 | #    - Special values            #
13 | #    - Lists                     #
14 | #                                #
15 | #                                #
16 | ##################################
17 |
18 |
19 | ##
20 | # R-objects:
21 |
22 | # Character
23 | myString <- 'Hello world!'
24 |
25 | # Convention to name your variables
26 | my_fav_var <- 'bla'
27 | myFavVar <- 'bla'
28 | # Rarely use long names such as
29 | my_favourite_variable <- 'bla'
30 |
31 |
32 | # We can define numeric R-objects:
33 | a <- 2
34 | b <- 3
35 |
36 | # And do mathematical operations:
37 | a+b-(a*b)^a
38 |
39 | c <- a + b
40 | d <- a*c/b*c
41 |
42 | # Or create logical R-object:
43 | a == b
44 | 2 == 3
45 | (a + 1) == b
46 | # negation:
47 | a != b
48 |
49 | # other logical operators for multiple statements
50 | 2 == 2 & 3 == 2
51 | 2 == 2 | 3 == 2
52 |
53 | ##
54 | # Functions:
55 |
56 | # Remove variables from workspace
57 | rm(d)
58 | # or calculate square root:
59 | sqrt(4)
60 | # if not sure what a function does:
61 | ?sqrt
62 |
63 | ##
64 | # Type of R-objects:
65 | typeof(myString)
66 | typeof(a)
67 |
68 | # Numeric values: integer and double
69 | num_val <- as.numeric(1.2)
70 | doub_val <- as.double(1.2)
71 | int_val <- as.integer(1.2)
72 | typeof(num_val)
73 |
74 | # Decide what type a variable has:
75 | is.character(myString)
76 | is.logical(2==3)
77 | is.double(doub_val)
78 | is.integer(int_val)
79 | is.numeric(doub_val)
80 | is.numeric(int_val)
81 | is.integer(doub_val)
82 | is.double(int_val)
83 |
84 | ##
85 | # Create vectors
86 | v <- c(2,5,10)
87 | # Operations with vectors
88 | z <- c(3,4,7)
89 |
90 | v+z
91 | v*z
92 | a+v
93 |
94 | # Number of elements
95 | num_v <- length(v)
96 | num_v
97 |
98 | # Create vector from vectors
99 | w <- c(v,z)
100 | w
101 | length(w)
102 | # R is case-sensitive: gives an error
103 | length(W)
104 |
105 | # Note: be careful with operations on vectors of different lengths
106 | q <- c(2,3)
107 | v+q
108 | v+c(2,3,2)
109 |
110 | # Indexing with vectors: goes with []
111 | v[1]
112 | v[2:3]
113 | v[c(1,3)]
114 |
115 | # Fix the addition of v+q
116 | v[1:2] + q
117 |
118 |
119 | ## Special variables/values/issues:
120 | null_vector <- c()
121 | # NaN and NA values
122 | nan_vec <- c(NaN,1,2,3,4)
123 | na_vec <- c(NA,1,2,3,4)
124 | nan_vec + 3
125 | # Inf values
126 | inf_val <- Inf
127 | 5/0
128 | # Rounding issues
129 | sqrt(2)^2 == 2
130 | # and fix it:
131 | round(sqrt(2)^2) == 2
132 |
133 |
134 | ####
135 | # Lists
136 | my_list <- list('a',2,0==1)
137 | my_list2 <- list(c('a','b'),c(1,2,3),sqrt(2)^2==2)
138 |
139 | # indexing with lists:
140 | # you get the list's value - still a list (typeof(my_list2[1]))
141 | my_list2[1]
142 | # you get the vector's value - it is a character (typeof(my_list2[[1]]))
143 | my_list2[[1]]
144 | # you get the second element from the vector
145 | my_list2[[1]][2]
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/README.md: --------------------------------------------------------------------------------
1 | # Lecture 02: Import and Export data to R
2 |
3 | ## Motivation
4 |
5 | Data doesn’t grow on trees but needs to be collected with a lot of effort, and it’s essential
to have high-quality data to get meaningful answers to our questions. In the end, data quality is determined by how the data was collected. Thus, it’s fundamental for data analysts to understand various data collection methods, how they affect data quality in general, and what the details of the actual collection of their data imply for its quality. The most important methods of data collection used in business, economics, and policy analysis, such as web scraping, using administrative sources, and conducting surveys, all produce data that needs to be imported into R.
6 |
7 | ## This lecture
8 |
9 | This lecture introduces students to importing and exporting data in R with `readr` from `tidyverse`. Various import techniques and formats are discussed, along with several options for exporting data to the local computer.
10 |
11 |
12 | ## Learning outcomes
13 | After successfully completing the code in *raw_codes* students should be able to:
14 |
15 | [`dataset_handling.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes/dataset_handling.R)
16 | - Import data in *csv* or other formats via
17 |   - clicking through the built-in options
18 |   - using a local path
19 |   - downloading directly via a URL
20 |   - using an API, namely the `tidyquant` and `WDI` packages.
21 | - Export data in *csv*, *xlsx* or *RData* format to the local computer
22 |
23 | ## Datasets used
24 |
25 | * [Hotels Vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna)
26 | * [Football](https://gabors-data-analysis.com/datasets/#football) as homework.
27 |
28 |
29 | ## Lecture Time
30 |
31 | Ideal overall time: **10-20 mins**.
32 |
33 | Showing [`dataset_handling.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes/dataset_handling.R) takes around *10 minutes*, while doing the tasks takes the rest.
34 |
35 |
36 | ## Homework
37 |
38 | *Type*: quick practice, approx 10 mins
39 |
40 | Import the [football](https://osf.io/zqm6c/) data tables from OSF. To be more precise, you should import the table containing the managers' characteristics (`football_managers.csv`) and the teams' football performance (`football_managers_workfile.csv`). Make sure to use a tidy folder structure: create a data folder with raw and clean subfolders. For this time only, export the same data tables into an export folder as `xlsx` and `.RData` files.
41 |
42 | ## Further material
43 |
44 | - Hadley Wickham and Garrett Grolemund R for Data Science: [Chapter 11](https://r4ds.had.co.nz/data-import.html) provides an overview of data import and export along with a detailed discussion of how these methods work and how the tidyverse approaches them.
45 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 02](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/02_computational_reproducibility.Rmd) provides useful further guidelines on how to organize the folder structure and how to export and import data/figures/etc.
46 |
47 |
48 | ## Folder structure
49 |
50 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes) includes one script, which is ready to use during the course but requires some live coding in class.
- [`dataset_handling.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes/dataset_handling.R)
52 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/complete_codes) includes one script with the solutions:
53 |   - [`dataset_handling_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/complete_codes/dataset_handling_fin.R) is the solution for [`dataset_handling.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes/dataset_handling.R)
54 | - [data/hotels_vienna](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture02-data-imp-n-exp/data/hotels_vienna) provides a folder structure for the class. It contains data that will be used during the lecture as well as folders for the outputs.
55 |   - [clean](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture02-data-imp-n-exp/data/hotels_vienna/clean) - this is a great example of how to organize a project's cleaned data folder.
56 |   - [raw](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture02-data-imp-n-exp/data/hotels_vienna/raw) - includes the raw files. During the lecture, you should save the hotel bookings data as `hotelbookingdata.csv` into this folder.
57 |   - [export](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture02-data-imp-n-exp/data/hotels_vienna/export) - is the folder where you should export all the files during the course.
58 |
59 |
60 |
61 |
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/complete_codes/dataset_handling_fin.R: --------------------------------------------------------------------------------
1 | ##################################
2 | #                                #
3 | #          Lecture 02            #
4 | #                                #
5 | #    Import and Export Data      #
6 | #         to R with              #
7 | #                                #
8 | #  - Importing with clicking     #
9 | #  - read_csv():                 #
10 | #    - local and url             #
11 | #    - working directory         #
12 | #  - Export                      #
13 | #    - write_csv                 #
14 | #    - xlsx package              #
15 | #    - save to RData             #
16 | #  - API:                        #
17 | #    - tidyquant and WDI         #
18 | #                                #
19 | #                                #
20 | ##################################
21 |
22 | rm(list = ls())
23 | # Tidyverse includes the readr package
24 | # which we use for importing data!
25 | library(tidyverse)
26 |
27 |
28 |
29 | ####################
30 | ## Importing data:
31 | # 3 options to import data:
32 |
33 |
34 | #####
35 | # 1) Import by clicking: File -> Import Dataset ->
36 | #     -> From Text (readr) / this is for csv. You may use other options to import other specific formats
37 | #
38 | # Notes:
39 | #   - Do this exercise to find your data and notice that the import command shows up in the console.
40 | #       If the second option does not work, check the path in the console!
41 | #   - Check the library that the import command used: it is called 'readr' which is part of 'tidyverse'!
42 | #       You should avoid calling libraries multiple times, thus if tidyverse is already imported,
43 | #       there is no need to import readr again.
44 | #       (But in this case it will not cause any problem. It may be a problem if you call different versions!)
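#
# A minimal sketch (hypothetical relative path) of what the import wizard
#   prints to the console after you click through it -- copy that command
#   into your script so the import is reproducible:
# library(readr)
# hotels_vienna <- read_csv('data/hotels_vienna/clean/hotels-vienna.csv')
# View(hotels_vienna)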
45 |
46 | #######
47 | # 2) Import by defining your path:
48 | # a) use an absolute path (you have to know the path of your csv from the root folder)
49 |
50 | data_in <- '~/Documents/Egyetem/Bekes_Kezdi_Textbook/da-coding-rstats/lecture02-data-imp_n_exp/data/hotels_vienna/'
51 | df_0 <- read_csv(paste0(data_in,'clean/hotels-vienna.csv'))
52 |
53 | # b) use relative path:
54 | # R works in a specific folder called the `working directory`, which you can check by:
55 | getwd()
56 |
57 | # after that, you can set your working directory by:
58 | setwd(data_in)
59 | # and simply call the data
60 | df_1 <- read_csv('clean/hotels-vienna.csv')
61 |
62 |
63 | # delete your data
64 | rm(hotels_vienna, df_0, df_1)
65 |
66 |
67 | ########
68 | # 3) Import by using url - this is going to be our preferred method in this course!
69 | # Note: importing from the web is generally inferior to using your local disk,
70 | #   but there are some exceptions:
71 | #   a) The data is considerably large (>1GB)
72 | #   b) It is important that there is no `refresh` or change in the data
73 | #   in these cases it is good practice to download the data to your computer
74 |
75 | # You can access (almost) all the data from 'OSF'
76 | # the hotels vienna dataset has the following url:
77 | df <- read_csv(url('https://osf.io/y6jvb/download'))
78 |
79 |
80 | ###
81 | # Quick check on the data:
82 |
83 | # glimpse on data
84 | glimpse(df)
85 |
86 | # Check some of the first observations
87 | head(df)
88 |
89 | # Have a built-in summary for the variables
90 | summary(df)
91 |
92 |
93 | ###########################
94 | # Exporting your data:
95 | #
96 | # This is a special case: data_out is now the same as data_in (no cleaning...)
97 | data_out <- paste0(data_in, '/export/')
98 | write_csv(df, paste0(data_out, 'my_csvfile.csv'))
99 |
100 | # If for some reason you would like to export as xls(x)
101 | install.packages('writexl')
102 | library(writexl)
103 | write_xlsx(df, paste0(data_out, 'my_csvfile.xlsx'))
104 |
105 | # Third option is to save as an R object
106 | save(df, file = paste0(data_out, 'my_rfile.RData'))
107 |
108 | ######
109 | # Extra: using API
110 | #   - tq_get - get stock prices from Yahoo/Google/FRED/Quandl, etc.
111 | #   - WDI - get various data from the World Bank's site
112 | #
113 |
114 | # tidyquant
115 | install.packages('tidyquant')
116 | library(tidyquant)
117 | # Apple stock prices from Yahoo
118 | aapl <- tq_get('AAPL',
119 |                from = '2020-01-01',
120 |                to = '2021-10-01',
121 |                get = 'stock.prices')
122 |
123 | glimpse(aapl)
124 |
125 | # World Bank
126 | install.packages('WDI')
127 | library(WDI)
128 | # How WDI works - it is an API
129 | # Search for variables which contain GDP
130 | a <- WDIsearch('gdp')
131 | # Narrow down the search for: GDP + something + capita + something + constant
132 | a <- WDIsearch('gdp.*capita.*constant')
133 | # Get data
134 | gdp_data <- WDI(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2019, end=2019)
135 |
136 | glimpse(gdp_data)
137 |
138 | ##
139 | # Tasks:
140 | #
141 | # 1) Go to the webpage: https://gabors-data-analysis.com/ and find the OSF database under `Data and Code`
142 | # 2) Go to Gabor's OSF database and manually download
143 | #     the `hotelbookingdata.csv` from the `hotels-europe` dataset to your computer and save it to the 'raw' folder.
144 | # 3) load the data from this path
145 | # 4) also load the data directly from the web (note you need to add `/download` to the url)
146 | # 5) write out this file as xlsx and as an .RData next to the original data.
147 |
148 | # Load from path
149 | df_t0 <- read_csv(paste0(data_in,'raw/hotelbookingdata.csv'))
150 | # Load from web
151 | df_t1 <- read_csv('https://osf.io/yzntm/download')
152 | # Write as xlsx
153 | write_xlsx(df_t1, paste0(data_out, 'hotelbookingdata.xlsx'))
154 | # Write as .RData
155 | save(df_t1, file = paste0(data_out, 'hotelbookingdata.RData'))
156 |
157 |
158 |
159 |
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/data/hotels_vienna/clean/README.md: --------------------------------------------------------------------------------
1 | ****************************************************************
2 | Prepared for Gabor's Data Analysis
3 |
4 | Data Analysis for Business, Economics, and Policy
5 | by Gabor Bekes and Gabor Kezdi
6 | Cambridge University Press 2021
7 | gabors-data-analysis.com
8 |
9 | Description of the
10 | hotels-vienna dataset
11 |
12 | used in case studies
13 | 2A Finding a good deal among hotels: data preparation
14 | 3A Finding a good deal among hotels: data exploration
15 | 7A Finding a good deal among hotels with simple regression
16 | 8A Finding a good deal among hotels with nonlinear function
17 | 8C Hotel ratings and measurement error
18 | 10B Finding a good deal among hotels with multiple regression
19 |
20 | ****************************************************************
21 |
22 | [see it on the website](https://gabors-data-analysis.com/dat_hotels-vienna)
23 |
24 |
25 | This is a README file for the `hotels-vienna` dataset that includes information on price and features of hotels in Vienna for one date.
26 |
27 |
28 | ## Data source
29 |
30 | Scraped from a price comparison website.
31 | It was anonymized and slightly altered to ensure confidentiality. It contains all the necessary information about location and rating that helps to distinguish the hotels.
32 |
33 | ## Data access and copyright
34 |
35 | The data was collected by the authors and may be used for educational purposes only.
36 |
37 | ## About the data
38 |
39 | ### Raw data tables
40 |
41 | `hotelbookingdata-vienna.csv`
42 |
43 | The file contains data about hotel prices and features from a price comparison website.
44 | * for Vienna, Austria,
45 | * for a single weekday night in November 2017
46 | * The dataset has N=430 observations.
47 | * ID variable: hotel_id
48 |
49 |
50 | ### Tidy data table
51 |
52 | `hotels-vienna` is just a slightly cleaned version of the raw data excluding duplicates.
53 |
54 | * The dataset has N=428 observations.
55 | * ID variable: hotel_id
56 |
57 |
58 | | variable name       | info                                            | type      |
59 | |-------------------- |------------------------------------------------ |--------- |
60 | | hotel_id            | Hotel ID                                        | numeric  |
61 | | accommodation_type  | Type of accommodation                           | factor   |
62 | | country             | Country                                         | string   |
63 | | city                | City based on search                            | string   |
64 | | city_actual         | City actual of hotel                            | string   |
65 | | neighbourhood       | Neighbourhood                                   | string   |
66 | | center1label        | Centre 1 - name of location for distance        | string   |
67 | | distance            | Distance - from main city center                | numeric  |
68 | | center2label        | Centre 2 - name of location for distance_alter  | string   |
69 | | distance_alter      | Distance - alternative - from Centre 2          | numeric  |
70 | | stars               | Number of stars                                 | numeric  |
71 | | rating              | User rating average                             | numeric  |
72 | | rating_count        | Number of user ratings                          | numeric  |
73 | | ratingta            | User rating average (tripadvisor)               | numeric  |
74 | | ratingta_count      | Number of user ratings (tripadvisor)            | numeric  |
75 | | hotel_id            | Hotel ID                                        | numeric  |
76 | | year                | Year (YYYY)                                     | numeric  |
77 | | month               | Month (MM)                                      | numeric  |
78 | | weekend             | Flag, if day is a weekend                       | binary   |
79 | | holiday             | Flag, if day is a public holiday                | binary   |
80 | | nnights             | Number of nights                                | factor   |
81 | | price               | Price in EUR                                    | numeric  |
82 | | scarce_room         | Flag, if room was noted as scarce               | binary   |
83 | | offer               | Flag, if there was an offer available           | binary   |
84 | | offer_cat           | Type of offer                                   | factor   |
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/data/hotels_vienna/clean/VARIABLES.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture02-data-imp-n-exp/data/hotels_vienna/clean/VARIABLES.xlsx -------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/data/hotels_vienna/export/export_here: --------------------------------------------------------------------------------
1 | Here you should export the files you have created during the lecture.
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/data/hotels_vienna/raw/show_folder: --------------------------------------------------------------------------------
1 | To show raw folder on GitHub
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/raw_codes/dataset_handling.R: --------------------------------------------------------------------------------
1 | ##################################
2 | #                                #
3 | #          Lecture 02            #
4 | #                                #
5 | #    Import and Export Data      #
6 | #         to R with              #
7 | #                                #
8 | #  - Importing with clicking     #
9 | #  - read_csv():                 #
10 | #    - local and url             #
11 | #    - working directory         #
12 | #  - Export                      #
13 | #    - write_csv                 #
14 | #    - xlsx package              #
15 | #    - save to RData             #
16 | #  - API:                        #
17 | #    - tidyquant and WDI         #
18 | #                                #
19 | #                                #
20 | ##################################
21 |
22 | rm(list = ls())
23 | # Tidyverse includes the readr package
24 | # which we use for importing data!
25 | library(tidyverse)
26 |
27 |
28 |
29 | ####################
30 | ## Importing data:
31 | # 3 options to import data:
32 |
33 |
34 | #####
35 | # 1) Import by clicking: File -> Import Dataset ->
36 | #     -> From Text (readr) / this is for csv.
You may use other options to import other specific formats
37 | #
38 | # Notes:
39 | #   - Do this exercise to find your data and notice that the import command shows up in the console.
40 | #       If the second option does not work, check the path in the console!
41 | #   - Check the library that the import command used: it is called 'readr' which is part of 'tidyverse'!
42 | #       You should avoid calling libraries multiple times, thus if tidyverse is already imported,
43 | #       there is no need to import readr again.
44 | #       (But in this case it will not cause any problem. It may be a problem if you call different versions!)
45 |
46 | #######
47 | # 2) Import by defining your path:
48 | # a) use an absolute path (you have to know the path of your csv from the root folder)
49 |
50 | data_in <- '~/Documents/Egyetem/Bekes_Kezdi_Textbook/da-coding-rstats/lecture02-data-imp_n_exp/data/hotels_vienna/'
51 | df_0 <- read_csv(paste0(data_in,'clean/hotels-vienna.csv'))
52 |
53 | # b) use relative path:
54 | # R works in a specific folder called the `working directory`, which you can check by:
55 | getwd()
56 |
57 | # after that, you can set your working directory by:
58 | setwd(data_in)
59 | # and simply call the data
60 | df_1 <- read_csv('clean/hotels-vienna.csv')
61 |
62 |
63 | # delete your data
64 | rm(hotels_vienna, df_0, df_1)
65 |
66 |
67 | ########
68 | # 3) Import by using url - this is going to be our preferred method in this course!
69 | # Note: importing from the web is generally inferior to using your local disk,
70 | #   but there are some exceptions:
71 | #   a) The data is considerably large (>1GB)
72 | #   b) It is important that there is no `refresh` or change in the data
73 | #   in these cases it is good practice to download the data to your computer
74 |
75 | # You can access (almost) all the data from 'OSF'
76 | # the hotels vienna dataset has the following url:
77 | df <- read_csv(url('https://osf.io/y6jvb/download'))
78 |
79 |
80 | ###
81 | # Quick check on the data:
82 |
83 | # glimpse on data
84 | glimpse(df)
85 |
86 | # Check some of the first observations
87 | head(df)
88 |
89 | # Have a built-in summary for the variables
90 | summary(df)
91 |
92 |
93 | ###########################
94 | # Exporting your data:
95 | #
96 | # This is a special case: data_out is now the same as data_in (no cleaning...)
97 | data_out <- paste0(data_in, '/export/')
98 | write_csv(df, paste0(data_out, 'my_csvfile.csv'))
99 |
100 | # If for some reason you would like to export as xls(x)
101 | install.packages('writexl')
102 | library(writexl)
103 | write_xlsx(df, paste0(data_out, 'my_csvfile.xlsx'))
104 |
105 | # Third option is to save as an R object
106 | save(df, file = paste0(data_out, 'my_rfile.RData'))
107 |
108 | ######
109 | # Extra: using API
110 | #   - tq_get - get stock prices from Yahoo/Google/FRED/Quandl, etc.
111 | #   - WDI - get various data from the World Bank's site
112 | #
113 |
114 | # tidyquant
115 | install.packages('tidyquant')
116 | library(tidyquant)
117 | # Apple stock prices from Yahoo
118 | aapl <- tq_get('AAPL',
119 |                from = '2020-01-01',
120 |                to = '2021-10-01',
121 |                get = 'stock.prices')
122 |
123 | glimpse(aapl)
124 |
125 | # World Bank
126 | install.packages('WDI')
127 | library(WDI)
128 | # How WDI works - it is an API
129 | # Search for variables which contain GDP
130 | a <- WDIsearch('gdp')
131 | # Narrow down the search for: GDP + something + capita + something + constant
132 | a <- WDIsearch('gdp.*capita.*constant')
133 | # Get data
134 | gdp_data <- WDI(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2019, end=2019)
135 |
136 | glimpse(gdp_data)
137 |
138 | ##
139 | # Tasks:
140 | #
141 | # 1) Go to the webpage: https://gabors-data-analysis.com/ and find the OSF database under `Data and Code`
142 | # 2) Go to Gabor's OSF database and manually download
143 | #     the `hotelbookingdata.csv` from the `hotels-europe` dataset to your computer.
144 | # 3) load the data from this path
145 | # 4) also load the data directly from the web
146 | # 5) write out this file as xlsx and as an .RData next to the original data.
147 |
148 |
149 |
-------------------------------------------------------------------------------- /lecture03-tibbles/README.md: --------------------------------------------------------------------------------
1 | # Lecture 03: Tibbles
2 |
3 | ## Motivation
4 |
5 | How to start working with data? Clarifying the concept of tidy data helps us to carry out analysis in a tractable way. Tidy data tables have the same structure for storing observations and variables. We discuss potential issues with storing observations and variables, and how to deal with those issues. We describe good practices for the process of converting non-tidy data into a tidy data frame.
6 |
7 | ## This lecture
8 |
9 | This lecture introduces `tibble`-s as the 'Data' type of variable in `tidyverse`. It shows multiple column and row manipulations with one `tibble`, as well as how to merge two `tibble`s. It uses pre-written code with tasks during the class.
10 |
11 | Data merging is based on [Chapter 02, C: Identifying successful football managers](https://gabors-data-analysis.com/casestudies/#ch02c-identifying-successful-football-managers).
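
As a flavor of the merging covered in this lecture, here is a minimal sketch with made-up toy tibbles (not the football data):

```r
library(tidyverse)

teams   <- tibble(id = 1:3, team = c('A', 'B', 'C'))
coaches <- tibble(id = c(1, 2, 4), coach = c('X', 'Y', 'Z'))

left_join(teams, coaches, by = 'id')   # keeps all rows of teams; coach is NA for id 3
full_join(teams, coaches, by = 'id')   # keeps all ids from both tibbles
```

Which join to use depends on which table's identifiers you want to keep; the lecture walks through each variant.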
12 | 13 | 14 | ## Learning outcomes 15 | After successfully completing codes in [`intro_to_tibbles.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes/intro_to_tibbles.R) students should be able to: 16 | 17 | - understand what a 'Data' variable is, why to use `tibble`, and how it relates to `data_frame` and `data.frame` 18 | - How to do indexing with a tibble 19 | - indexing with integer numbers 20 | - indexing with logicals 21 | - when to use which and what are the connections 22 | - How to use simple functions with tibbles 23 | - `sum`, `mean`, `sd`, `add_column`, `select`, `add_row` 24 | - How to: 25 | - reset a cell's value in a tibble 26 | - add or remove a column (or variable) 27 | - add or remove a row (or an observation) 28 | - Wide vs long format and how to convert one to the other 29 | - `pivot_wider` and `pivot_longer` functions 30 | - Merging - different ways to merge two tibbles: 31 | - new/other rows/observations are in the new tibble 32 | - new/other columns/variables are in the new tibble 33 | - difference between: `left_join`, `right_join`, `full_join` and `inner_join` 34 | - importance of the identifier variables and cases of non-unique identification 35 | - `all_equal` to compare tibbles 36 | - extra: `semi_join` and `anti_join` 37 | 38 | ## Datasets used 39 | 40 | - [Football](https://gabors-data-analysis.com/datasets/#football) 41 | 42 | ## Lecture Time 43 | 44 | Ideal overall time: **30-40 mins**. 45 | 46 | Showing [`intro_to_tibbles.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes/intro_to_tibbles.R) takes around *20-25 minutes*, while doing the tasks would take the rest. 47 | 48 | 49 | ## Homework 50 | 51 | *Type*: quick practice, approx 15 mins 52 | 53 | Use the created tibble from class (called `df`) and create two new tibbles -- called `df_2` and `df_3` -- with the following values: 54 | 55 | `df_2`: 56 | 57 | | id | age | grade | gender | 58 | | -- | --- | ----- | ------ | 59 | | 10 | 16 | C | F | 60 | | 11 | 40 | A | F | 61 | | 12 | 52 | B- | M | 62 | | 13 | 24 | C+ | M | 63 | | 14 | 28 | B+ | M | 64 | | 15 | 26 | A- | F | 65 | 66 | `df_3`: 67 | 68 | | id | height | 69 | | -- | ------ | 70 | | 1 | 165 | 71 | | 3 | 200 | 72 | | 5 | 187 | 73 | | 10 | 175 | 74 | | 12 | 170 | 75 | 76 | Do the following manipulations: 77 | 78 | - add `df_2` to `df` and call the newly merged tibble `df_m` 79 | - merge `df_3` to `df_m` such that you keep *all* id values (adding missing values where needed), call it `df_m2` 80 | - merge `df_3` to `df_m` such that you keep only ids with no missing values, call it `df_m3` 81 | - create a wide format from `df_m2` with names from grades and values from age (not really meaningful, but good practice) 82 | 83 | 84 | ## Further material 85 | 86 | - Hadley Wickham and Garrett Grolemund: R for Data Science: [Chapter 12](https://r4ds.had.co.nz/tidy-data.html) introduces the tidy approach and works with tibbles. [Chapter 13](https://r4ds.had.co.nz/relational-data.html) provides a detailed discussion on merging. 87 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 05](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/05_tidy_data.Rmd) provides further useful guidelines on the tidy approach and merging.
88 | - Another interesting material on this topic is by [Hansjörg Neth: Data Science for Psychologists](https://bookdown.org/hneth/ds4psy/), especially [Chapter 7.2](https://bookdown.org/hneth/ds4psy/7-2-tidy-essentials.html) on wide vs long format and [Chapter 8](https://bookdown.org/hneth/ds4psy/8-join.html) on merging. 89 | 90 | 91 | ## Folder structure 92 | 93 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes) includes one code, which is ready to use during the course but requires some live coding in class. 94 | - [`intro_to_tibbles.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes/intro_to_tibbles.R) 95 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/complete_codes) includes one code with solutions for 96 | - [`intro_to_tibbles_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/complete_codes/intro_to_tibbles_fin.R) solution for: [`intro_to_tibbles.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes/intro_to_tibbles.R) 97 | 98 | 99 | 100 | -------------------------------------------------------------------------------- /lecture03-tibbles/data/games.csv: -------------------------------------------------------------------------------- 1 | team,manager_id,manager_name,manager_games 2 | Arsenal,19,Arsène Wenger,380 3 | Arsenal,238,Unai Emery,38 4 | Aston Villa,14,Alex McLeish,38 5 | Aston Villa,71,Eric Black,7 6 | Aston Villa,80,Gary McAllister,5 7 | Aston Villa,96,Gérard Houllier,30 8 | Aston Villa,131,Kevin MacDonald,3 9 | Aston Villa,146,Martin O'Neill,76 10 | Aston Villa,173,Paul Lambert,101 11 | Aston Villa,204,Rémi Garde,21 12 | Aston Villa,229,Tim Sherwood,23 13 | Birmingham,14,Alex McLeish,76 14 | Blackburn,171,Paul Ince,17 15 | Blackburn,205,Sam Allardyce,76 16 | Blackburn,216,Steve Kean,59 17 | Blackpool,102,Ian Holloway,38 18 | Bolton,81,Gary Megson,56 19 | Bolton,165,Owen Coyle,96 20 | Bournemouth,68,Eddie Howe,152 21 | Brighton,40,Chris Hughton,76 22 | Burnley,32,Brian Laws,18 23 | Burnley,165,Owen Coyle,20 24 | Burnley,209,Sean Dyche,152 25 | Cardiff,57,David Kerslake,2 26 | Cardiff,138,Malky Mackay,18 27 | Cardiff,158,Neil Warnock,38 28 | Cardiff,163,Ole Gunnar Solskjær,18 29 | Chelsea,16,André Villas-Boas,27 30 | Chelsea,17,Antonio Conte,76 31 | Chelsea,37,Carlo Ancelotti,76 32 | Chelsea,95,Guus Hiddink,35 33 | Chelsea,120,José Mourinho,92 34 | Chelsea,137,Luiz Felipe Scolari,25 35 | Chelsea,149,Maurizio Sarri,38 36 | Chelsea,184,Rafael Benítez,26 37 | Chelsea,192,Roberto Di Matteo,23 38 | Crystal Palace,9,Alan Pardew,73 39 | Crystal Palace,77,Frank de Boer,4 40 | Crystal Palace,102,Ian Holloway,8 41 | Crystal Palace,124,Keith Millen,7 42 | Crystal Palace,158,Neil Warnock,16 43 | Crystal Palace,199,Roy Hodgson,72 44 | Crystal Palace,205,Sam Allardyce,21 45 | Crystal Palace,235,Tony Pulis,27 46 | Everton,58,David Moyes,190 47 | Everton,61,David Unsworth,6 48 | Everton,140,Marco Silva,38 49 | Everton,194,Roberto Martínez,113 50 | Everton,197,Ronald Koeman,47 51 | Everton,205,Sam Allardyce,24 52 | Fulham,46,Claudio Ranieri,16 53 | Fulham,72,Felix Magath,12 54 | Fulham,143,Mark Hughes,38 55 | Fulham,145,Martin Jol,89 56 | Fulham,189,René Meulensteen,13 57 | Fulham,199,Roy Hodgson,76 58 | Fulham,208,Scott Parker,10 59 | Fulham,211,Slaviša Jokanović,12 60 | Huddersfield,62,David Wagner,60 61 | Huddersfield,105,Jan Siewert,15 
62 | Huddersfield,142,Mark Hudson,1 63 | Hull,100,Iain Dowie,9 64 | Hull,140,Marco Silva,18 65 | Hull,154,Mike Phelan,20 66 | Hull,213,Steve Bruce,76 67 | Leicester,28,Brendan Rodgers,11 68 | Leicester,45,Claude Puel,56 69 | Leicester,46,Claudio Ranieri,63 70 | Leicester,49,Craig Shakespeare,21 71 | Leicester,150,Michael Appleton,1 72 | Leicester,160,Nigel Pearson,38 73 | Liverpool,28,Brendan Rodgers,122 74 | Liverpool,126,Kenny Dalglish,56 75 | Liverpool,184,Rafael Benítez,76 76 | Liverpool,199,Roy Hodgson,20 77 | Man City,31,Brian Kidd,2 78 | Man City,139,Manuel Pellegrini,114 79 | Man City,143,Mark Hughes,54 80 | Man City,175,Pep Guardiola,114 81 | Man City,193,Roberto Mancini,134 82 | Man United,12,Alex Ferguson,190 83 | Man United,58,David Moyes,34 84 | Man United,120,José Mourinho,93 85 | Man United,136,Louis van Gaal,76 86 | Man United,163,Ole Gunnar Solskjær,21 87 | Man United,203,Ryan Giggs,4 88 | Middlesbrough,3,Aitor Karanka,27 89 | Middlesbrough,78,Gareth Southgate,38 90 | Middlesbrough,212,Steve Agnew,11 91 | Newcastle,9,Alan Pardew,156 92 | Newcastle,10,Alan Shearer,8 93 | Newcastle,40,Chris Hughton,19 94 | Newcastle,112,Joe Kinnear,24 95 | Newcastle,114,John Carver,18 96 | Newcastle,129,Kevin Keegan,3 97 | Newcastle,184,Rafael Benítez,86 98 | Newcastle,217,Steve McClaren,28 99 | Norwich,15,Alex Neil,38 100 | Norwich,40,Chris Hughton,71 101 | Norwich,157,Neil Adams,5 102 | Norwich,173,Paul Lambert,38 103 | Portsmouth,21,Avram Grant,25 104 | Portsmouth,97,Harry Redknapp,8 105 | Portsmouth,170,Paul Hart,27 106 | Portsmouth,231,Tony Adams,16 107 | QPR,42,Chris Ramsey,15 108 | QPR,97,Harry Redknapp,49 109 | QPR,143,Mark Hughes,30 110 | QPR,158,Neil Warnock,20 111 | Reading,34,Brian McDermott,29 112 | Reading,159,Nigel Adkins,8 113 | Reading,262,Eamonn Dolan,1 114 | Southampton,45,Claude Puel,38 115 | Southampton,143,Mark Hughes,22 116 | Southampton,147,Mauricio Pellegrino,30 117 | Southampton,148,Mauricio Pochettino,54 118 | Southampton,185,Ralph Hasenhüttl,24 119 | Southampton,197,Ronald Koeman,76 120 | Stoke,143,Mark Hughes,174 121 | Stoke,173,Paul Lambert,16 122 | Stoke,235,Tony Pulis,190 123 | Sunderland,58,David Moyes,38 124 | Sunderland,65,Dick Advocaat,17 125 | Sunderland,94,Gus Poyet,60 126 | Sunderland,127,Kevin Ball,2 127 | Sunderland,146,Martin O'Neill,56 128 | Sunderland,166,Paolo Di Canio,12 129 | Sunderland,190,Ricky Sbragia,23 130 | Sunderland,205,Sam Allardyce,30 131 | Sunderland,213,Steve Bruce,89 132 | Swansea,7,Alan Curtis,7 133 | Swansea,25,Bob Bradley,11 134 | Swansea,28,Brendan Rodgers,38 135 | Swansea,38,Carlos Carvalhal,18 136 | Swansea,73,Francesco Guidolin,24 137 | Swansea,79,Garry Monk,67 138 | Swansea,134,Leon Britton,2 139 | Swansea,151,Michael Laudrup,62 140 | Swansea,168,Paul Clement,37 141 | Tottenham,16,André Villas-Boas,54 142 | Tottenham,97,Harry Redknapp,144 143 | Tottenham,121,Juande Ramos,8 144 | Tottenham,148,Mauricio Pochettino,190 145 | Tottenham,229,Tim Sherwood,22 146 | Watford,106,Javi Gracia,52 147 | Watford,140,Marco Silva,24 148 | Watford,183,Quique Sánchez Flores,38 149 | Watford,240,Walter Mazzarri,38 150 | West Brom,8,Alan Irvine,19 151 | West Brom,9,Alan Pardew,18 152 | West Brom,52,Darren Moore,6 153 | West Brom,81,Gary Megson,2 154 | West Brom,123,Keith Downing,5 155 | West Brom,150,Michael Appleton,1 156 | West Brom,176,Pepe Mel,18 157 | West Brom,192,Roberto Di Matteo,25 158 | West Brom,199,Roy Hodgson,50 159 | West Brom,214,Steve Clarke,53 160 | West Brom,233,Tony Mowbray,38 161 | West Brom,235,Tony Pulis,107 162 | West 
Ham,6,Alan Curbishley,3 163 | West Ham,21,Avram Grant,36 164 | West Ham,58,David Moyes,27 165 | West Ham,85,Gianfranco Zola,72 166 | West Ham,130,Kevin Keen,3 167 | West Ham,139,Manuel Pellegrini,38 168 | West Ham,205,Sam Allardyce,114 169 | West Ham,210,Slaven Bilić,87 170 | Wigan,194,Roberto Martínez,152 171 | Wigan,213,Steve Bruce,38 172 | Wolves,152,Mick McCarthy,101 173 | Wolves,162,Nuno Espírito Santo,38 174 | Wolves,226,Terry Connor,13 175 | -------------------------------------------------------------------------------- /lecture03-tibbles/data/points.csv: -------------------------------------------------------------------------------- 1 | team,manager_id,manager_name,manager_points 2 | Arsenal,19,Arsène Wenger,721 3 | Arsenal,238,Unai Emery,70 4 | Aston Villa,71,Eric Black,1 5 | Aston Villa,80,Gary McAllister,8 6 | Aston Villa,96,Gérard Houllier,34 7 | Aston Villa,131,Kevin MacDonald,6 8 | Aston Villa,146,Martin O'Neill,126 9 | Aston Villa,173,Paul Lambert,101 10 | Aston Villa,204,Rémi Garde,12 11 | Aston Villa,229,Tim Sherwood,20 12 | Birmingham,14,Alex McLeish,89 13 | Blackburn,171,Paul Ince,13 14 | Blackburn,205,Sam Allardyce,99 15 | Blackburn,216,Steve Kean,53 16 | Blackpool,102,Ian Holloway,39 17 | Bolton,81,Gary Megson,59 18 | Bolton,165,Owen Coyle,103 19 | Bournemouth,68,Eddie Howe,177 20 | Brighton,40,Chris Hughton,76 21 | Burnley,165,Owen Coyle,20 22 | Burnley,209,Sean Dyche,167 23 | Cardiff,57,David Kerslake,1 24 | Cardiff,138,Malky Mackay,17 25 | Cardiff,158,Neil Warnock,34 26 | Cardiff,163,Ole Gunnar Solskjær,12 27 | Chelsea,16,André Villas-Boas,46 28 | Chelsea,17,Antonio Conte,163 29 | Chelsea,37,Carlo Ancelotti,157 30 | Chelsea,95,Guus Hiddink,69 31 | Chelsea,120,José Mourinho,184 32 | Chelsea,137,Luiz Felipe Scolari,49 33 | Chelsea,149,Maurizio Sarri,72 34 | Chelsea,184,Rafael Benítez,51 35 | Chelsea,192,Roberto Di Matteo,42 36 | Crystal Palace,9,Alan Pardew,88 37 | Crystal Palace,77,Frank de Boer,0 38 | Crystal Palace,102,Ian Holloway,3 39 | Crystal Palace,124,Keith Millen,3 40 | Crystal Palace,158,Neil Warnock,15 41 | Crystal Palace,199,Roy Hodgson,93 42 | Crystal Palace,205,Sam Allardyce,26 43 | Crystal Palace,235,Tony Pulis,41 44 | Everton,58,David Moyes,297 45 | Everton,61,David Unsworth,10 46 | Everton,140,Marco Silva,54 47 | Everton,194,Roberto Martínez,163 48 | Everton,197,Ronald Koeman,69 49 | Everton,205,Sam Allardyce,34 50 | Fulham,46,Claudio Ranieri,12 51 | Fulham,72,Felix Magath,12 52 | Fulham,143,Mark Hughes,49 53 | Fulham,145,Martin Jol,105 54 | Fulham,189,René Meulensteen,10 55 | Fulham,199,Roy Hodgson,99 56 | Fulham,208,Scott Parker,9 57 | Fulham,211,Slaviša Jokanović,5 58 | Huddersfield,62,David Wagner,48 59 | Huddersfield,105,Jan Siewert,5 60 | Huddersfield,142,Mark Hudson,0 61 | Hull,100,Iain Dowie,6 62 | Hull,140,Marco Silva,21 63 | Hull,154,Mike Phelan,13 64 | Hull,180,Phil Brown,59 65 | Hull,213,Steve Bruce,72 66 | Leicester,28,Brendan Rodgers,20 67 | Leicester,45,Claude Puel,70 68 | Leicester,46,Claudio Ranieri,102 69 | Leicester,49,Craig Shakespeare,29 70 | Leicester,150,Michael Appleton,3 71 | Leicester,160,Nigel Pearson,41 72 | Liverpool,28,Brendan Rodgers,219 73 | Liverpool,122,Jürgen Klopp,296 74 | Liverpool,126,Kenny Dalglish,85 75 | Liverpool,184,Rafael Benítez,149 76 | Liverpool,199,Roy Hodgson,25 77 | Man City,31,Brian Kidd,3 78 | Man City,139,Manuel Pellegrini,231 79 | Man City,143,Mark Hughes,76 80 | Man City,175,Pep Guardiola,276 81 | Man City,193,Roberto Mancini,276 82 | Man United,12,Alex Ferguson,433 83 | Man United,58,David 
Moyes,57 84 | Man United,120,José Mourinho,176 85 | Man United,136,Louis van Gaal,136 86 | Man United,163,Ole Gunnar Solskjær,40 87 | Man United,203,Ryan Giggs,7 88 | Middlesbrough,3,Aitor Karanka,22 89 | Middlesbrough,78,Gareth Southgate,32 90 | Middlesbrough,212,Steve Agnew,6 91 | Newcastle,9,Alan Pardew,209 92 | Newcastle,10,Alan Shearer,5 93 | Newcastle,40,Chris Hughton,19 94 | Newcastle,112,Joe Kinnear,25 95 | Newcastle,114,John Carver,12 96 | Newcastle,129,Kevin Keegan,4 97 | Newcastle,184,Rafael Benítez,102 98 | Newcastle,217,Steve McClaren,24 99 | Norwich,15,Alex Neil,34 100 | Norwich,40,Chris Hughton,76 101 | Norwich,157,Neil Adams,1 102 | Norwich,173,Paul Lambert,47 103 | Portsmouth,21,Avram Grant,21 104 | Portsmouth,97,Harry Redknapp,13 105 | Portsmouth,170,Paul Hart,24 106 | Portsmouth,231,Tony Adams,11 107 | QPR,42,Chris Ramsey,11 108 | QPR,97,Harry Redknapp,40 109 | QPR,143,Mark Hughes,24 110 | QPR,158,Neil Warnock,17 111 | Reading,34,Brian McDermott,23 112 | Reading,159,Nigel Adkins,5 113 | Reading,262,Eamonn Dolan,0 114 | Southampton,45,Claude Puel,46 115 | Southampton,143,Mark Hughes,17 116 | Southampton,147,Mauricio Pellegrino,28 117 | Southampton,148,Mauricio Pochettino,75 118 | Southampton,159,Nigel Adkins,22 119 | Southampton,185,Ralph Hasenhüttl,30 120 | Southampton,197,Ronald Koeman,123 121 | Stoke,143,Mark Hughes,219 122 | Stoke,173,Paul Lambert,13 123 | Stoke,235,Tony Pulis,225 124 | Sunderland,58,David Moyes,24 125 | Sunderland,65,Dick Advocaat,15 126 | Sunderland,94,Gus Poyet,63 127 | Sunderland,127,Kevin Ball,0 128 | Sunderland,146,Martin O'Neill,65 129 | Sunderland,166,Paolo Di Canio,9 130 | Sunderland,190,Ricky Sbragia,21 131 | Sunderland,200,Roy Keane,15 132 | Sunderland,205,Sam Allardyce,36 133 | Sunderland,213,Steve Bruce,102 134 | Swansea,7,Alan Curtis,5 135 | Swansea,25,Bob Bradley,8 136 | Swansea,28,Brendan Rodgers,47 137 | Swansea,38,Carlos Carvalhal,20 138 | Swansea,73,Francesco Guidolin,32 139 | Swansea,79,Garry Monk,88 140 | Swansea,134,Leon Britton,1 141 | Swansea,151,Michael Laudrup,70 142 | Swansea,168,Paul Clement,41 143 | Tottenham,16,André Villas-Boas,99 144 | Tottenham,97,Harry Redknapp,250 145 | Tottenham,121,Juande Ramos,2 146 | Tottenham,229,Tim Sherwood,42 147 | Watford,106,Javi Gracia,65 148 | Watford,140,Marco Silva,26 149 | Watford,183,Quique Sánchez Flores,45 150 | Watford,240,Walter Mazzarri,40 151 | West Brom,8,Alan Irvine,17 152 | West Brom,9,Alan Pardew,8 153 | West Brom,52,Darren Moore,11 154 | West Brom,81,Gary Megson,2 155 | West Brom,123,Keith Downing,6 156 | West Brom,176,Pepe Mel,15 157 | West Brom,192,Roberto Di Matteo,26 158 | West Brom,199,Roy Hodgson,67 159 | West Brom,214,Steve Clarke,64 160 | West Brom,233,Tony Mowbray,32 161 | West Brom,235,Tony Pulis,125 162 | West Ham,6,Alan Curbishley,6 163 | West Ham,21,Avram Grant,33 164 | West Ham,58,David Moyes,33 165 | West Ham,85,Gianfranco Zola,80 166 | West Ham,130,Kevin Keen,0 167 | West Ham,139,Manuel Pellegrini,52 168 | West Ham,205,Sam Allardyce,133 169 | West Ham,210,Slaven Bilić,116 170 | Wigan,194,Roberto Martínez,157 171 | Wigan,213,Steve Bruce,45 172 | Wolves,152,Mick McCarthy,99 173 | Wolves,162,Nuno Espírito Santo,57 174 | -------------------------------------------------------------------------------- /lecture04-data-munging/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 04: Data Munging with dplyr 2 | 3 | ## Motivation 4 | 5 | Before analyzing the data, data analysts spend a lot of time organizing, 
managing, and cleaning it to prepare it for analysis. This is called data wrangling or data munging. It is often said that 80 percent of data analysis time is spent on these tasks. Data wrangling is an iterative process: we usually start by organizing and cleaning our data, then start doing the analysis, and then go back to the cleaning process as problems emerge during analysis. 6 | 7 | Here we introduce students to a (relatively) easy way of carrying out this task and use the case study of [finding a good deal among hotels](https://gabors-data-analysis.com/casestudies/#ch02a-finding-a-good-deal-among-hotels-data-preparation). After the initial data preparation, we continue working towards finding hotels that are underpriced relative to their location and quality. In this lecture, we illustrate how to find problems with observations and variables and how to solve those problems. 8 | 9 | ## This Lecture 10 | 11 | This lecture introduces students to how to manipulate raw data in various ways with `dplyr` from `tidyverse`. 12 | 13 | This lecture is based on [Chapter 02, A: Finding a good deal among hotels: data preparation](https://gabors-data-analysis.com/casestudies/#ch02a-finding-a-good-deal-among-hotels-data-preparation). 14 | 15 | 16 | ## Learning outcomes 17 | After successfully completing [`data_munging.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture04-data-munging/raw_codes/data_munging.R), students should be able to: 18 | 19 | - Add variables 20 | - Separate a character variable into two (or more) variables with `separate` 21 | - Convert different types of variables to specific types: 22 | - character to numeric 23 | - character to factor -> understanding the `factor` variable type 24 | - Further string manipulations (`gsub` and regular expressions) 25 | - Rename variables with `rename` 26 | - Filter out different observations with `filter` 27 | - select observations with specific values 28 | - tabulate different values of a variable with `table` 29 | - filter out missing values 30 | - replace specific values with others 31 | - handle duplicates with `duplicated` 32 | - use pipes `%>%` to do multiple manipulations at once 33 | - sort data in ascending or descending order according to a specific variable with `arrange` 34 | 35 | ## Datasets used 36 | * [Hotels Europe](https://gabors-data-analysis.com/datasets/#hotels-europe) 37 | 38 | 39 | ## Lecture Time 40 | 41 | Ideal overall time: **40-60 mins**. 42 | 43 | Showing [`data_munging.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture04-data-munging/raw_codes/data_munging.R) takes around *30 minutes* while doing the tasks would take the rest. 44 | 45 | 46 | ## Homework 47 | 48 | *Type*: quick practice, approx 10 mins 49 | 50 | Use the same [hotel-europe data from OSF](https://osf.io/r6uqb/), but now 51 | - Download both `hotels-europe_price.csv` and `hotels-europe_features.csv` 52 | - `left_join` them in this order by `hotel_id` 53 | - filter for: 54 | - time: 2018/01 and weekend == 1 55 | - city: Vienna or London.
Hint: for multiple matches, use something like: 56 | ```r 57 | city %in% c('City_A','City_B') 58 | ``` 59 | - accommodation should be Apartment, 3-4 stars (only) with more than 10 reviews 60 | - price is less than $600 61 | - arrange the data in ascending order by price 62 | 63 | ## Further material 64 | 65 | - More materials on the case study can be found in Gabor's [da_case_studies repository](https://github.com/gabors-data-analysis/da_case_studies): [ch02-hotels-data-prep](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch02-hotels-data-prep/ch02-hotels-data-prep.R) 66 | - Hadley Wickham and Garrett Grolemund: R for Data Science: [Chapter 5](https://r4ds.had.co.nz/transform.html) provides an overview of the types of variables, selecting, filtering, and arranging, along with others. [Chapter 15](https://r4ds.had.co.nz/factors.html) provides further material on factors. [Chapter 18](https://r4ds.had.co.nz/pipes.html) discusses pipes in various applications. 67 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 3](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/03_1d_data.Rmd) is relevant for factors, but includes much more. [Lecture 6](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/06_slicing_dicing.Rmd) introduces similar manipulations with tibble. 68 | - Grant McDermott: Data Science for Economists, Course material, [Lecture 5](https://github.com/uo-ec607/lectures/blob/master/05-tidyverse/05-tidyverse.pdf) is a nice overview of tidyverse with easy data manipulations. 69 | 70 | 71 | ## Folder structure 72 | 73 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture04-data-munging/raw_codes) includes one code, which is ready to use during the course but requires some live coding in class. 74 | - [`data_munging.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture04-data-munging/raw_codes/data_munging.R) 75 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture04-data-munging/complete_codes) includes one code with solutions: 76 | - [`data_munging_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture04-data-munging/complete_codes/data_munging_fin.R) solution for: [`data_munging.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture04-data-munging/raw_codes/data_munging.R) 77 | -------------------------------------------------------------------------------- /lecture06-rmarkdown101/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 06: Introduction to RMarkdown 2 | 3 | ## Motivation 4 | 5 | You want to know and articulate whether online and offline prices differ in your country for products that are sold in both ways. You have access to data on a sample of products with their online and offline prices. How would you use this data to establish whether prices tend to be different or the same for all products? 6 | 7 | We introduce RMarkdown, a powerful tool to create R-based reports in many formats that can be easily updated. By the end of this lecture, students should know how to put together a descriptive report with hypothesis testing and simple formatting. 8 | 9 | 10 | ## This lecture 11 | 12 | This lecture introduces students to *RMarkdown*, which is a great tool to create reports in pdf or Html.
The aim of this session is to prepare students to create a simple report in pdf or Html on a descriptive analysis. This lecture uses the exploratory analysis of [lecture05-data-exploration](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture05-data-exploration) and puts it into an RMarkdown document. 13 | 14 | Case studies connected to this lecture are similar to [lecture05-data-exploration](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture05-data-exploration), but this lecture focuses on how to create a report and does not cover patterns of associations. 15 | - [Chapter 03, A: Finding a good deal among hotels: data exploration](https://gabors-data-analysis.com/casestudies/#ch03a-finding-a-good-deal-among-hotels-data-exploration) - emphasis on one-variable descriptive analysis, different data 16 | - [Chapter 06, A: Comparing online and offline prices: testing the difference](https://gabors-data-analysis.com/casestudies/#ch06a-comparing-online-and-offline-prices-testing-the-difference) - focuses on hypothesis testing; associations and one-variable descriptives are not emphasized. 17 | 18 | 19 | ## Learning outcomes 20 | After completing [`report_bpp.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/raw_codes/report_bpp.Rmd) students should be able to: 21 | 22 | - Knit Rmd documents into Html and pdf 23 | - Understanding the structure of an RMarkdown file: 24 | - YAML header, chunks of (R) code surrounded by ``` and text mixed with simple formatting 25 | - Headers of R code chunks: 26 | - Use general commands, such as `include`, `echo`, `warning` or `eval` 27 | - Text formatting: 28 | - sections and sub-sections 29 | - bulleted, numbered and nested lists 30 | - bold and italic 31 | - add plain and embedded url 32 | - in-line reported code values 33 | - simple Greek letters 34 | - color text (in pdf) 35 | - Reporting descriptive statistics with `modelsummary` and `kableExtra` packages 36 | - rename the reported variable names 37 | - add caption and notes 38 | - set the position of the table with `kable_styling()` 39 | - `kable` to report a `tibble` 40 | - add column (or row) names 41 | - add caption 42 | - report in pdf, setting the position and converting the format theme 43 | - report in Html, setting the position and changing the format theme 44 | - Report a `ggplot2` object 45 | - set the size of the plot with `fig.width` and `fig.height` 46 | - align the plot with `fig.align` and `fig.pos` with the `float` package in YAML 47 | - add caption 48 | - set plot labels, theme, etc. to fit the formatting 49 | 50 | ## Datasets used 51 | * [Billion Prices](https://gabors-data-analysis.com/datasets/#billion-prices) 52 | 53 | 54 | ## Lecture Time 55 | 56 | Ideal overall time: **20-40 mins**. 57 | 58 | Showing [`report_bpp.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/raw_codes/report_bpp.Rmd) takes around *20-30 minutes* while doing the tasks would take the rest. 59 | 60 | Issues with RMarkdown knitting should be resolved by now.
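As a minimal sketch of the chunk-header options and in-line reporting listed above (the chunk label, object name, and values are illustrative):

````
```{r price-summary, echo=FALSE, warning=FALSE}
# the chunk is evaluated and its output is kept in the report,
# but echo=FALSE hides the code and warning=FALSE hides warnings
avg_price <- mean(c(90, 110, 130))
```

The average price is `r avg_price` euros.
````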
61 | 62 | 63 | ## Homework 64 | 65 | *Type*: quick practice, approx 15 mins 66 | 67 | Use the [hotel-europe data from OSF](https://osf.io/r6uqb/) and filter to have: 68 | - Time: year 2017, November and `weekday = 0` 69 | - Cities: London and Vienna 70 | - Accommodation: 3-4 star hotels 71 | 72 | Create a max 2-page report in pdf **and** Html, where you 73 | - describe the data filtering you have done with a list 74 | - show a histogram of the prices 75 | - report a descriptive table for the prices grouped by city 76 | - and carry out a simple t-test to decide if the mean prices in the two cities are the same. Hint: `t.test(price ~ city, data)` would compare the prices in the two cities. 77 | - draw a conclusion in text with Greek letters and in-line code. 78 | 79 | Note: there is no need for a comprehensive argument here; focus rather on the coding and pretty-reporting part. 80 | 81 | ## Further material 82 | 83 | - Hadley Wickham and Garrett Grolemund: R for Data Science: [Chapter 27](https://r4ds.had.co.nz/r-markdown.html) reviews the basics of RMarkdown such as chunks, general setup, problem-solving, and citation. [Chapter 29](https://r4ds.had.co.nz/r-markdown-formats.html) shows different types of outputs that are not covered in this lecture but can be handy. 84 | - [Yihui Xie, Christophe Dervieux, Emily Riederer: R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/) is a detailed book on all RMarkdown topics and issues. 85 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch06-online-offline-price-test](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch06-online-offline-price-test) 86 | 87 | ## Folder structure 88 | 89 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture06-rmarkdown101/raw_codes) includes one RMarkdown file, which is ready to use during the course but requires some live coding in class.
90 | - [`report_bpp.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/raw_codes/report_bpp.Rmd) 91 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture06-rmarkdown101/complete_codes) includes 92 | - [`report_bpp_fin.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.Rmd) RMarkdown file with the solution for: [`report_bpp.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/raw_codes/report_bpp.Rmd) 93 | - [`report_bpp_fin.pdf`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.pdf) is the generated pdf from [`report_bpp_fin.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.Rmd) 94 | - [`report_bpp_fin.html`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.html) is the generated Html from [`report_bpp_fin.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.Rmd) 95 | -------------------------------------------------------------------------------- /lecture06-rmarkdown101/complete_codes/report_bpp_fin.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Report on Billion Price Project" 3 | author: 'Name of author' 4 | output: 5 | # html_document 6 | pdf_document: 7 | extra_dependencies: ["float"] 8 | 9 | --- 10 | 11 | ```{r setup, include=FALSE} 12 | rm(list = ls()) 13 | # Here come the packages 14 | library(tidyverse) 15 | library(modelsummary) 16 | library(kableExtra) 17 | ``` 18 | 19 | ## Introduction 20 | 21 | This is a report on *The Billion Price Project*. 22 | 23 | HERE COMES THE MOTIVATION WHY THIS IS A MEANINGFUL PROJECT AND WHAT IS THE MAIN GOAL! 24 | 25 | For more details on the project see: <http://www.thebillionpricesproject.com/> or [this embedded link](http://www.thebillionpricesproject.com/). 26 | 27 | ## Data 28 | 29 | ```{r data import, include=FALSE} 30 | # Import data and basic data munging 31 | bpp_orig <- read_csv('https://osf.io/yhbr5/download') 32 | 33 | bpp_orig <- bpp_orig %>% mutate(p_diff = price_online - price) 34 | 35 | bpp <- bpp_orig %>% 36 | filter(is.na(sale_online)) %>% 37 | filter(!is.na(price)) %>% 38 | filter(!is.na(price_online)) %>% 39 | filter(PRICETYPE == 'Regular Price') %>% 40 | filter(price < 1000) %>% 41 | filter(p_diff < 500 & p_diff > -500) 42 | 43 | ``` 44 | 45 | HERE COMES A DETAILED EXPLANATION ABOUT WHERE THE DATA COMES FROM AND IF IT IS REPRESENTATIVE OR NOT. 46 | 47 | Our main interest is whether online prices are lower or higher than simple retail store prices. We investigated the data on the collected prices and we have the following descriptive statistics on online prices, in-store prices, and their differences.
48 | 49 | ```{r data descriptive, echo=FALSE} 50 | 51 | # Descriptive statistics with pretty names 52 | P95 <- function(x){ quantile(x, 0.95, na.rm=T) } 53 | P05 <- function(x){ quantile(x, 0.05, na.rm=T) } 54 | 55 | # Use datasummary: 56 | # - rewrite names to human readable 57 | # - add title and notes 58 | # - fix position with kable_styling() 59 | datasummary((`Retail` = price) + (`Online` = price_online) + (`Price difference` = p_diff) ~ 60 | Mean + Median + SD + Min + Max + P05 + P95, 61 | data = bpp, 62 | title = 'Descriptive statistics of prices', 63 | notes = 'Data are available from: https://osf.io/yhbr5/') %>% 64 | kableExtra::kable_styling(latex_options = 'hold_position') 65 | ``` 66 | 67 | The number of observations is `r sum(!is.na(bpp$price))` for all of our key variables. 68 | 69 | DESCRIPTION OF THE SUMMARY STATS: WHAT CAN WE LEARN FROM THEM? 70 | 71 | As the focus is the price difference, the next Figure shows the distribution of this variable. 72 | 73 | ```{r data hist, echo=FALSE, warning=FALSE, fig.width=3, fig.height = 2, fig.align="center", fig.cap='Distribution of price differences', fig.pos = 'H' } 74 | 75 | # Add plot: in header, specify the figure size and alignment 76 | # 77 | # add simple plot: take care of labels and limits (and theme) 78 | ggplot(data = bpp) + 79 | geom_density(aes(x = p_diff), fill = 'navyblue') + # note: geom_density has no 'bins' argument 80 | labs(x = 'Price differences', 81 | y = 'Relative Frequency') + 82 | # following commands will be covered in more detail in lecture-07-ggplot-indepth 83 | xlim(-4,4) + # limits for x-axis 84 | theme_bw() + # add a built-in theme 85 | theme(axis.text = element_text(size = 8), # change the font size of axis text/numbers 86 | axis.title = element_text(size = 8)) # change the font size of axis titles 87 | ``` 88 | 89 | DESCRIPTION OF THE FIGURE. WHAT DOES IT TELL US? 90 | 91 | (You may change the order of the descriptive stats and the graph.) 92 | 93 | ## Testing Price Differences 94 | 95 | ```{r test, echo = FALSE } 96 | 97 | test_out <- t.test(bpp$p_diff, mu = 0) 98 | 99 | ``` 100 | 101 | We test the hypothesis that the price difference is zero, i.e. that there is no difference between retail and online prices: 102 | 103 | $$H_0:=\text{price online} - \text{price retail} = 0$$ $$H_A:=\text{price online} - \text{price retail} \neq 0$$ Running a two-sided t-test, we have the t-statistic as `r round(test_out$statistic, 2)` and the p-value as `r round(test_out$p.value, 2)`. The 95% confidence intervals are: `r round(test_out$conf.int[1], 2)` and `r round(test_out$conf.int[2], 2)`. **Based on these results, with 95% confidence we can reject the hypothesis that the two prices would be the same in this particular sample.** 104 | 105 | ## Robustness check / 'Heterogeneity analysis' 106 | 107 | Task: 108 | 109 | - calculate and report t-tests for each country. 110 | - You should report: 111 | - country, 112 | - mean of price differences 113 | - standard errors of the mean for price differences 114 | - number of observations in each country 115 | - t-statistic 116 | - p-value. 117 | 118 | Hints: 119 | 120 | 1. use 'kable()' and, to hold the table position, define the following argument: 'position = "H"' 121 | 2. Take care of the caption, the number of digits you use, and the names of the variables you report! 122 | 3. You may check how the output changes if you use the 'booktabs = TRUE' input for kable! 123 | 4.
In case of html output use something like: 124 | 125 | 126 | ```{r, eval=FALSE} 127 | kable(..., 128 | 'html', booktabs = F, position = 'H') %>% 129 | kable_classic(full_width = F, html_font = 'Cambria') 130 | ``` 131 | 132 | ```{r test countries, echo = FALSE } 133 | 134 | # When creating a factor, it will use the sorted values: 135 | levels_fac <- sort(unique(bpp$COUNTRY)) 136 | # If you want to have pretty variable names when grouping, you need to use the 'labels' input! 137 | # create a new variable for that: 138 | labs <- c('Brazil','China','Germany','Japan','South Africa','USA') 139 | # Note: the order must be the same as in `levels_fac`! 140 | bpp$country <- factor(bpp$COUNTRY, labels = labs) 141 | # Multiple hypothesis testing 142 | testing <- bpp %>% 143 | select(country, p_diff) %>% 144 | group_by(country) %>% 145 | summarise(mean_pdiff = mean(p_diff), 146 | se_pdiff = 1/sqrt(n())*sd(p_diff), 147 | num_obs = n()) 148 | 149 | # Testing in R is easy if one understands the theory! 150 | testing <- mutate(testing, t_stat = mean_pdiff / se_pdiff) 151 | # two-sided p-value from the t distribution 152 | testing <- mutate(testing, p_val = 2 * pt(-abs(t_stat), df = num_obs - 1)) 153 | testing <- mutate(testing, p_val = round(p_val, digits = 4)) 154 | 155 | # Report a tibble or any other dataframe/matrix variable 156 | kable(testing, digits = 4, 157 | caption = 'Online and retail price differences by countries and t-tests', 158 | col.names = c('Country','Mean','SE','Num.Obs','t-stat','p-val'), 159 | 'latex', booktabs = TRUE, position = 'H') # Comment out if html 160 | #'html', booktabs = F, position = 'H') %>% # Uncomment if html 161 | # kable_classic(full_width = F, html_font = 'Cambria') 162 | 163 | ``` 164 | 165 | Extra: In words, select those countries where you can reject the null hypothesis that the prices are the same. With the command '\textcolor{red}{this is red}' you can highlight these countries! 166 | 167 | Countries where, with 95% confidence (or at a 5% significance level), we can reject the null hypothesis that the prices are the same, hence retail and online prices might differ: \textcolor{red}{`r testing$country[ testing$p_val < 0.05]`} 168 | 169 | ## Conclusion 170 | 171 | HERE COMES WHAT WE HAVE LEARNED AND WHAT WOULD STRENGTHEN AND WEAKEN OUR ANALYSIS. 172 | -------------------------------------------------------------------------------- /lecture06-rmarkdown101/complete_codes/report_bpp_fin.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture06-rmarkdown101/complete_codes/report_bpp_fin.pdf -------------------------------------------------------------------------------- /lecture06-rmarkdown101/raw_codes/report_bpp.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Report on Billion Price Project" 3 | author: 'Name of author' 4 | output: 5 | pdf_document: 6 | extra_dependencies: ["float"] 7 | # 8 | # html_document 9 | --- 10 | 11 | ```{r setup, include=FALSE} 12 | rm(list = ls()) 13 | # Here come the packages 14 | library(tidyverse) 15 | library(modelsummary) 16 | library(kableExtra) 17 | ``` 18 | 19 | ## Introduction 20 | 21 | This is a report on *The Billion Price Project*. 22 | 23 | HERE COMES THE MOTIVATION WHY THIS IS A MEANINGFUL PROJECT AND WHAT IS THE MAIN GOAL! 24 | 25 | For more details on the project see: <http://www.thebillionpricesproject.com/> or [this embedded link](http://www.thebillionpricesproject.com/).
26 | 27 | ## Data 28 | 29 | ```{r data import, include=FALSE} 30 | # Import data and basic data munging 31 | bpp_orig <- read_csv('https://osf.io/yhbr5/download') 32 | 33 | bpp_orig <- bpp_orig %>% mutate(p_diff = price_online - price) 34 | 35 | bpp <- bpp_orig %>% 36 | filter(is.na(sale_online)) %>% 37 | filter(!is.na(price)) %>% 38 | filter(!is.na(price_online)) %>% 39 | filter(PRICETYPE == 'Regular Price') %>% 40 | filter(price < 1000) %>% 41 | filter(p_diff < 500 & p_diff > -500) 42 | 43 | ``` 44 | 45 | HERE COMES A DETAILED EXPLANATION ABOUT WHERE THE DATA COMES FROM AND IF IT IS REPRESENTATIVE OR NOT. 46 | 47 | Our main interest is whether online prices are lower or higher than simple retail store prices. We investigated the data on the collected prices and we have the following descriptive statistics on online prices, in-store prices, and their differences. 48 | 49 | ```{r data descriptive, echo=FALSE} 50 | 51 | # Descriptive statistics with pretty names 52 | P95 <- function(x){ quantile(x, 0.95, na.rm=T) } 53 | P05 <- function(x){ quantile(x, 0.05, na.rm=T) } 54 | 55 | # Use datasummary: 56 | # - rewrite names to human readable 57 | # - add title and notes 58 | # - fix position with kable_styling() 59 | datasummary((`Retail` = price) + (`Online` = price_online) + (`Price difference` = p_diff) ~ 60 | Mean + Median + SD + Min + Max + P05 + P95, 61 | data = bpp, 62 | title = 'Descriptive statistics of prices', 63 | notes = 'Data are available from: https://osf.io/yhbr5/') %>% 64 | kableExtra::kable_styling(latex_options = 'hold_position') 65 | ``` 66 | 67 | The number of observations is `r sum(!is.na(bpp$price))` for all of our key variables. 68 | 69 | DESCRIPTION OF THE SUMMARY STATS: WHAT CAN WE LEARN FROM THEM? 70 | 71 | As the focus is the price difference, the next Figure shows the distribution of this variable. 72 | 73 | ```{r data hist, echo=FALSE, warning=FALSE, fig.width=3, fig.height = 2, fig.align="center", fig.cap='Distribution of price differences', fig.pos = 'H' } 74 | 75 | # Add plot: in header, specify the figure size and alignment 76 | # 77 | # add simple plot: take care of labels and limits (and theme) 78 | ggplot(data = bpp) + 79 | geom_density(aes(x = p_diff), fill = 'navyblue') + # note: geom_density has no 'bins' argument 80 | labs(x = 'Price differences', 81 | y = 'Relative Frequency') + 82 | # following commands will be covered in more detail in lecture-07-ggplot-indepth 83 | xlim(-4,4) + # limits for x-axis 84 | theme_bw() + # add a built-in theme 85 | theme(axis.text = element_text(size = 8), # change the font size of axis text/numbers 86 | axis.title = element_text(size = 8)) # change the font size of axis titles 87 | ``` 88 | 89 | DESCRIPTION OF THE FIGURE. WHAT DOES IT TELL US? 90 | 91 | (You may change the order of the descriptive stats and the graph.) 92 | 93 | ## Testing Price Differences 94 | 95 | ```{r test, echo = FALSE } 96 | 97 | test_out <- t.test(bpp$p_diff, mu = 0) 98 | 99 | ``` 100 | 101 | We test the hypothesis that the price difference is zero, i.e. that there is no difference between retail and online prices: 102 | 103 | $$H_0:=\text{price online} - \text{price retail} = 0$$ $$H_A:=\text{price online} - \text{price retail} \neq 0$$ Running a two-sided t-test, we have the t-statistic as `r round(test_out$statistic, 2)` and the p-value as `r round(test_out$p.value, 2)`. The 95% confidence intervals are: `r round(test_out$conf.int[1], 2)` and `r round(test_out$conf.int[2], 2)`.
**Based on these results, with 95% confidence we can reject the hypothesis that the two prices would be the same in this particular sample.** 104 | 105 | ## Robustness check / 'Heterogeneity analysis' 106 | 107 | Task: 108 | 109 | - calculate and report t-tests for each country. 110 | - You should report: 111 | - country, 112 | - mean of price differences 113 | - standard errors of the mean for price differences 114 | - number of observations in each country 115 | - t-statistic 116 | - p-value. 117 | 118 | Hints: 119 | 120 | 1. use 'kable()' and, to hold the table position, define the following argument: 'position = "H"' 121 | 2. Take care of the caption, the number of digits you use, and the names of the variables you report! 122 | 3. You may check how the output changes if you use the 'booktabs = TRUE' input for kable! 123 | 4. In case of html output use something like: 124 | 125 | 126 | 127 | ```{r, eval=FALSE} 128 | kable(..., 129 | 'html', booktabs = F, position = 'H') %>% 130 | kable_classic(full_width = F, html_font = 'Cambria') 131 | ``` 132 | 133 | 134 | 135 | Extra: In words, select those countries where you can reject the null hypothesis that the prices are the same. With the command '\textcolor{red}{this is red}' you can highlight these countries! 136 | 137 | Countries where, with 95% confidence (or at a 5% significance level), we can reject the null hypothesis that the prices are the same, hence retail and online prices might differ: 138 | 139 | Task: put here the country names in red with p-values less than 5%. 140 | 141 | 142 | ## Conclusion 143 | 144 | HERE COMES WHAT WE HAVE LEARNED AND WHAT WOULD STRENGTHEN AND WEAKEN OUR ANALYSIS. 145 | -------------------------------------------------------------------------------- /lecture07-ggplot-indepth/raw_codes/homework_ggpplot_runfile.R: -------------------------------------------------------------------------------- 1 | ######################### 2 | # # 3 | # Lecture 07 # 4 | # # 5 | # Runner for # 6 | # assignment # 7 | # # 8 | # Deadline: # 9 | # # 10 | # # 11 | # # 12 | ######################### 13 | 14 | ## 15 | # Create your own theme for ggplot! 16 | # 17 | # 0) Clear your environment 18 | 19 | # 1) Load tidyverse 20 | 21 | # 2) use the same data with the same filter as in class! 22 | 23 | # 3) Call your personalized ggplot function 24 | 25 | # 4) Run the following command: 26 | ggplot(filter(df, city == 'Vienna'), aes(x = price)) + 27 | geom_histogram(alpha = 0.8, binwidth = 20) + 28 | labs(x = 'Hotel Prices in Vienna', y = 'Density') + 29 | theme_YOURFUNCTIONNAME() 30 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /lecture07-ggplot-indepth/raw_codes/theme_RENAMEME.R: -------------------------------------------------------------------------------- 1 | ######################### 2 | # # 3 | # Assignment # 4 | # Lecture 07 # 5 | # # 6 | # Deadline: # 7 | # # 8 | ######################### 9 | 10 | ## 11 | # Create your own theme for ggplot! 12 | # In principle you should use this ggplot theme in the remainder of the course for assignments etc. 13 | # Of course, you can change it along the way, but I would like to encourage all of you to use a personalized theme! 14 | # 15 | # !! Please RENAME this file and call it accordingly in the runfile !! 16 | # 17 | # To get 7 points you will need to modify at least 7 parameters of theme_classic or theme_bw!
18 | # 19 | # Useful resources you may want to check: 20 | # https://www.datanovia.com/en/blog/ggplot-themes-gallery/ 21 | # https://ggplot2.tidyverse.org/reference/theme.html 22 | # Or the book's theme: 23 | # https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch00-tech-prep/theme_bg.R 24 | # Some more advanced/elaborated examples: 25 | # https://bookdown.org/rdpeng/RProgDA/building-a-new-theme.html 26 | # https://towardsdatascience.com/5-steps-for-creating-your-own-ggplot-theme-656e79a96b9 27 | # 28 | # and many more.... -------------------------------------------------------------------------------- /lecture07-ggplot-indepth/raw_codes/theme_bluewhite.R: -------------------------------------------------------------------------------- 1 | ####################################################### 2 | # # 3 | # Lecture 07 # 4 | # # 5 | # ggplot in-depth # 6 | # `theme_bluewhite()` # 7 | # # 8 | # first external function: # 9 | # creating your own theme # 10 | # # 11 | # For complete list of theme options # 12 | # see: # 13 | # https://ggplot2.tidyverse.org/reference/theme.html # 14 | # # 15 | ####################################################### 16 | 17 | theme_bluewhite <- function(base_size = 11, base_family = '') { 18 | # Inherit the basic properties of theme_bw 19 | theme_bw() %+replace% 20 | # Replace the following items: 21 | theme( 22 | # The grids on the background 23 | panel.grid.major = element_line(color = 'white'), 24 | # The background color 25 | panel.background = element_rect(fill = 'lightblue'), 26 | # The axis line 27 | axis.line = element_line(color = 'red'), 28 | # Little lines called ticks on the axis 29 | axis.ticks = element_line(color = 'steelblue'), 30 | # Color and font size for the numbers on the axis 31 | axis.text = element_text(color = 'navyblue', size = 8) 32 | ) 33 | } 34 | -------------------------------------------------------------------------------- /lecture08-conditionals/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 08: Conditional Programming 2 | 3 | ## Motivation 4 | 5 | Deciding what to do on a case-by-case basis is common in decision making and also in programming. Conditional programming enables writing code with this in mind: if a certain condition holds, execute a command; otherwise, do something different. Conditional programming is a basic programming technique that emerges in many situations. Adding this technique to the programming toolbox is a must for data scientists. 6 | 7 | ## This lecture 8 | 9 | This lecture introduces students to conditional programming with `if-else` statements. It covers the essentials, as well as logical operations with vectors, creating new variables with conditionals, and some extra material.
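For a quick taste of the anatomy discussed below, a minimal sketch of a conditional on a single value and on a vector (the values are illustrative):

```r
x <- -5
if (x > 0) {
  print('positive number')
} else {
  print('non-positive number')
}

v <- c(-1, 0, 3)
any(v > 0)  # TRUE: at least one element is positive
all(v > 0)  # FALSE: not every element is positive
```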
10 | 11 | 12 | ## Learning outcomes 13 | After successfully live-coding the material (see: [`conditionals.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture08-conditionals/conditionals.md)), students will have knowledge of 14 | 15 | - How a conditional statement works 16 | - What the crucial elements of an `if-else` statement are 17 | - Good practices for writing a conditional 18 | - How multiple conditions work 19 | - single-valued variables with multiple conditions 20 | - vector variables with conditions 21 | - with vectors: 22 | - understanding the differences between `|`, `||`, `&`, `&&`, `any()` and `all()` 23 | - understanding pairwise comparison of vectors 24 | - understanding different levels of evaluation of logical operators. 25 | - creating a new variable with conditionals 26 | - base R method with logicals 27 | - `ifelse()` function with `tidyverse` 28 | - extra material 29 | - conditional installation of packages 30 | - spacing and formatting the `if-else` statements 31 | - `xor` function 32 | - `switch` statement 33 | 34 | ## Datasets used 35 | 36 | - [wms-management](https://gabors-data-analysis.com/datasets/#wms-management-survey) 37 | 38 | ## Lecture Time 39 | 40 | Ideal overall time: **10-20 mins**. 41 | 42 | This is a relatively short lecture, and it can be even shorter if logical operators with vectors are neglected, although a good understanding of the anatomy of an `if-else` statement is important. 43 | 44 | ## Homework 45 | 46 | *Type*: quick practice, approx 15 mins, together with [lecture09-loops](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture09-loops), [lecture10-random-numbers](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture10-random-numbers), and [lecture11-functions](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture11-functions). 47 | 48 | Check the common homework [here](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/README.md). 49 | 50 | ## Further material 51 | 52 | - Hadley Wickham and Garrett Grolemund R for Data Science [Chapter 19.4](https://r4ds.had.co.nz/functions.html) provides further material on conditionals. 53 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 10](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/10_functional_programming.Rmd) provides useful guidelines on conditionals along with other programming skills. 54 | 55 | 56 | ## File structure 57 | 58 | - [`conditionals.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture08-conditionals/conditionals.md) provides material for the live coding session with explanations.
59 | - [`conditionals.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture08-conditionals/conditionals.Rmd) is the generating Rmarkdown file for `conditionals.md` 60 | - [`conditionals.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture08-conditionals/conditionals.R) is a possible realization of the live coding session 61 | -------------------------------------------------------------------------------- /lecture08-conditionals/conditionals.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## ## 3 | ## Lecture 08 ## 4 | ## ## 5 | ## Conditional Programming ## 6 | ## ## 7 | ## ## 8 | ################################## 9 | 10 | 11 | # Simple if-statement 12 | x <- 5 13 | if (x == 5){ 14 | print('x is equal to 5') 15 | } 16 | 17 | # Create an if-else statement 18 | x2 <- 4 19 | if (x2 == 5){ 20 | print('x2 is equal to 5') 21 | } else{ 22 | print('x2 is not equal to 5') 23 | } 24 | 25 | # Multi-branch if-else statement 26 | # play around with the value of x 27 | x <- -5 28 | if (x > 0){ 29 | print('positive number') 30 | } else if(x == 0){ 31 | print('zero value') 32 | } else{ 33 | print('negative number') 34 | } 35 | 36 | 37 | ##### 38 | # Multiple conditions 39 | 40 | # Multiple logical statements 41 | y <- 10 42 | if (x > 0 && y > 0){ 43 | print('x and y are positive numbers') 44 | } else{ 45 | print('one of y or x is non-positive') 46 | } 47 | 48 | ### 49 | # Conditional with one vector 50 | v <- c(0, 1, 10) 51 | 52 | # First, let's check if the elements of v are larger than 0 53 | v > 0 54 | 55 | # any or all functions 56 | if (any(v > 0)){ 57 | print('We have at least one element in v, which is larger than zero!') 58 | } else { 59 | print('All elements in v are smaller than zero!') 60 | } 61 | 62 | 63 | ### 64 | # Conditional with two or more vectors 65 | q <- c(2, 0, 8) 66 | 67 | # use of single-operators 68 | # note: `>` binds tighter than `|` or `&`, so these are evaluated as 69 | # v | (q > 0) and v & (q > 0), with v coerced to logical (non-zero = TRUE) 70 | v | q > 0 71 | v & q > 0 72 | 73 | # At this point we can check the differences between single-operators and double-operators 74 | # (note: since R 4.3, `||` and `&&` throw an error for vectors longer than one; 75 | # the examples below assume an older R version) 76 | v | q > 0 77 | v || q > 0 78 | 79 | # Using double-operators will imply `any()` for `||` and `all()` for `&&`: 80 | (v || q > 0) == any(v | q > 0) 81 | (v && q > 0) == all(v & q > 0) 82 | 83 | # be careful when using these operators with vectors, 84 | # as the results can be different if you mix these up, e.g. 85 | v && q > 0 86 | any(v & q > 0) 87 | 88 | ##### 89 | # Using conditionals when creating new variables 90 | 91 | # Import wms-management data 92 | library(tidyverse) 93 | wms <- read_csv('https://osf.io/uzpce/download') 94 | 95 | # Method 1: use base-R commands 96 | wms$firm_size <- NA_character_ 97 | wms$firm_size[ wms$emp_firm >= 1000 ] <- 'large' 98 | wms$firm_size[ wms$emp_firm < 1000 & wms$emp_firm >= 200 ] <- 'medium' 99 | wms$firm_size[ wms$emp_firm < 200 ] <- 'small' 100 | 101 | # Method 2: ifelse function 102 | wms <- wms %>% mutate(firm_size2 = ifelse(emp_firm >= 1000, 'large', 103 | ifelse(emp_firm < 1000 & emp_firm >= 200, 'medium', 104 | ifelse(emp_firm < 200, 'small', NA_character_)))) 105 | 106 | # Task: check they are the same: 107 | all(wms$firm_size == wms$firm_size2, na.rm = T) 108 | 109 | ###### 110 | # Extra material 111 | 112 | 113 | # Spacing and formatting 114 | 115 | if (x > 5){ print(' x > 5') } else { print('x <= 5') } 116 | # However, it is not recommended as it makes reading the code much harder.
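# Extra: a sketch of the same variable creation with dplyr's case_when(),
# which often reads more clearly than nested ifelse() calls
# (assumes wms is loaded as above; firm_size3 is just an illustrative name)
wms <- wms %>%
  mutate(firm_size3 = case_when(
    emp_firm >= 1000 ~ 'large',
    emp_firm >= 200  ~ 'medium',
    emp_firm < 200   ~ 'small'
  ))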
114 | 115 | 116 | # The xor() operator 117 | # xor takes two logical values/vectors as inputs and returns the elementwise exclusive-or. 118 | xor(c(T,F,F,T),c(T,T,F,F)) 119 | 120 | 121 | # `switch` statement 122 | type <- 'apple' 123 | switch(type, 124 | apple = 'I love apple!', 125 | banana = 'I love banana!', 126 | orange = 'I love orange!', 127 | stop('type must be either \'apple\', \'banana\', or \'orange\'') # stop() throws the error; there is no error() function in base R 128 | ) 129 | 130 | # try different values of type which are not in the listed options! 131 | -------------------------------------------------------------------------------- /lecture09-loops/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 09: Programming loops 2 | 3 | ## Motivation 4 | 5 | There are many cases when one needs to do repetitive coding: carry out the same commands but on a different object/data. Writing loops is one of the best tools to carry out such repetition with only a few modifications to the code. It also reduces code duplication, which has three main benefits: 6 | 7 | 1. It’s easier to see the intent of your code because your eyes are drawn to what’s different, not what stays the same. 8 | 2. It’s easier to respond to changes in requirements. As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied and pasted the code. 9 | 3. You’re likely to have fewer bugs because each line of code is used in more places. 10 | 11 | *([Hadley Wickham and Garrett Grolemund R for Data Science Ch. 21.1](https://r4ds.had.co.nz/iteration.html))* 12 | 13 | 14 | ## This lecture 15 | 16 | This lecture introduces students to imperative programming with `for` and `while` loops. Furthermore, it provides an exercise with the [sp500](https://gabors-data-analysis.com/datasets/#sp500) dataset to calculate yearly and monthly returns. 17 | 18 | The [Chapter 05, A: What likelihood of loss to expect on a stock portfolio?](https://gabors-data-analysis.com/casestudies/#ch05a-what-likelihood-of-loss-to-expect-on-a-stock-portfolio) case study was the starting point for developing the exercise. 19 | 20 | 21 | ## Learning outcomes 22 | After successfully live-coding the material (see: [`loops.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture09-loops/loops.md)), students will know 23 | 24 | - What imperative programming and functional programming are for iterations 25 | - What a for loop is 26 | - what the possible inputs for an iteration vector are 27 | - how to measure CPU time 28 | - what the possible issues with the for loop are 29 | - What a while loop is 30 | - what the possible drawbacks of a while loop are 31 | - how to use a for loop instead 32 | - `break` command 33 | - Calculate returns over different time periods. 34 | 35 | ## Datasets used 36 | 37 | - [sp500](https://gabors-data-analysis.com/datasets/#sp500) 38 | 39 | ## Lecture Time 40 | 41 | Ideal overall time: **10-20 mins**. 42 | 43 | This is a relatively short lecture, and it can be even shorter if measuring CPU time and/or the exercise are neglected. 44 | 45 | ## Homework 46 | 47 | *Type*: quick practice, approx 15 mins, together with [lecture08-conditionals](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture08-conditionals), [lecture10-random-numbers](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture10-random-numbers), and [lecture11-functions](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture11-functions).
48 | 49 | Check the common homework [here](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/README.md). 50 | 51 | ## Further material 52 | 53 | - More materials on the case study can be found in Gabor's da_case_studies repository: [ch05-stock-market-loss-generalize](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch05-stock-market-loss-generalize) 54 | - Hadley Wickham and Garrett Grolemund R for Data Science [Chapter 21](https://r4ds.had.co.nz/iteration.html) provides further material on iterations, covering both imperative and functional programming. 55 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 10](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/10_functional_programming.Rmd) provides useful guidelines on iterations along with other programming skills. 56 | 57 | 58 | ## File structure 59 | 60 | - [`loops.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture09-loops/loops.md) provides material for the live coding session with explanations. 61 | - [`loops.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture09-loops/loops.Rmd) is the generating Rmarkdown file for `loops.md` 62 | - [`loops.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture09-loops/loops.R) is a possible realization of the live coding session 63 | -------------------------------------------------------------------------------- /lecture09-loops/loops.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## ## 3 | ## Imperative Programming ## 4 | ## for and while loops ## 5 | ## ## 6 | ################################## 7 | 8 | rm(list = ls()) 9 | 10 | # Case 1) purest form of a for loop 11 | for (i in 1 : 5){ 12 | print(i) 13 | } 14 | 15 | # Case 2) 16 | for (i in seq(50, 58)){ 17 | print(i) 18 | } 19 | 20 | # Case 3) 21 | for (i in c(10,9,-10,8)){ 22 | print(i) 23 | } 24 | 25 | # Play around with lists 26 | for (i in list(2, 'a', TRUE, sqrt(2))){ 27 | print(i) 28 | } 29 | 30 | # Create a loop which gives the cumulative sum: 31 | v <- c(10, 6, 5, 32, 45, 23) 32 | cs_v <- v 33 | for (i in 2 : length(v)){ 34 | cs_v[ i ] <- cs_v[ i - 1 ] + cs_v[ i ] 35 | } 36 | v 37 | cs_v 38 | cumsum(v) 39 | 40 | # Also good to know 41 | seq_along(v) 42 | 43 | cs_v2 <- 0 44 | for (i in seq_along(v)){ 45 | cs_v2 <- cs_v2 + v[ i ] 46 | } 47 | cs_v2 48 | 49 | # Task: check if all the elements in cs_v are the same as 50 | # the result of cumsum(v), and if it is true 51 | # print out 'Good job!', otherwise: 'There is a mistake!'
52 | 53 | if (all(cs_v == cumsum(v))){ 54 | print('Good job!') 55 | } else { 56 | print('There is a mistake!') 57 | } 58 | 59 | ## Measure CPU time 60 | if (!require(tictoc)){ 61 | install.packages('tictoc') 62 | library(tictoc) 63 | } 64 | 65 | 66 | iter_num <- 10000 67 | 68 | # Sloppy way to do loops: 69 | tic('Sloppy way') 70 | q <- c() 71 | for (i in 1 : iter_num){ 72 | q <- c(q, i) 73 | } 74 | toc() 75 | 76 | # Proper way 77 | tic('Good way') 78 | r <- double(length = iter_num) 79 | for (i in 1 : iter_num){ 80 | r[ i ] <- i 81 | } 82 | toc() 83 | 84 | ## 85 | # While loop 86 | x <- 0 87 | while (x < 10) { 88 | x <- x + 1 89 | print(x) 90 | } 91 | x 92 | 93 | # Instead use a for loop with break 94 | max_iter <- 10000 95 | x <- 0 96 | flag <- FALSE 97 | for (i in 1 : max_iter){ 98 | if (x < 10){ 99 | x <- x + 1 100 | } else{ 101 | flag <- TRUE 102 | break 103 | } 104 | } 105 | x 106 | if (flag) { 107 | print('Successful iteration!') 108 | }else{ 109 | print('Did not satisfy the stopping criterion!') 110 | } 111 | 112 | #### 113 | # Exercise sp500 114 | library(tidyverse) 115 | # Load data 116 | sp500 <- read_csv('https://osf.io/h64z2/download', na = c('', '#N/A')) 117 | # Filter out missing and create a year variable 118 | sp500 <- sp500 %>% filter(!is.na(VALUE)) %>% 119 | mutate(year = format(DATE, '%Y')) 120 | 121 | # Get unique years and create tibble 122 | years <- unique(sp500$year) 123 | return_yearly <- tibble(years = years, return = NA) 124 | 125 | # Initialize: the last price of the first year 126 | aux <- sp500$VALUE[ sp500$year == years[ 1 ] ] 127 | lyp <- aux[ length(aux) ] 128 | rm(aux) 129 | # start from the second year in the data 130 | for (i in 2 : length(years)){ 131 | # get the values for specific year 132 | value_year_i <- sp500$VALUE[ sp500$year == years[ i ] ] 133 | # last day's price 134 | ldp <- value_year_i[ length(value_year_i) ] 135 | # calculate the return 136 | return_yearly$return[ i ] <- (ldp - lyp) / lyp * 100 137 | # save this year's last value as last year's value 138 | lyp <- ldp 139 | } 140 | 141 | # Check results 142 | return_yearly 143 | 144 | 145 | -------------------------------------------------------------------------------- /lecture10-random-numbers/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 10: Random Numbers and Random Sampling 2 | 3 | ## Motivation 4 | 5 | While dealing with data, the use of random numbers is essential to understanding modern data analytics. In many cases, you will not use them directly, but many advanced models (e.g. Machine Learning techniques) use them. A general understanding of how these methods work and what their limitations are is beneficial. Some examples of random number usage: 6 | 7 | - get a random (sub)-sample (e.g. cross-validation techniques) 8 | - bootstrapping (e.g. calculate standard errors) 9 | - estimating models (e.g. random forest, Markov-Chain-Monte-Carlo or (quasi) maximum-likelihood methods) 10 | - ‘stochastic’ optimization methods (e.g. genetic algorithms) 11 | 12 | We cover the main properties of random number generators and how to use them for reproducible results. 13 | 14 | ## This lecture 15 | 16 | This lecture introduces students to how to generate random numbers and deal with random sampling in R. 17 | 18 | Relates to case studies: 19 | - [Chapter 03, D: Distributions of body height and income](https://gabors-data-analysis.com/casestudies/#ch03d-distributions-of-body-height-and-income) to show random numbers generated from theoretical distributions vs empirical distributions.
20 | - [Chapter 05, A: What likelihood of loss to expect on a stock portfolio?](https://gabors-data-analysis.com/casestudies/#ch05a-what-likelihood-of-loss-to-expect-on-a-stock-portfolio) to show random sampling. 21 | 22 | 23 | ## Learning outcomes 24 | After successfully live-coding the material (see: [`random_numbers.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture10-random-numbers/random_numbers.md)), students will know 25 | 26 | - When and why random numbers are used in R 27 | - Different `distributions` and their properties available in R 28 | - Specifically get familiar with `runif`, `rnorm`, and `rlnorm` 29 | - How to control for randomness via `set.seed` 30 | - How the number of observations generated by `rnorm` is associated with the theoretical normal distribution 31 | - How random sampling works: 32 | - `sample_n` function 33 | - other alternatives such as `slice_sample`, `sample` and `sample.int` 34 | 35 | ## Datasets used 36 | 37 | - [height-income-distributions](https://gabors-data-analysis.com/datasets/#height-income-distributions) 38 | - [sp500](https://gabors-data-analysis.com/datasets/#sp500) 39 | 40 | ## Lecture Time 41 | 42 | Ideal overall time: **10-20 mins**. 43 | 44 | This is a relatively short lecture with a little coding, but much background knowledge is needed on how random number generation works. 45 | If you want to shorten the lecture, skip the exercise with the height and income distributions. 46 | 47 | ## Homework 48 | 49 | *Type*: quick practice, approx 15 mins, together with [lecture08-conditionals](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture08-conditionals), [lecture09-loops](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture09-loops), and [lecture11-functions](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture11-functions) 50 | 51 | Check the common homework [here](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/README.md). 52 | 53 | ## Further material 54 | 55 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 10](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/10_functional_programming.Rmd) touches the topic, but not too deeply. 56 | - There are some useful bookdown materials by [Ko Chiu Yu: Technical Analysis with R, Chapter 4.2](https://bookdown.org/kochiuyu/Technical-Analysis-with-R/random-number.html) and [Nathaniel D. Phillips: YaRrr! The Pirate’s Guide to R, Chapter 5.3](https://bookdown.org/ndphillips/YaRrr/generating-random-data.html) 57 | 58 | 59 | ## File structure 60 | 61 | - [`random_numbers.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture10-random-numbers/random_numbers.md) provides material for the live coding session with explanations.
62 | - [`random_numbers.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture10-random-numbers/random_numbers.Rmd) is the generating Rmarkdown file for `random_numbers.md` 63 | - [`random_numbers.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture10-random-numbers/random_numbers.R) is a possible realization of the live coding session 64 | -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## ## 3 | ## Random Numbers and ## 4 | ## Random Sampling in R ## 5 | ## ## 6 | ################################## 7 | 8 | # Clear memory and load packages 9 | rm(list=ls()) 10 | library(tidyverse) 11 | 12 | 13 | # 1) case uniform distribution random sampling 14 | n <- 10 15 | x <- runif(n, min = 0, max = 10) 16 | x 17 | 18 | # 2) Set the seed of the random number generator for reproducibility 19 | set.seed(123) 20 | x <- runif(n, min = 0, max = 10) 21 | 22 | rm(x, n) 23 | 24 | 25 | # Play around with n 26 | n <- 10000 27 | y <- rnorm(n, mean = 1, sd = 2) 28 | df <- tibble(var1 = y) 29 | ggplot(df, aes(x = var1)) + 30 | geom_histogram(aes(y = ..density..), fill = 'navyblue') + 31 | stat_function(fun = dnorm, args = list(mean = 1, sd = 2), 32 | color = 'red', size = 1.5) 33 | 34 | # There are some other types of distributions: 35 | # rbinom, rexp, rlnorm, etc. 36 | 37 | ### 38 | # Exercise with height-income distributions 39 | 40 | # Get data from OSF 41 | df <- read_csv('https://osf.io/rnuh2/download') 42 | # set height as numeric 43 | df <- df %>% mutate(height = as.numeric(height)) 44 | 45 | # Create an empirical histogram of height with the theoretical normal 46 | emp_height <- ggplot(df, aes(x = height)) + 47 | geom_histogram(aes(y = ..density..), binwidth = 0.03, 48 | fill = 'navyblue', alpha = 0.6) + 49 | stat_function(fun = dnorm, color = 'red', 50 | args = with(df, c(mean = mean(height, na.rm = T), sd = sd(height, na.rm = T)))) + 51 | labs(x='Height (meters)', y='Density') + 52 | theme_bw() 53 | 54 | emp_height 55 | 56 | # Calculate the log-normal parameters (meanlog, sdlog) from the empirical moments 57 | mu <- with(filter(df, hhincome < 1000), 58 | log(mean(hhincome)^2 / sqrt(var(hhincome) + mean(hhincome)^2))) 59 | sigma <- with(filter(df, hhincome < 1000), 60 | sqrt(log(var(hhincome) / mean(hhincome)^2 + 1))) 61 | 62 | emp_inc <- ggplot(filter(df, hhincome < 1000), aes(x = hhincome)) + 63 | geom_histogram(aes(y = ..density..), binwidth = 10, 64 | fill = 'navyblue', alpha = 0.6) + 65 | stat_function(fun = dlnorm, colour= 'red', 66 | args = c(meanlog = mu, sdlog = sigma)) + 67 | labs(x='Income (thousand $)', y='Density') + 68 | theme_bw() 69 | 70 | emp_inc 71 | 72 | # Generate artificial data 73 | set.seed(123) 74 | artif <- tibble(height_art = rnorm(nrow(df), mean(df$height, na.rm = T), 75 | sd = sd(df$height, na.rm = T)), 76 | inc_art = rlnorm(nrow(df), meanlog = mu, sdlog = sigma)) 77 | 78 | # Compare height 79 | emp_height + geom_histogram(data = artif, aes(x = height_art, y = ..density..), 80 | binwidth = 0.03, boundary = 1.3, 81 | fill = 'orange', alpha = 0.3) 82 | 83 | # Compare income 84 | emp_inc + geom_histogram(data = artif, aes(x = inc_art, y = ..density..), binwidth = 10, 85 | fill = 'orange', alpha = 0.3) + 86 | xlim(0,500) 87 | 88 | # Task: log-income 89 | 90 | # Create log income and its artificial counterpart as well 91 | set.seed(123) 92 | df <- df %>% mutate(lninc = ifelse(hhincome > 0, log(hhincome), 0), 93 |
lninc_art = rnorm(nrow(df), mean = mean(lninc, na.rm = T), 94 | sd = sd(lninc, na.rm = T))) 95 | 96 | ggplot(df) + 97 | geom_histogram(aes(x = lninc, y = ..density..), binwidth = 0.3, 98 | fill = 'navyblue', alpha = 0.6) + 99 | geom_histogram(aes(x = lninc_art, y = ..density..), binwidth = 0.3, 100 | fill = 'orange', alpha = 0.3) + 101 | stat_function(fun = dnorm, colour= 'red', 102 | args = with(df, c(mean = mean(lninc, na.rm = T), 103 | sd = sd(lninc, na.rm = T)))) + 104 | labs(x='Log-Income (thousand $)', y='Density') + 105 | theme_bw() 106 | 107 | 108 | 109 | ##### 110 | # Random sampling from a dataset/variable: 111 | 112 | sp500 <- read_csv('https://osf.io/h64z2/download') 113 | head(sp500) 114 | 115 | 116 | # Sample_1 is without replacement 117 | set.seed(123) 118 | sample_1 <- slice_sample(sp500, n = 1000, replace = F) 119 | head(sample_1) 120 | # Sample_2 with replacement -> useful for bootstrapping 121 | sample_2 <- slice_sample(sp500, n = 1000, replace = T) 122 | 123 | # alternatively: 124 | set.seed(123) 125 | sample_1a <- sample_n(sp500, 1000, replace = FALSE) 126 | set.seed(123) 127 | sample_1b <- tibble(VALUE = sample(sp500$VALUE, 1000, replace = FALSE)) 128 | set.seed(123) 129 | sample_1c <- sp500[sample.int(nrow(sp500), 1000, replace = FALSE),] 130 | # Note: sample.int() needs the total number of rows as its first argument; 131 | # sample.int(1000, replace = FALSE) alone would just permute the first 1000 rows of sp500. 132 | 133 | 134 | -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-3-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-3-1.png -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-4-.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-4-.gif -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-4-1.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-4-1.gif -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-6-1.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /lecture11-functions/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 11: Writing Functions 2 | 3 | ## Motivation 4 | 5 | One of the best ways to improve your reach as a data scientist is to write functions. Functions allow automating common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste: 6 | 7 | 1. You can give a function an evocative name that makes your code easier to understand. 8 | 2. As requirements change, you only need to update code in one place, instead of many. 9 | 3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another). 10 | 11 | Writing good functions is a lifetime journey. Even after using R for many years, one can still learn new techniques and better ways of approaching old problems. The goal is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately. ([Hadley Wickham and Garrett Grolemund: R for Data Science, Ch. 19](https://r4ds.had.co.nz/functions.html)) 12 | 13 | ## This lecture 14 | 15 | This lecture introduces functions, how they are structured and how to write them. Students will know how to write basic functions, control for input(s) and output(s), and error-handling. 16 | 17 | Case studies related to lecture: 18 | - [Chapter 05, A: What likelihood of loss to expect on a stock portfolio?](https://gabors-data-analysis.com/casestudies/#ch05a-what-likelihood-of-loss-to-expect-on-a-stock-portfolio) as homework to calculate bootstrap standard errors and calculate confidence intervals. 19 | - [Chapter 06, A: Comparing online and offline prices: testing the difference](https://gabors-data-analysis.com/casestudies/#ch06a-comparing-online-and-offline-prices-testing-the-difference) and [Chapter 06, B: Testing the likelihood of loss on a stock portfolio](https://gabors-data-analysis.com/casestudies/#ch06b-testing-the-likelihood-of-loss-on-a-stock-portfolio) as at the end of the lecture we build a function to show the distribution of t-statistics. 
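As a quick preview of the structure discussed in this lecture, a minimal sketch of a function with an input check, a pre-set input, and multiple outputs (hypothetical example, not from the lecture files; the live-coding session builds up each element step by step):

```r
my_stats <- function(x, level = 0.95){    # `level` is a pre-set input
  stopifnot(is.numeric(x))                # input check with an informative error
  mean_x <- mean(x, na.rm = TRUE)
  sd_x   <- sd(x, na.rm = TRUE)
  return(list(mean = mean_x, sd = sd_x))  # multiple outputs via a list
}
```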
20 | 21 | In addition to writing functions, it uses data from the case study [Chapter 04, A: Management quality and firm size: describing patterns of association](https://gabors-data-analysis.com/casestudies/#ch04a-management-quality-and-firm-size-describing-patterns-of-association). 22 | 23 | 24 | ## Learning outcomes 25 | After successfully live-coding the material (see: [`functions.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.md)), students will know 26 | 27 | - What is the structure of a function 28 | - Output of a function 29 | - simple output 30 | - controlling for the output with `return` 31 | - multiple outputs with lists 32 | - Controlling for the input 33 | - `stopifnot` function 34 | - other methods and error-handling in general 35 | - pre-set inputs 36 | - Exercise for the sampling distribution of the t-statistics, to use: 37 | - conditionals 38 | - loops 39 | - random numbers and random sampling 40 | - writing a function 41 | 42 | ## Datasets used 43 | 44 | - [wms-management-survey](https://gabors-data-analysis.com/datasets/#wms-management-survey) 45 | - [sp500](https://gabors-data-analysis.com/datasets/#sp500) as homework. 46 | 47 | ## Lecture Time 48 | 49 | Ideal overall time: **20-30 mins**. 50 | 51 | This is a relatively short lecture, and it can be even shorter if less emphasis is put on output and input controlling and error-handling. 52 | 53 | ## Homework 54 | 55 | *Type*: quick practice, approx 15 mins, together with [lecture08-conditionals](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture08-conditionals), [lecture09-loops](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture09-loops), and [lecture10-random-numbers](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture10-random-numbers) 56 | 57 | Bootstrapping - using the [`sp500`](https://gabors-data-analysis.com/datasets/#sp500) data 58 | 59 | - download the cleaned data for `sp500` from [OSF](https://osf.io/h64z2/) 60 | - write a function, which calculates the bootstrap standard errors and confidence intervals based on these standard errors. 61 | - the function should have inputs for a) a vector of prices, b) the number of bootstraps, c) the level for the confidence interval 62 | - create a new variable for `sp500`: `daily_return`, which is the difference in the prices from one day to the next day. 63 | - use this `daily_return` variable and calculate the 80% confidence interval based on bootstrap standard errors along with the mean. 64 | 65 | 66 | ## Further material 67 | 68 | - Case study materials from Gabor's da_case_studies repository: on generalization (with bootstrapping), [ch05-stock-market-loss-generalize](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch05-stock-market-loss-generalize); on testing, [ch06-online-offline-price-test](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch06-online-offline-price-test) and [ch06-stock-market-loss-test](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch06-stock-market-loss-test) 69 | - Hadley Wickham and Garrett Grolemund: R for Data Science [Chapter 19](https://r4ds.had.co.nz/functions.html) provides further material on functions with exercises. 70 | - Grant McDermott: Data Science for Economists - [Lecture 10](https://github.com/uo-ec607/lectures/blob/master/10-funcs-intro/10-funcs-intro.md) is a great alternative to introduce functions. 71 | - Roger D.
Peng, Sean Kross, and Brooke Anderson: Mastering Software Development in R, [Chapter 2](https://bookdown.org/rdpeng/RProgDA/advanced-r-programming.html) is a great place to start deepening programming skills. 72 | - Hadley Wickham: [Advanced R](http://adv-r.had.co.nz/Introduction.html) is also a great place to start hard-core programming in R. 73 | 74 | 75 | ## File structure 76 | 77 | - [`functions.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.md) provides material for the live coding session with explanations. 78 | - [`functions.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.Rmd) is the generating Rmarkdown file for [`functions.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.md) 79 | - [`functions.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.R) is a possible realization of the live coding session 80 | -------------------------------------------------------------------------------- /lecture11-functions/functions.R: -------------------------------------------------------------------------------- 1 | ######################### 2 | ## ## 3 | ## Functions ## 4 | ## ## 5 | ######################### 6 | rm(list=ls()) 7 | 8 | # 1) simplest case - calculate the mean 9 | my_avg <- function(x){ 10 | sum_x <- sum(x) 11 | sum_x / length(x) 12 | } 13 | 14 | # Import wms-management data 15 | library(tidyverse) 16 | wms <- read_csv('https://osf.io/uzpce/download') 17 | # save management score as x1 18 | x1 <- wms$management 19 | # Remove wms to keep environment tidy 20 | rm(wms) 21 | # Print out 22 | my_avg(x1) 23 | 24 | # or save it as a variable 25 | avg_x <- my_avg(x1) 26 | avg_x 27 | 28 | # 2) Calculate the standard deviation (the mean is an intermediate step; no input checks yet) 29 | my_fun1 <- function(x){ 30 | sum_x <- sum(x) 31 | # number of observations 32 | N <- length(x) 33 | # Mean of x 34 | mean_x <- sum_x / N 35 | # Variance of x (population variance, dividing by N) 36 | var_x <- sum((x - mean_x)^2 / N) 37 | # Standard deviation of x 38 | sqrt(var_x) 39 | } 40 | 41 | # Get the standard deviation for x1 42 | my_fun1(x1) 43 | 44 | # 3) Control for output: return() decides what the function gives back 45 | my_fun2 <- function(x){ 46 | sum_x <- sum(x) 47 | # number of observations 48 | N <- length(x) 49 | # Mean of x 50 | mean_x <- sum_x / N 51 | # Variance of x (sample variance, dividing by N - 1) 52 | var_x <- sum((x - mean_x)^2) / (N - 1) 53 | # Standard deviation of x (note: this value is discarded, return() below determines the output) 54 | sqrt(var_x) 55 | return(mean_x) 56 | } 57 | 58 | # Get the mean for x1 59 | my_fun2(x1) 60 | 61 | # 4) Multiple outputs 62 | my_fun3 <- function(x){ 63 | sum_x <- sum(x) 64 | # number of observations 65 | N <- length(x) 66 | # Mean of x 67 | mean_x <- sum_x / N 68 | # Variance of x 69 | var_x <- sum((x - mean_x)^2) / (N - 1) 70 | # Standard deviation of x 71 | sd_x <- sqrt(var_x) 72 | out <- list('sum' = sum_x, 'mean' = mean_x, 'var' = var_x ,'sd' = sd_x) 73 | return(out) 74 | } 75 | 76 | # Check the output 77 | out3 <- my_fun3(x1) 78 | # get all the output as list 79 | out3 80 | # get e.g.
the mean 81 | out3$mean 82 | 83 | # 5) Controlling for input 84 | my_avg_chck <- function(x){ 85 | stopifnot(is.numeric(x)) 86 | sum_x <- sum(x) 87 | sum_x / length(x) 88 | } 89 | 90 | # Good input 91 | my_avg_chck(x1) 92 | # Bad input (throws an error) 93 | my_avg_chck('Hello world') 94 | 95 | # 6) Multiple inputs 96 | conf_interval <- function(x, level = 0.95){ 97 | # mean of x 98 | mean_x <- mean(x, na.rm = TRUE) 99 | # standard deviation 100 | sd_x <- sd(x, na.rm = TRUE) 101 | # number of observations in x 102 | n_x <- sum(!is.na(x)) 103 | # Calculate the theoretical SE for mean of x 104 | se_mean_x <- sd_x / sqrt(n_x) 105 | # Calculate the CI 106 | if (level == 0.95){ 107 | CI_mean <- c(mean_x - 1.96*se_mean_x, mean_x + 1.96*se_mean_x) 108 | } else if (level == 0.99){ 109 | CI_mean <- c(mean_x - 2.58*se_mean_x, mean_x + 2.58*se_mean_x) 110 | } else { 111 | stop('No such level implemented for confidence interval, use 0.95 or 0.99') 112 | } 113 | out <- list('mean'=mean_x,'CI_mean' = CI_mean) 114 | return(out) 115 | } 116 | # Get some CI values 117 | conf_interval(x1, level = 0.95) 118 | conf_interval(x1) 119 | conf_interval(x1, level = 0.99) 120 | conf_interval(x1, level = 0.98) # stops with an error by design 121 | 122 | # Task - flexible level 123 | conf_interval2 <- function(x, level = 0.95){ 124 | # mean of x 125 | mean_x <- mean(x, na.rm = TRUE) 126 | # standard deviation 127 | sd_x <- sd(x, na.rm = TRUE) 128 | # number of observations in x 129 | n_x <- sum(!is.na(x)) 130 | # Calculate the theoretical SE for mean of x 131 | se_mean_x <- sd_x / sqrt(n_x) 132 | # Calculate the CI (note: the check needs &&; with | the condition would always be TRUE) 133 | if (level > 0 && level < 1){ 134 | crit_val <- qnorm(level + (1 - level)/2) 135 | CI_mean <- c(mean_x - crit_val*se_mean_x, mean_x + crit_val*se_mean_x) 136 | } else { 137 | stop('level must be between 0 and 1') 138 | } 139 | out <- list('mean'=mean_x,'CI_mean' = CI_mean) 140 | return(out) 141 | } 142 | # Get some CI values 143 | conf_interval2(x1, level = 0.95) 144 | conf_interval2(x1) 145 | conf_interval2(x1, level = 0.99) 146 | conf_interval2(x1, level = 0.98) 147 | 148 | ########## 149 | # A solution for Exercise: sampling distribution 150 | library(tidyverse) 151 | 152 | # Function for sampling distribution 153 | get_sampling_dists <- function(y, rep_num = 1000, sample_size = 1000){ 154 | # Check inputs 155 | stopifnot(is.numeric(y)) 156 | stopifnot(is.numeric(rep_num), length(rep_num) == 1, rep_num > 0) 157 | stopifnot(is.numeric(sample_size), length(sample_size) == 1 , 158 | sample_size > 0, sample_size <= length(y)) 159 | # initialize the for loop 160 | set.seed(100) 161 | mean_stat <- double(rep_num) 162 | t_stat_A <- double(rep_num) 163 | t_stat_B <- double(rep_num) 164 | # Usual scaling factor for the SE 165 | sqrt_n <- sqrt(sample_size) 166 | for (i in 1:rep_num) { 167 | # Need a new sample 168 | y_i <- sample(y, sample_size, replace = FALSE) 169 | # Mean for sample_i 170 | mean_stat[ i ] <- mean(y_i) 171 | # SE for Mean 172 | se_mean <- sd(y_i) / sqrt_n 173 | # T-statistics for hypotheses 174 | t_stat_A[ i ] <- (mean_stat[ i ] - 1) / se_mean 175 | t_stat_B[ i ] <- mean_stat[ i ] / se_mean 176 | } 177 | out <- tibble(mean_stat = mean_stat, t_stat_A = t_stat_A, 178 | t_stat_B = t_stat_B) 179 | } 180 | 181 | # Create y 182 | set.seed(123) 183 | y <- runif(10000, min = 0, max = 2) 184 | # Get some sampling distribution 185 | sampling_y <- get_sampling_dists(y, rep_num = 1000, sample_size = 100) 186 | 187 | # Plot these distributions 188 | ggplot(sampling_y, aes(x = mean_stat)) + 189 | geom_histogram(aes(y = ..density..), bins = 60, color = 'navyblue',
fill = 'navyblue') + 190 | geom_vline(xintercept = 1, linetype = 'dashed', color = 'blue', size = 1)+ 191 | geom_vline(xintercept = mean(y), color = 'red', size = 1) + 192 | geom_vline(xintercept = mean(sampling_y$mean_stat), color = 'black', size = 1)+ 193 | stat_function(fun = dnorm, args = list(mean = mean(y), sd = sd(y) / sqrt(100)) , 194 | color = 'red', size = 1) + 195 | labs(x = 'Sampling distribution of the mean', y = 'Density') + 196 | theme_bw() 197 | 198 | 199 | # Plot distribution for t-stats - Hypothesis A 200 | ggplot(sampling_y, aes(x = t_stat_A)) + 201 | geom_histogram(aes(y = ..density..), bins = 60, fill = 'navyblue') + 202 | stat_function(fun = dnorm, args = list(mean = 0, sd = 1) , 203 | color = 'red', size = 1) + 204 | labs(x = 'Sampling distribution of t-stats: hypothesis A', y = 'Density') + 205 | theme_bw() 206 | 207 | 208 | # Plot distribution for t-stats - Hypothesis B 209 | ggplot(sampling_y, aes(x = t_stat_B)) + 210 | geom_histogram(aes(y = ..density..), bins = 60, fill = 'navyblue') + 211 | stat_function(fun = dnorm, args = list(mean = 0, sd = 1) , 212 | color = 'red', size = 1) + 213 | scale_x_continuous(limits = c(-4,30))+ 214 | labs(x = 'Sampling distribution of t-stats: hypothesis B', y = 'Density') + 215 | theme_bw() 216 | 217 | 218 | -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-2.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-3.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-2.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-3.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-3.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-2.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-3.png -------------------------------------------------------------------------------- /lecture12-intro-to-regression/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 12: Introduction to regression 2 | 3 | ## Motivation 4 | 5 | You want to identify hotels in a city that are good deals: underpriced for their location and quality. You have scraped the web for data on all hotels in the city, and you have cleaned the data. You have carried out exploratory data analysis that revealed that hotels closer to the city center tend to be more expensive, but there is a lot of variation in prices between hotels at the same distance. How should you identify hotels that are underpriced relative to their distance to the city center? In particular, how should you capture the average price–distance relationship that would provide you a benchmark, to which you can compare actual prices to find good deals? 6 | 7 | The analysis of hotel prices and distance to the city center reveals that hotels further away from the center are less expensive by a certain amount, on average. Can you use this result to estimate how much more revenue a hotel developer could expect if it were to build a hotel closer to the center rather than farther away? Regression is a model for the conditional mean: the mean of y for different values of one or more x variables. Regression is used to uncover patterns of association. That, in turn, is used in the causal analysis, to uncover the effect of x on y, and in predictions, to arrive at a good guess of what the value of y is if we don’t know it, but we know the value of x. 8 | 9 | In this lecture, we introduce simple non-parametric regression and simple linear regression, and we show how to visualize their results. We then discuss simple linear regression in detail. We introduce the regression equation, how its coefficients are uncovered (estimated) in actual data, and we emphasize how to interpret the coefficients. 
We introduce the concepts of predicted value and residual and goodness of fit, and we discuss the relationship between regression and correlation. 10 | 11 | ## This lecture 12 | 13 | This lecture introduces regressions via the [hotels-vienna dataset](https://gabors-data-analysis.com/datasets/#hotels-vienna). It overviews models based on simple binary means, binscatters, lowess nonparametric regression, and introduces simple linear regression techniques. The lecture illustrates the use of predicted values and regression residuals with linear regression, but as homework, the same exercise is repeated with a binscatter-based model. 14 | 15 | This lecture is based on [Chapter 07, A: *Finding a good deal among hotels with simple regression*](https://gabors-data-analysis.com/casestudies/#ch07a-finding-a-good-deal-among-hotels-with-simple-regression) 16 | 17 | ## Learning outcomes 18 | After successfully completing [`hotels_intro_to_regression.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/raw_codes/hotels_intro_to_regression.R) students should be able to: 19 | 20 | - Binary means: 21 | - Calculate predictions based on the means of two categories and create an annotated graph 22 | - Binscatter: 23 | - Create means based on differently defined bins for the X variable 24 | - Show two different graphs: simple mean predictions for each bin as a dot, and a scatter with step functions 25 | - Lowess nonparametric regression: 26 | - How to create a lowess (loess) graph 27 | - What is the output of a loess model? What are the main advantages and disadvantages? 28 | - Simple linear regression 29 | - How to create a simple linear regression line in a scatterplot 30 | - The classical `lm` command and its limitations 31 | - The `feols` function (from the `fixest` package): estimate two models with and without heteroskedasticity-robust SEs and compare the two models 32 | - Have an idea about the `estimatr` package and the `lm_robust` command 33 | - How to get predicted values and errors of predictions 34 | - Get the best and worst deals: identify hotels with the smallest/largest errors 35 | - Visualize the errors via a histogram and a scatterplot, annotating the best and worst 5 deals. 36 | 37 | ## Dataset used 38 | 39 | - [hotels-vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna) 40 | 41 | ## Lecture Time 42 | 43 | Ideal overall time: **60 mins**. 44 | 45 | Going through [`hotels_intro_to_regression.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/raw_codes/hotels_intro_to_regression.R) takes around *45-50 minutes*; the rest is for the tasks. It builds on [lecture07-ggplot-indepth](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture07-ggplot-indepth) as it requires building a boxplot; this part can be skipped. 46 | 47 | 48 | ## Homework 49 | 50 | *Type*: quick practice, approx 15 mins 51 | 52 | Use the binscatter model with 7 bins and save the predicted values and errors (true price minus the predicted value). Find the best and worst 10 deals and visualize them with a scatterplot, highlighting the under/overpriced hotels with these best/worst deals according to this model. Compare to the simple linear regression. Which model would you use? Argue!
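To get started on the binscatter part of the homework, a minimal sketch (the `price` and `distance` column names from the hotels-vienna data are assumed, and the path to the data is a placeholder; `cut_number()` ships with `ggplot2`, `slice_min()`/`slice_max()` with `dplyr`):

```r
library(tidyverse)

# assumed: the cleaned hotels-vienna data; adjust the path to your own copy
hotels <- read_csv('data/hotels-vienna.csv')

hotels <- hotels %>%
  mutate(dist_bin = cut_number(distance, n = 7)) %>%  # 7 equal-sized bins
  group_by(dist_bin) %>%
  mutate(pred_bin = mean(price)) %>%                  # binscatter prediction: bin mean
  ungroup() %>%
  mutate(error_bin = price - pred_bin)                # true price minus prediction

best10  <- hotels %>% slice_min(error_bin, n = 10)    # most underpriced
worst10 <- hotels %>% slice_max(error_bin, n = 10)    # most overpriced
```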
53 | 54 | 55 | ## Further material 56 | 57 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch07-hotels-simple-reg](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch07-hotels-simple-reg) 58 | - On ggplot, see Chapter 3.5-6 and Chapter 5.6 of [Kieran H. (2019): Data Visualization](https://socviz.co/makeplot.html#mapping-aesthetics-vs-setting-them) or [Winston C. (2022): R Graphics Cookbook, Chapter 5](https://r-graphics.org/chapter-scatter) 59 | - On regression, [Grant McDermott: Data Science for Economists, Course material Lecture 08](https://github.com/uo-ec607/lectures/tree/master/08-regression) provides a somewhat different approach, but can be a nice supplement 60 | 61 | 62 | ## Folder structure 63 | 64 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture12-intro-to-regression/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 65 | - [`hotels_intro_to_regression.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/raw_codes/hotels_intro_to_regression.R) is the main material for this lecture. 66 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture12-intro-to-regression/complete_codes) includes code with the solution for [`hotels_intro_to_regression.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/raw_codes/hotels_intro_to_regression.R) as [`hotels_intro_to_regression_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/complete_codes/hotels_intro_to_regression_fin.R) 67 | 68 | -------------------------------------------------------------------------------- /lecture14-simple-regression/complete_codes/life_exp_clean.R: -------------------------------------------------------------------------------- 1 | ############################################# 2 | # # 3 | # Lecture 14 # 4 | # # 5 | # Auxiliary file to clean data # 6 | # - can practice, but not recommended # 7 | # # 8 | # Case Study: # 9 | # Life-expectancy and income # 10 | # # 11 | ############################################# 12 | 13 | 14 | 15 | # Clear memory 16 | rm(list=ls()) 17 | 18 | library(tidyverse) 19 | library(modelsummary) 20 | 21 | # Call the data from github 22 | my_url <- 'https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/main/lecture14-simple-regression/data/raw/WDI_lifeexp_raw.csv' 23 | df <- read_csv(my_url) 24 | 25 | ## Check the observations: 26 | # There are a lot of grouping observations, 27 | # which usually contain a number in their iso2c code 28 | d1 <- df %>% filter(grepl('[[:digit:]]', df$iso2c)) 29 | d1 30 | # Filter these out 31 | df <- df %>% filter(!grepl('[[:digit:]]', df$iso2c)) 32 | 33 | # Some grouping observations are still there, check each of them 34 | # HK - Hong Kong, China 35 | # OE - OECD members 36 | # all with starting X, except XK which is Kosovo 37 | # all with starting Z, except ZA-South Africa, ZM-Zambia and ZW-Zimbabwe 38 | 39 | # 1st drop specific values 40 | drop_id <- c('EU','HK','OE') 41 | # Check the filtering 42 | df %>% filter(grepl(paste(drop_id, collapse='|'), df$iso2c)) 43 | # Save the opposite 44 | df <- df %>% filter(!grepl(paste(drop_id, collapse='|'), df$iso2c)) 45 | 46 | # 2nd drop values with certain starting characters 47 | # Get the first letter from iso2c 48 | fl_iso2c <- substr(df$iso2c, 1, 1) 49 | retain_id <- c('XK','ZA','ZM','ZW') 50 | # Check
51 | d1 <- df %>% filter(grepl('X', fl_iso2c) | grepl('Z', fl_iso2c) & 52 | !grepl(paste(retain_id, collapse='|'), df$iso2c)) 53 | # Save observations which are the opposite (use of !) 54 | df <- df %>% filter(!(grepl('X', fl_iso2c) | grepl('Z', fl_iso2c) & 55 | !grepl(paste(retain_id, collapse='|'), df$iso2c))) 56 | 57 | # Clear non-needed variables 58 | rm(d1, drop_id, fl_iso2c, retain_id) 59 | 60 | ### 61 | # Check for missing observations 62 | m <- df %>% filter(!complete.cases(df)) 63 | # Drop if life expectancy, GDP or total population is missing -> keep complete cases (a missing iso2c alone is allowed) 64 | df <- df %>% filter(complete.cases(df) | is.na(df$iso2c)) 65 | 66 | ### 67 | # CLEAN VARIABLES 68 | # 69 | # Recreate table: 70 | # Rename variables and scale them 71 | # Drop all the others !! in this case write into the readme that it refers to year 2019 !! 72 | df <-df %>% transmute(country = country, 73 | population=SP.POP.TOTL/1000000, 74 | gdppc=NY.GDP.PCAP.PP.KD/1000, 75 | lifeexp=SP.DYN.LE00.IN) 76 | 77 | ### 78 | # Check for extreme values 79 | # histograms for all numeric variables 80 | df %>% 81 | keep(is.numeric) %>% 82 | gather() %>% 83 | ggplot(aes(value)) + 84 | facet_wrap(~key, scales = 'free') + 85 | geom_histogram(bins=30) 86 | 87 | # It seems we have large values for population: 88 | df %>% filter(population > 500) 89 | # These are India and China... not extreme values 90 | 91 | # Check the summary statistics as well 92 | datasummary_skim(df) 93 | 94 | # Save the cleaned data file to your working directory 95 | my_path <- 'ENTER YOUR OWN PATH' 96 | write_csv(df, paste0(my_path,'data/clean/WDI_lifeexp_clean.csv')) 97 | 98 | # I have pushed it into github as well! 99 | 100 | 101 | 102 | -------------------------------------------------------------------------------- /lecture14-simple-regression/complete_codes/life_exp_getdata_fin.R: -------------------------------------------------------------------------------- 1 | ############################################# 2 | # # 3 | # Lecture 14 # 4 | # # 5 | # Getting the data for analysis # 6 | # - practice with WDI package # 7 | # # 8 | # Case Study: # 9 | # Life-expectancy and income # 10 | # # 11 | ############################################# 12 | 13 | 14 | # Clear memory 15 | rm(list=ls()) 16 | 17 | # Call packages 18 | if (!require(WDI)){ 19 | install.packages('WDI') 20 | library(WDI) 21 | } 22 | library(tidyverse) 23 | 24 | 25 | # Reminder on how WDI works - it is an API 26 | # Search for variables which contain GDP 27 | a <- WDIsearch('gdp') 28 | # Narrow down the search for: GDP + something + capita + something + constant 29 | a <- WDIsearch('gdp.*capita.*constant') 30 | 31 | # Get GDP data 32 | gdp_data <- WDI(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2019, end=2019) 33 | 34 | ## 35 | # Task: get the GDP data, along with `population, total' and `life expectancy at birth' 36 | # for year 2019 and save to your raw folder! 37 | # Note: I have pushed it to Github, we will use that later, just to be on the same page! 38 | a <- WDIsearch('population, total') 39 | b <- WDIsearch('life expectancy at birth') 40 | 41 | # Get all the data for year 2019 42 | data_raw <- WDI(indicator=c('NY.GDP.PCAP.PP.KD','SP.DYN.LE00.IN','SP.POP.TOTL'), 43 | country='all', start=2019, end=2019) 44 | 45 | # Save the raw data file to your working directory 46 | my_path <- 'ENTER YOUR OWN PATH' 47 | write_csv(data_raw, paste0(my_path,'data/raw/WDI_lifeexp_raw.csv')) 48 | 49 | # I have pushed it to Github, we will use that! 50 | # Note these are only the raw files!
They are cleaned in a separate file and the results are saved to the clean folder! 51 | 52 | 53 | -------------------------------------------------------------------------------- /lecture14-simple-regression/raw_codes/life_exp_getdata.R: -------------------------------------------------------------------------------- 1 | ############################################# 2 | # # 3 | # Lecture 14 # 4 | # # 5 | # Getting the data for analysis # 6 | # - practice with WDI package # 7 | # # 8 | # Case Study: # 9 | # Life-expectancy and income # 10 | # # 11 | ############################################# 12 | 13 | 14 | # Clear memory 15 | rm(list=ls()) 16 | 17 | # Call packages 18 | if (!require(WDI)){ 19 | install.packages('WDI') 20 | library(WDI) 21 | } 22 | library(tidyverse) 23 | 24 | 25 | # Reminder on how WDI works - it is an API 26 | # Search for variables which contain GDP 27 | a <- WDIsearch('gdp') 28 | # Narrow down the search for: GDP + something + capita + something + constant 29 | a <- WDIsearch('gdp.*capita.*constant') 30 | 31 | # Get GDP data 32 | gdp_data <- WDI(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2019, end=2019) 33 | 34 | ## 35 | # Task: get the GDP data, along with `population, total' and `life expectancy at birth' 36 | # for year 2019 and save to your raw folder! 37 | # Note: I have pushed it to Github, we will use that later, just to be on the same page! 38 | # Note these are only the raw files! They are cleaned in a separate file and the results are saved to the clean folder! 39 | 40 | 41 | -------------------------------------------------------------------------------- /lecture16-binary-models/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 16: Binary outcome - modeling probabilities 2 | 3 | ## Motivation 4 | 5 | Does smoking make you sick? And can smoking make you sick in late middle age even if you stopped years earlier? You have data on many healthy people in their fifties from various countries, and you know whether they stayed healthy four years later. You have variables on their smoking habits, age, income, and many other characteristics. How can you use this data to estimate how much more likely non-smokers are to stay healthy? How can you uncover if that depends on whether they never smoked or are former smokers? And how can you tell if that association is the result of smoking itself or, instead, underlying differences in smoking by education, income, and other factors? 6 | 7 | The lecture is related to the chapter that discusses probability models: regressions with binary y variables. In a sense, we can treat a binary y variable just like any other variable and use regression analysis as we would otherwise. With a binary y variable, we can estimate nonlinear probability models instead of the linear ones. Data analysts need to have a good understanding of when to use these different probability models, and how to interpret and evaluate their results. 8 | 9 | ## This lecture 10 | 11 | This lecture introduces binary outcome models with an analysis of health outcomes with multiple variables based on the [share-health](https://gabors-data-analysis.com/datasets/#share-health) dataset. First, we introduce saturated models (smoking on health) and linear probability models with multiple explanatory variables. We check the predicted outcome probabilities for certain groups. Then we focus on non-linear binary models: the logit and probit model.
We estimate marginal effects to interpret the average (marginal) effects of variables on the outcome probabilities. We overview goodness-of-fit statistics (R2, Pseudo-R2, Brier score, and Log-loss) along with visual and descriptive inspection of the predicted probabilities. Finally, we calculate the estimated bias and the calibration curve to better understand model performance. 12 | 13 | This lecture is based on [Chapter 11, A: Does smoking pose a health risk?](https://gabors-data-analysis.com/casestudies/#ch11a-does-smoking-pose-a-health-risk) 14 | 15 | ## Learning outcomes 16 | After successfully completing the codes in [`binary_models.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/raw_codes/binary_models.R), students should be able to: 17 | 18 | 19 | - Calculate by hand or estimate saturated models 20 | - Visualize and understand binary outcome scatterplots 21 | - Estimate Linear Probability Models (LPM) 22 | - Use `feols` to estimate regressions with multiple explanatory variables 23 | - Use `etable` to compare multiple candidate models and report model statistics such as R2 to evaluate models 24 | - Understand the limitations of LPM 25 | - Carry out sub-group analysis based on predicted probabilities 26 | - Estimate Non-Linear Probability Models 27 | - Use `feglm` with `link = 'logit'` or `'probit'` to estimate logit or probit models 28 | - Estimate marginal effects with the `marginaleffects` package 29 | - Use `etable` to compare logit and probit coefficients 30 | - Use `modelsummary` (from the `modelsummary` package) to compare LPM, logit/probit, and logit/probit with marginal effects 31 | - Handle the `modelsummary` function to get relevant goodness-of-fit measures 32 | - Use the `fitstat_register()` function for `etable` to calculate user-supplied goodness-of-fit statistics, such as the *Brier score* or *Log-loss* measures 33 | - Understand the usefulness of comparing the distribution of predicted probabilities for different models 34 | - Understand the usefulness of comparing descriptive statistics of the predicted probabilities for different models 35 | - Calculate the bias of the model along with the calibration curve 36 | 37 | ## Datasets used 38 | 39 | - [share-health](https://gabors-data-analysis.com/datasets/#share-health) 40 | 41 | ## Lecture Time 42 | 43 | Ideal overall time: **100 mins**. 44 | 45 | Going through [`binary_models.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/raw_codes/binary_models.R) takes around *80-90 minutes* as there is much discussion and interpretation of the models. Solving the tasks takes the remaining *10-20 minutes*. 46 | 47 | 48 | ## Homework 49 | 50 | *Type*: quick practice, approx 20 mins 51 | 52 | Use the same [share-health](https://gabors-data-analysis.com/datasets/#share-health) dataset, but now use `smoking` as your outcome variable, as this task asks you to predict whether a person is a smoker. Use similar variables (except `stayshealthy`) to explain `smoking`. Run an LPM, a logit, and a probit model. Compare the coefficients of these models along with the average marginal effects. Compute the goodness-of-fit statistics (R2, Pseudo-R2, Brier score, log-loss) for all of the models. Choose one, calculate the bias, and plot the calibration curve.
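For reference, a minimal sketch of how the Brier score and log-loss can be computed directly from predicted probabilities (hypothetical vectors: `y` is the observed 0/1 outcome, `pred` the model's predicted probability):

```r
# Brier score: mean squared distance between predicted probability and outcome
brier <- mean((pred - y)^2)

# Log-loss: average negative log-likelihood; clip predictions to avoid log(0)
eps <- 1e-15
p <- pmax(pmin(pred, 1 - eps), eps)
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))
```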
53 | 54 | 55 | 56 | ## Further material 57 | 58 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch11-smoking-health-risk](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch11-smoking-health-risk) 59 | - Coding and multiple linear regression: partially related material in Chapter 4, especially Ch. 4.2, of [James-Witten-Hastie-Tibshirani (2013) - An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/) 60 | - Some other useful R-related resources, using base-R methods: [Christoph Hanck, Martin Arnold, Alexander Gerber, and Martin Schmelzer: Introduction to Econometrics with R, Chapter 11](https://www.econometrics-with-r.org/11-rwabdv.html) or [Ramzi W. Nahhas: Introduction to Regression Methods for Public Health Using R 61 | ](https://bookdown.org/rwnahhas/RMPH/blr.html). 62 | 63 | 64 | ## Folder structure 65 | 66 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture16-binary-models/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 67 | - [`binary_models.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/raw_codes/binary_models.R) is the main material for this lecture. 68 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture16-binary-models/complete_codes) includes code with the solution for [`binary_models.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/raw_codes/binary_models.R) as [`binary_models_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/complete_codes/binary_models_fin.R) 69 | 70 | -------------------------------------------------------------------------------- /lecture17-dates-n-times/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 17: Date and time manipulations 2 | 3 | ## Motivation 4 | 5 | Time series data is often used to analyze business, economic, and policy questions. Time series data presents additional opportunities as well as additional challenges for regression analysis. Unlike cross-sectional data, it enables examining how y changes when x changes, and it also allows us to examine what happens to y right away or with a delay. However, variables in time series data come with some special features that affect how we should estimate regressions, and how we can interpret their coefficients. 6 | 7 | One of these features is the frequency of the time series. It can vary from seconds to years. Time series with more frequent observations have higher frequency, e.g. monthly frequency is higher than yearly frequency, but it is lower than daily frequency. The frequency may also be irregular with gaps in-between. Gaps in time series data can be viewed as missing values of variables. But they tend to have specific causes. To run a regression of y on x in time series data, the two variables need to be at the same time series frequency. When the time series frequencies of y and x are different, we need to adjust one of them. Most often that means aggregating the variable at a higher frequency (e.g., from weekly to monthly). With flow variables, such as sales, aggregation means adding up; with stock variables and other kinds of variables, such as prices, it is often taking an average for the period or taking the last value, such as the closing price.
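To make the flow-versus-stock distinction concrete, a minimal sketch of monthly aggregation (assuming a hypothetical daily tibble `df` with `date`, `sales`, and `price` columns; `floor_date()` comes from `lubridate`):

```r
library(tidyverse)
library(lubridate)

monthly <- df %>%
  mutate(month = floor_date(date, 'month')) %>%   # assign each day to its month
  group_by(month) %>%
  summarise(sales = sum(sales),                   # flow variable: add up
            price = last(price))                  # price: take the closing value
```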
8 | 9 | Another fundamental feature of time series data is that variables evolve with time. They may hover around a stable average value, or they may drift upwards or downwards. A variable in time series data follows a trend if it tends to change in one direction; in other words, it has a tendency to increase or decrease. Another possible issue is seasonality. Seasonality means that the value of the variable is expected to follow a cyclical pattern, tracking the seasons of the year, days of the week, or hours of the day. Because of such systematic changes, later observations tend to be different from earlier observations. Understanding trends and seasonality is important because they make regression analysis challenging. They are examples of a broader concept, non-stationarity. Stationarity means stability; non-stationarity means the lack of stability. Stationary time series variables have the same expected 10 | value and the same distribution at all times. Trends and seasonality violate stationarity because the expected value is different at different times. 11 | 12 | ## This lecture 13 | 14 | This lecture introduces basic date and time-variable manipulations. The first part starts with the basics of the `lubridate` package, overviewing time-related functions and manipulations with time-related values and variables. The second part discusses aggregating time-series data across different frequencies, along with visualization for time-series data and unit root tests. 15 | 16 | This lecture utilizes the case study of [Chapter 12, A: Returns on a company stock and market returns](https://gabors-data-analysis.com/casestudies/#ch12a-returns-on-a-company-stock-and-market-returns) as homework and uses the [`stocks-sp500`](https://gabors-data-analysis.com/datasets/#stocks-sp500) dataset. 17 | 18 | ## Learning outcomes 19 | After successfully completing [`date_time_manipulations.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/raw_codes/date_time_manipulations.R), students should be able to: 20 | 21 | - Work with the `lubridate` package, especially 22 | - creating specific time variables and converting other types of variables into a date or datetime object 23 | - understanding the importance of time zones 24 | - Get specific parts of a date object such as `year, quarter, month, wday, yday, day, leap_year` 25 | - Round to the closest month, year, quarter, etc. 26 | - Understand the difference between durations and periods 27 | - Carry out time aggregation 28 | - aggregate different time series objects to lower frequencies, using the mean/median/max/end date, etc. 29 | - add `lag`-ged and differenced variables to the data 30 | - Visualize time series 31 | - handle the time variable on the x-axis with `scale_x_date()` 32 | - use `facet_wrap` to stack multiple graphs as an alternative to `ggpubr` 33 | - standardize variables and put multiple lines into one graph 34 | - Run unit root tests using the `aTSA` package's `pp.test` function 35 | - understand the result of the Phillips-Perron test and decide whether the variable needs to be differenced or not 36 | 37 | ## Datasets used 38 | 39 | - [`stocks-sp500`](https://gabors-data-analysis.com/datasets/#stocks-sp500) 40 | 41 | ## Lecture Time 42 | 43 | Ideal overall time: **35-40 mins**. 44 | 45 | Going through [`date_time_manipulations.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/raw_codes/date_time_manipulations.R) takes around *30 minutes*.
There are some discussions and interpretations of the time series (e.g. stationarity). Solving the tasks takes the remaining *5-10 minutes*. The lecture can be shortened by only showing the methods. It will be partially repeated in [lecture18-timeseries-regression](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression). 46 | 47 | 48 | ## Homework 49 | 50 | *Type*: quick practice, approx 10 mins 51 | 52 | Estimate the *beta* coefficient by regressing quarterly Microsoft stock log returns on quarterly SP500 log returns. Use the [`stocks-sp500`](https://gabors-data-analysis.com/datasets/#stocks-sp500) dataset. Take care when aggregating the data: use the last day in the quarter, then take logs, and then difference the variable to get log returns. When estimating the regression, use heteroskedasticity-robust standard errors (in the next lecture we learn how to use Newey-West SEs). 53 | 54 | 55 | ## Further material 56 | 57 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch12-stock-returns-risk](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch12-stock-returns-risk/ch12-stock-returns-risk.R) 58 | - Hadley Wickham and Garrett Grolemund, R for Data Science: [Chapter 16](https://r4ds.had.co.nz/dates-and-times.html) discusses date and time formats in more detail. 59 | - The [`timetk` package](https://business-science.github.io/timetk/index.html) is a well-documented, advanced time-series package with many possibilities and great solutions. A good starting point for further material on time series with R. 60 | - The [`lubridate` package](https://lubridate.tidyverse.org/index.html) has good documentation, worth checking. 61 | 62 | ## Folder structure 63 | 64 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture17-dates-n-times/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 65 | - [`date_time_manipulations.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/raw_codes/date_time_manipulations.R) is the main material for this lecture. 66 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture17-dates-n-times/complete_codes) includes the code with solutions for [`date_time_manipulations.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/raw_codes/date_time_manipulations.R) as [`date_time_manipulations_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/complete_codes/date_time_manipulations_fin.R) 67 | 68 | -------------------------------------------------------------------------------- /lecture18-timeseries-regression/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 18: Introduction to time-series regression 2 | 3 | ## Motivation 4 | 5 | Heating and cooling are potentially important uses of electricity. To investigate how weather conditions affect electricity consumption, you have collected data on temperature and residential electricity consumption in a hot region. How should you estimate the association between temperature and electricity consumption? How should you define the variables of interest, and how should you prepare the data, which has daily observations on temperature and monthly observations on electricity consumption?
Should you worry about the fact that both electricity consumption and temperature vary a lot across months within years, and if yes, what should you do about it? 6 | 7 | Time series data is often used to analyze business, economic, and policy questions. Time series data presents additional opportunities as well as additional challenges for regression analysis. Unlike cross-sectional data, it enables examining how y changes when x changes, and it also allows us to examine what happens to y right away or with a delay. However, variables in time series data come with some special features that affect how we should estimate regressions, and how we can interpret their coefficients. 8 | 9 | ## This lecture 10 | 11 | This lecture introduces time-series regression via the [arizona-electricity](https://gabors-data-analysis.com/datasets/#arizona-electricity) dataset. During this lecture, students manipulate time-series data along time dimensions, create multiple time-series related graphs, and get familiar with (partial) autocorrelation. Differenced variables, lags of the outcome, lags of the explanatory variables, and (deterministic) seasonality are used in the regression models. These models are estimated via `feols` with Newey-West standard errors. Model comparisons and estimating cumulative effects with valid SEs are shown. 12 | 13 | This lecture is based on [Chapter 12, B: Electricity consumption and temperature](https://gabors-data-analysis.com/casestudies/#ch12b-electricity-consumption-and-temperature) 14 | 15 | ## Learning outcomes 16 | After successfully completing [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R), students should be able to: 17 | 18 | - Merge different time-series data 19 | - Create time-series related descriptives and graphs 20 | - handle dates on the axis with different formatting 21 | - import source code from a URL via `source_url` from `devtools` 22 | - create autocorrelation and partial autocorrelation graphs and interpret them 23 | - Run time-series regressions with `feols` from `fixest` 24 | - Understand why defining the period and id is important with the `fixest` package 25 | - Estimate Newey-West standard errors and understand the role of lags 26 | - Control for seasonality via dummies 27 | - Add lagged variables to the model (and possibly leads as well) 28 | - Understand how and why to use the same time interval when comparing competing time-series models 29 | - Estimate the standard error(s) for the cumulative effect 30 | 31 | ## Datasets used 32 | 33 | - [arizona-electricity](https://gabors-data-analysis.com/datasets/#arizona-electricity) 34 | 35 | ## Lecture Time 36 | 37 | Ideal overall time: **60-80 mins**. 38 | 39 | Going through [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R) takes around *50-70 minutes*, as there are some discussions and interpretations of the time series (e.g. stationarity, transformation of variables, etc.). Solving the tasks takes the remaining *5-10 minutes*. 40 | 41 | 42 | ## Homework 43 | 44 | *Type*: quick practice, approx 20 mins 45 | 46 | You will use the [case-shiller-la](https://gabors-data-analysis.com/datasets/#case-shiller-la) dataset to build a model for unemployment based on the Shiller price index. Load the data and consider only `pn` (Shiller price index) and `un` (unemployment) as the variables of interest. Both are seasonally adjusted.
Decide which transformation to use to make the variables stationary. Create models where you predict unemployment based on the Shiller price index. You should have at least one model that uses only contemporaneous effects and one that uses lagged variables of both variables as explanatory variables. 47 | 48 | 49 | ## Further material 50 | 51 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch12-electricity-temperature](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch12-electricity-temperature) 52 | - A handy, but somewhat different, approach to time-series analysis can be found in [James Long and Paul Teetor: R Cookbook (2019), Chapter 14](https://rc2e.com/timeseriesanalysis) 53 | - A good starting point for advanced methods in time-series analysis is [`modeltime`](https://business-science.github.io/modeltime/), which introduces automated, machine learning, and deep learning-based analysis; its supplementary package [`timetk`](https://business-science.github.io/timetk/index.html) has many great time-series related manipulations. 54 | 55 | ## Folder structure 56 | 57 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 58 | - [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R) is the main material for this lecture. 59 | - [`ggplot.acorr.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/ggplot.acorr.R) is an auxiliary function to plot (partial) autocorrelation graphs, by [Kevin Liu](https://rh8liuqy.github.io/ACF_PACF_by_ggplot2.html). This file is `source_url`-ed into [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R). 60 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/complete_codes) includes the code with solutions for [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R) as [`intro_time_series_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/complete_codes/intro_time_series_fin.R) 61 | 62 | -------------------------------------------------------------------------------- /lecture18-timeseries-regression/raw_codes/ggplotacorr.R: -------------------------------------------------------------------------------- 1 | ## Auto-correlation function and Partial-autocorrelation function 2 | # by: https://rh8liuqy.github.io/ACF_PACF_by_ggplot2.html 3 | 4 | 5 | ggplotacorr <- function(data, lag.max = 24, ci = 0.95, large.sample.size = TRUE, horizontal = TRUE,...)
{ 6 | 7 | require(ggplot2) 8 | require(dplyr) 9 | require(cowplot) 10 | 11 | if(horizontal == TRUE) {numofrow <- 1} else {numofrow <- 2} 12 | 13 | list.acf <- acf(data, lag.max = lag.max, type = "correlation", plot = FALSE) 14 | N <- as.numeric(list.acf$n.used) 15 | df1 <- data.frame(lag = list.acf$lag, acf = list.acf$acf) 16 | df1$lag.acf <- dplyr::lag(df1$acf, default = 0) 17 | df1$lag.acf[2] <- 0 18 | df1$lag.acf.cumsum <- cumsum((df1$lag.acf)^2) 19 | df1$acfstd <- sqrt(1/N * (1 + 2 * df1$lag.acf.cumsum)) 20 | df1$acfstd[1] <- 0 21 | df1 <- select(df1, lag, acf, acfstd) 22 | 23 | list.pacf <- acf(data, lag.max = lag.max, type = "partial", plot = FALSE) 24 | df2 <- data.frame(lag = list.pacf$lag,pacf = list.pacf$acf) 25 | df2$pacfstd <- sqrt(1/N) 26 | 27 | if(large.sample.size == TRUE) { 28 | plot.acf <- ggplot(data = df1, aes(x = lag, y = acf)) + 29 | geom_area(aes(x = lag, y = qnorm((1+ci)/2)*acfstd), fill = "#B9CFE7") + 30 | geom_area(aes(x = lag, y = -qnorm((1+ci)/2)*acfstd), fill = "#B9CFE7") + 31 | geom_col(fill = "#4373B6", width = 0.7) + 32 | scale_x_continuous(breaks = seq(0,max(df1$lag),6)) + 33 | scale_y_continuous(name = element_blank(), 34 | limits = c(min(df1$acf,df2$pacf),1)) + 35 | ggtitle("ACF") + 36 | theme_bw() 37 | 38 | plot.pacf <- ggplot(data = df2, aes(x = lag, y = pacf)) + 39 | geom_area(aes(x = lag, y = qnorm((1+ci)/2)*pacfstd), fill = "#B9CFE7") + 40 | geom_area(aes(x = lag, y = -qnorm((1+ci)/2)*pacfstd), fill = "#B9CFE7") + 41 | geom_col(fill = "#4373B6", width = 0.7) + 42 | scale_x_continuous(breaks = seq(0,max(df2$lag, na.rm = TRUE),6)) + 43 | scale_y_continuous(name = element_blank(), 44 | limits = c(min(df1$acf,df2$pacf),1)) + 45 | ggtitle("PACF") + 46 | theme_bw() 47 | } 48 | else { 49 | plot.acf <- ggplot(data = df1, aes(x = lag, y = acf)) + 50 | geom_col(fill = "#4373B6", width = 0.7) + 51 | geom_hline(yintercept = qnorm((1+ci)/2)/sqrt(N), 52 | colour = "sandybrown", 53 | linetype = "dashed") + 54 | geom_hline(yintercept = - qnorm((1+ci)/2)/sqrt(N), 55 | colour = "sandybrown", 56 | linetype = "dashed") + 57 | scale_x_continuous(breaks = seq(0,max(df1$lag),6)) + 58 | scale_y_continuous(name = element_blank(), 59 | limits = c(min(df1$acf,df2$pacf),1)) + 60 | ggtitle("ACF") + 61 | theme_bw() 62 | 63 | plot.pacf <- ggplot(data = df2, aes(x = lag, y = pacf)) + 64 | geom_col(fill = "#4373B6", width = 0.7) + 65 | geom_hline(yintercept = qnorm((1+ci)/2)/sqrt(N), 66 | colour = "sandybrown", 67 | linetype = "dashed") + 68 | geom_hline(yintercept = - qnorm((1+ci)/2)/sqrt(N), 69 | colour = "sandybrown", 70 | linetype = "dashed") + 71 | scale_x_continuous(breaks = seq(0,max(df2$lag, na.rm = TRUE),6)) + 72 | scale_y_continuous(name = element_blank(), 73 | limits = c(min(df1$acf,df2$pacf),1)) + 74 | ggtitle("PACF") + 75 | theme_bw() 76 | } 77 | cowplot::plot_grid(plot.acf, plot.pacf, nrow = numofrow) 78 | } 79 | -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/create figure wi label-1.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/create figure wi label-1.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/plot pred graph-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/plot pred graph-1.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/setup-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/setup-1.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/show two graphs-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/show two graphs-1.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/extra/maschools_report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/extra/maschools_report.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/hotels_analysis.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/hotels_analysis.pdf -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 20: Spatial data visualization 2 | 3 | ## Motivation 4 | 5 | Visualizing data spatially allows us to gain insights into what is going on beyond our own bubble. Aside from being great visuals that immediately engage audiences, map data visualizations provide critical context for the metrics. Combining geospatial information with data creates a greater scope of understanding. Some benefits of using maps in your data visualization include: 6 | 7 | 1. A greater ability to understand the distribution of your variable across the city, state, country, or world 8 | 2. The ability to compare the activity across several locations at a glance 9 | 3. More intuitive decision making for company leaders 10 | 4. 
Contextualizing your data in the real world 11 | 12 | 13 | There is lots of room for creativity when making map dashboards because there are numerous ways to convey information with this kind of visualization. In R, we map geographical regions colored, shaded, or graded according to some variable. They are visually striking, especially when the spatial units of the map are familiar entities. 14 | 15 | | Life expectancy map | Hotel prices in cities | 16 | |-------------------------|-------------------------| 17 | | ![alt text 1](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/output/lifeexpectancy_world.png) | ![alt text 2](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/output/heu_prices.png) | 18 | 19 | 20 | ## This lecture 21 | 22 | This lecture introduces spatial data visualization using maps. During the lecture, students learn how to use the `maps` package, which offers built-in maps, with the [worldbank-lifeexpectancy](https://gabors-data-analysis.com/datasets/#worldbank-lifeexpectancy) data. Plotting the raw life expectancy at birth on a world map is already a powerful tool, but students will also learn how to show deviations from the expected value given by the regression model. In the second part, students import raw `shp` files with auxiliary files, which contain the maps of London boroughs and Vienna districts. With the [hotels-europe](https://gabors-data-analysis.com/datasets/#hotels-europe) dataset, the average price for each unit on the map is shown. 23 | 24 | Case studies used during the lecture: 25 | - [Chapter 08, B: How is life expectancy related to the average income of a country?](https://gabors-data-analysis.com/casestudies/#ch08b-how-is-life-expectancy-related-to-the-average-income-of-a-country) 26 | - [Chapter 03, B: Comparing hotel prices in Europe: Vienna vs London](https://gabors-data-analysis.com/casestudies/#ch03b-comparing-hotel-prices-in-europe-vienna-vs-london) 27 | 28 | ## Learning outcomes 29 | After successfully completing [`visualize_spatial.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes/visualize_spatial.R), students should be able to: 30 | 31 | - Part I 32 | - Use the `maps` package to import a world map 33 | - Understand how `geom_polygon` works 34 | - Shape the look of the map with `coord_equal` or `coord_map` 35 | - Use `theme_map()` 36 | - Use different coloring with `scale_fill_gradient` 37 | - Match different data tables to be able to plot a map 38 | - Use custom values as a filler on the map, based on the life-expectancy case study 39 | - Part II 40 | - Use the `rgdal` package with the `readOGR` function to import `shp` files and other needed auxiliary files such as `shx` and `dbf` 41 | - Convert an `S4 object` to a tibble and format it such that it can be used with `ggplot2` 42 | - Use `geom_path` to color the edges of the map 43 | - Manipulate the map to show only inner-London boroughs 44 | - Add (borough or district) names to a map with `aggregate` and `geom_text` 45 | - Control the limits of legend colors with `scale_fill_gradientn()` 46 | - Use nice color maps with the `wesanderson` package 47 | - Task for Vienna: replicate the same as for London 48 | - Use `ggarrange` with a common legend and add a common title with `annotate_figure()` 49 | 50 | ## Lecture Time 51 | 52 | Ideal overall time: **40-60 mins**.
53 | 54 | Going through [`visualize_spatial.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes/visualize_spatial.R) takes around 20-40 minutes. Solving the tasks takes the remaining 20-40 minutes, as there are two long tasks. 55 | 56 | 57 | ## Homework 58 | 59 | *Type*: quick practice, approx 10 mins 60 | 61 | Get countries' GDP growth rates with the `WDI` package. Plot the values on a world map. 62 | 63 | 64 | ## Further material 65 | 66 | - This lecture is based on [Kieran Healy: Data Visualization, Chapter 7](https://socviz.co/maps.html#maps). Check it out for more content. 67 | - Great content on (advanced) spatial data analysis can be found in [Edzer Pebesma, Roger Bivand: Spatial Data Science with applications in R](https://keen-swartz-3146c4.netlify.app/); specifically, [a blog post](https://r-spatial.org/r/2018/10/25/ggplot2-sf.html) related to this book may be interesting. 68 | 69 | ## Folder structure 70 | 71 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 72 | - [`visualize_spatial.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes/visualize_spatial.R) is the main material for this lecture. 73 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/complete_codes) includes the code with solutions for [`visualize_spatial.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes/visualize_spatial.R) as [`visualize_spatial_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/complete_codes/visualize_spatial_fin.R) 74 | - [data_map](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture20-basic-spatial-vizz/data_map) includes raw map data 75 | - [London boroughs](https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london): `London_Borough_Excluding_MHW.dbf`, `London_Borough_Excluding_MHW.shp`, `London_Borough_Excluding_MHW.shx` 76 | - [Vienna boroughs](https://www.data.gv.at/katalog/dataset/stadt-wien_bezirksgrenzenwien): `BEZIRKSGRENZEOGDPolygon.dbf`, `BEZIRKSGRENZEOGDPolygon.shp`, `BEZIRKSGRENZEOGDPolygon.shx` 77 | -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.dbf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.dbf -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.shp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.shp -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.shx: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.shx -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.dbf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.dbf -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.shp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.shp -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.shx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.shx -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/output/heu_prices.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/output/heu_prices.png -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/output/lifeexpectancy_world.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/output/lifeexpectancy_world.png -------------------------------------------------------------------------------- /lecture21-cross-validation/README.md: 
-------------------------------------------------------------------------------- 1 | # Lecture 21: Cross-validating linear models 2 | 3 | ## Motivation 4 | 5 | You have a car that you want to sell in the near future. You want to know what price you can expect if you were to sell it. You may also want to know what you could expect if you were to wait one more year and sell your car then. You have data on used cars with their age and other features, and you can predict price with several kinds of regression models with different right-hand-side variables in different functional forms. How should you select the regression model that would give the best prediction? 6 | 7 | We introduce point prediction versus interval prediction; we discuss the components of prediction error and how to find the best prediction model that will likely produce the best fit (smallest prediction error) in the live data, using observations in the original data. We introduce loss functions in general and mean squared error (MSE) and its square root (RMSE) in particular, to evaluate predictions. We discuss three ways of finding the best predictor model: using all data and the Bayesian Information Criterion (BIC) as the measure of fit, using training–test splitting of the data, and using k-fold cross-validation, which is an improvement on the training–test split. 8 | 9 | ## This lecture 10 | 11 | This lecture refreshes methods for data cleaning and refactoring as well as some basic feature engineering practices. Once the data is set, multiple competing regressions are run and compared via BIC and k-fold cross-validation. Cross-validation is carried out with the `caret` package as well. After the best-performing model is chosen (by RMSE), prediction performance and the associated risks are discussed. In the case when a log-transformed outcome is used in the model, transforming predictions back to levels and evaluating prediction performance are also covered. 12 | 13 | Case studies used: 14 | - [Chapter 13, A: Predicting used car value with linear regressions](https://gabors-data-analysis.com/casestudies/#ch13a-predicting-used-car-value-with-linear-regressions) 15 | - [Chapter 14, A: Predicting used car value: log prices](https://gabors-data-analysis.com/casestudies/#ch14a-predicting-used-car-value-log-prices) 16 | 17 | ## Learning outcomes 18 | After successfully completing [`crossvalidation_usedcars.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture21-cross-validation/crossvalidation_usedcars.R), students should be able to: 19 | 20 | - Clean and prepare data for modeling 21 | - Decide on functional forms and do meaningful variable transformations 22 | - Run multiple regressions and compare performance based on BIC 23 | - Carry out k-fold cross-validation with the `caret` package for different regression models 24 | - Compare the prediction performance of the models 25 | - Understand what happens if a log-transformed outcome is used 26 | - convert predictions back to levels 27 | - compare prediction performance with the other (non-log) models 28 | 29 | ## Dataset used 30 | 31 | - [used-cars](https://gabors-data-analysis.com/datasets/#used-cars) 32 | 33 | ## Lecture Time 34 | 35 | Ideal overall time: **100 mins**.
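As a quick preview of the k-fold workflow named above, here is a minimal hedged sketch with `caret` (the data frame `used_cars` and its columns `price`, `age`, and `odometer` are hypothetical placeholders):

```r
library(caret)

set.seed(42)
# 5-fold cross-validation of a simple linear specification
cv_fit <- train(
  price ~ age + I(age^2) + odometer,
  data      = used_cars,
  method    = "lm",
  trControl = trainControl(method = "cv", number = 5)
)

cv_fit$results$RMSE  # cross-validated RMSE, comparable across candidate models
```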
36 | 37 | 38 | ## Further material 39 | 40 | - This lecture is a modified and combined version of the [`ch13-used-cars.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch13-used-cars-reg/ch13-used-cars.R) and [`ch14-used-cars-log.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch14-used-cars-log/ch14-used-cars-log.R) codes from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 41 | 42 | -------------------------------------------------------------------------------- /lecture22-lasso/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 22: Prediction with LASSO 2 | 3 | ## Motivation 4 | 5 | You want to predict the rental prices of apartments in a big city using their location, size, amenities, and other features. You have access to data on many apartments with many variables. You know how to select the best regression model for prediction from several candidate models. But how should you specify those candidate models, to begin with? In particular, which of the many variables should they include, in what functional forms, and in what interactions? More generally, how can you make sure that the candidates include truly good predictive models? 6 | 7 | How should we specify the regression models? In particular, when we have many candidate predictor variables, how should we select from them, and how should we decide on their functional forms? 8 | 9 | ## This lecture 10 | 11 | This lecture discusses how to build regression models for prediction and how to evaluate the predictions they produce. We discuss how to select 12 | variables out of a large pool of candidate x variables, and how to decide on their functional forms. We introduce LASSO via `glmnet`, an algorithm that can help with variable selection. With respect to evaluating predictions, we discuss why we need a holdout sample for evaluation that is separate from all of the rest of the data we use for model building and selection. 13 | 14 | Case study: 15 | - [Chapter 14, B: Predicting AirBnB apartment prices: selecting a regression model](https://gabors-data-analysis.com/casestudies/#ch14b-predicting-airbnb-apartment-prices-selecting-a-regression-model) 16 | 17 | ## Learning outcomes 18 | After successfully completing [`lasso_aribnb.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture22-lasso/codes/lasso_aribnb.R), students should be able to: 19 | 20 | - Carry out data cleaning and refactoring to prepare for LASSO-type modelling 21 | - Do basic feature engineering for LASSO 22 | - Understand the three-sample approach: 23 | - train and test samples to select the model (cross-validation for tuning parameters) 24 | - a hold-out sample to evaluate model prediction performance 25 | - Carry out model selection with 26 | - (linear) regression models 27 | - LASSO, RIDGE, and Elastic Net via the `glmnet` package 28 | - Run model diagnostics 29 | - performance measure(s) on the hold-out set to evaluate competing models 30 | - stability of the prediction 31 | - specific diagnostic figures for LASSO 32 | 33 | ## Dataset used 34 | 35 | - [airbnb](https://gabors-data-analysis.com/datasets/#airbnb) 36 | 37 | ## Lecture Time 38 | 39 | Ideal overall time: **100 mins**.
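For a concrete sense of the `glmnet` interface used in this lecture, here is a minimal hedged sketch (the design matrix `X`, outcome `y`, and hold-out matrix `X_holdout` are hypothetical placeholders):

```r
library(glmnet)

set.seed(42)
# alpha = 1 is LASSO, alpha = 0 is RIDGE, values in between give Elastic Net
lasso_cv <- cv.glmnet(x = X, y = y, alpha = 1, nfolds = 10)

coef(lasso_cv, s = "lambda.min")                     # coefficients at the best lambda
pred_holdout <- predict(lasso_cv, newx = X_holdout,  # hold-out predictions for
                        s = "lambda.min")            # final model evaluation
```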
40 | 41 | 42 | ## Further material 43 | 44 | - This lecture is a modified version of [`Ch16-airbnb-random-forest.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch16-airbnb-random-forest/Ch16-airbnb-random-forest.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 45 | 46 | -------------------------------------------------------------------------------- /lecture22-lasso/codes/ch14_aux_fncs.R: -------------------------------------------------------------------------------- 1 | price_diff_by_variables2 <- function(df, factor_var, dummy_var, factor_lab, dummy_lab){ 2 | # Looking for interactions. 3 | # It is a function that takes 5 arguments: 1) your data frame, 4 | # 2) the factor variable (like room_type), 5 | # 3) the dummy variable you are interested in (like TV), 4) a label for the factor variable, and 5) a label for the dummy variable 6 | 7 | # Process your data frame and make a new dataframe which contains the stats 8 | factor_var <- as.name(factor_var) 9 | dummy_var <- as.name(dummy_var) 10 | 11 | stats <- df %>% 12 | group_by(!!factor_var, !!dummy_var) %>% 13 | dplyr::summarize(Mean = mean(price, na.rm=TRUE), 14 | se = sd(price)/sqrt(n())) 15 | 16 | stats[,2] <- lapply(stats[,2], factor) 17 | 18 | ggplot(stats, aes_string(colnames(stats)[1], colnames(stats)[3], fill = colnames(stats)[2]))+ 19 | geom_bar(stat='identity', position = position_dodge(width=0.9), alpha=0.8)+ 20 | geom_errorbar(aes(ymin=Mean-(1.96*se),ymax=Mean+(1.96*se)), 21 | position=position_dodge(width = 0.9), width = 0.25)+ 22 | scale_color_manual(name=dummy_lab, 23 | values=c('red','blue')) + 24 | scale_fill_manual(name=dummy_lab, 25 | values=c('red','blue')) + 26 | ylab('Mean Price')+ 27 | xlab(factor_lab) + 28 | theme_bw()+ 29 | theme(panel.grid.major=element_blank(), 30 | panel.grid.minor=element_blank(), 31 | panel.border=element_blank(), 32 | axis.line=element_line(), 33 | legend.position = "top", 34 | #legend.position = c(0.7, 0.9), 35 | legend.box = "vertical", 36 | legend.text = element_text(size = 5), 37 | legend.title = element_text(size = 5, face = "bold"), 38 | legend.key.size = unit(x = 0.4, units = "cm") 39 | ) 40 | } -------------------------------------------------------------------------------- /lecture23-regression-tree/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 23: Prediction with regression trees (CART) 2 | 3 | ## Motivation 4 | 5 | You want to predict the price of used cars as a function of their age and other features. You want to specify a model that includes the most important interactions and nonlinearities of those features, but you don’t know how to start. In particular, you are worried that you can’t start with a very complex regression model and use LASSO or some other method to simplify it because there are way too many potential interactions. Is there an alternative approach to regression that includes the most important interactions without you having to specify them? 6 | 7 | To carry out the prediction of used car prices, we show how to use the regression tree, an alternative to linear regressions that is designed to build a model with the most important interactions and nonlinearities for a prediction. However, the regression tree you build appears to overfit your original data. How can you build a regression tree model that is less prone to overfitting the original data and can thus give a better prediction in the live data?
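Before the lecture details below, here is a minimal preview sketch of growing and pruning a tree with `rpart` (the data frame `used_cars` and its columns `price`, `age`, and `odometer` are hypothetical placeholders):

```r
library(rpart)
library(rpart.plot)

# Grow a regression tree with explicit stopping criteria
tree <- rpart(
  price ~ age + odometer,
  data    = used_cars,
  control = rpart.control(cp = 0.001, minbucket = 20)
)

printcp(tree)  # complexity-parameter table with cross-validated error

# Prune back to the complexity parameter with the smallest CV error
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)

rpart.plot(pruned)  # visualize the pruned tree
```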
8 | 9 | 10 | ## This lecture 11 | 12 | This lecture introduces the regression tree via `rpart`, an alternative to linear regression for prediction purposes that can find the most important predictor variables and their interactions and can approximate any functional form automatically. Regression trees split the data into small bins (subsamples) by the value of the x variables. For a quantitative y, they use the average y value in those small sets to predict y. We introduce the regression tree model and the most widely used algorithm to build a regression tree model. Somewhat confusingly, both the model and the algorithm are called CART (for classification and regression trees), but we reserve this name for the algorithm. We show that a regression tree is an intuitively appealing method to model nonlinearities and interactions among the x variables, but it is rarely used for prediction in itself because it is prone to overfit the original data. Instead, the regression tree forms the basic element of very powerful prediction methods that we’ll cover in the next seminar. 13 | 14 | Case study: 15 | - [Chapter 15, A: Predicting used car value with regression trees](https://gabors-data-analysis.com/casestudies/#ch15a-predicting-used-car-value-with-regression-trees) 16 | 17 | ## Learning outcomes 18 | After successfully completing [`cart_usedcars.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture23-regression-tree/cart_usedcars.R), students should be able to: 19 | 20 | - Understand how the regression tree works 21 | - Estimate a regression tree via the `rpart` package through `caret` 22 | - Visualize regression tree(s) in multiple ways (with `rpart.plot` and with `ggplot2`) 23 | - Set stopping criteria for CART 24 | - depth or level of the tree 25 | - number of leaves 26 | - minimum fit-measure increase required by a split 27 | - Prune a large tree 28 | - find the optimal complexity parameter (also known as the pruning parameter) 29 | - Create a variable importance plot 30 | - Evaluate predictions 31 | - comparing trees 32 | - comparing trees vs linear regressions 33 | 34 | ## Dataset used 35 | 36 | - [used-cars](https://gabors-data-analysis.com/datasets/#used-cars) 37 | 38 | ## Lecture Time 39 | 40 | Ideal overall time: **100 mins**. 41 | 42 | 43 | ## Further material 44 | 45 | - This lecture is a modified version of [`ch15-used-cars-cart.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch15-used-cars-cart/ch15-used-cars-cart.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 46 | 47 | -------------------------------------------------------------------------------- /lecture24-random-forest/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 24: Predicting with Random Forest and Boosting 2 | 3 | ## Motivation 4 | 5 | You need to predict rental prices of apartments using various features. You don’t know how the various features may interact with each other in determining price, so you would like to use a regression tree. But you want to build a model that gives the best possible prediction, better than a single tree. What methods are available that keep the advantage of regression trees but give a better prediction? How should you choose from among those methods? 6 | 7 | How can you grow a random forest, the most widely used tree-based method, to carry out the prediction of apartment rental prices? 
What details do you have to decide on, how should you decide on them, and how can you evaluate the results? 8 | 9 | A regression tree can capture complicated interactions and nonlinearities for predicting a quantitative y variable, but it is prone to overfit the original data, even after appropriate pruning. It turns out, however, that combining multiple regression trees grown on the same data can yield a much better prediction. Such methods are called ensemble methods. There are many ensemble methods based on regression trees, and some are known to produce very good predictions. But these methods are rather complex, and some of them are not straightforward to use. 10 | 11 | ## This lecture 12 | 13 | This lecture introduces two ensemble methods based on regression trees: random forest and boosting. We start by introducing the main idea of ensemble methods: combining results from many imperfect models can lead to a much better prediction than a single model that we try to build to perfection. Of the two methods, we discuss the random forest (RF), via the `ranger` package, in more detail. The random forest is perhaps the most frequently used method to predict a quantitative y variable, both because of its excellent predictive performance and because it is relatively simple to use. Even more than with a single tree, it is hard to understand the underlying patterns of association between y and x that drive the predictions of ensemble methods. We discuss some diagnostic tools that can help with that: variable importance plots, partial dependence plots, and examining the quality of predictions in subgroups. Finally, we show another method: boosting, an alternative approach to making predictions based on an ensemble of regression trees, via `gbm`. 14 | 15 | Note that some of the methods used take a considerable amount of time to run on a simple PC, thus pre-run model results are also uploaded to the repository to speed up the seminar. 16 | 17 | Case study: 18 | - [Chapter 16, A: Predicting apartment prices with random forest](https://gabors-data-analysis.com/casestudies/#ch16a-predicting-apartment-prices-with-random-forest) 19 | 20 | ## Learning outcomes 21 | 22 | The lecturer/students should be aware that there is a separate file for this seminar: [`airbnb_prepare.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture24-random-forest/codes/airbnb_prepare.R), overviewing only the data cleaning and feature engineering process. Understanding how to prepare the data for these methods is extremely important and powerful, as without it data analysts do garbage-in garbage-out analysis... Usually, due to time constraints, this part is not covered in the seminar; students are asked to cover it beforehand.
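For a feel of the tuning workflow this lecture covers, here is a minimal hedged sketch of a random forest via `ranger` through `caret` (the data frame `airbnb_clean` and the specific tuning values are hypothetical placeholders):

```r
library(caret)

set.seed(42)
rf_fit <- train(
  price ~ .,
  data       = airbnb_clean,
  method     = "ranger",
  trControl  = trainControl(method = "cv", number = 5),
  tuneGrid   = expand.grid(
    mtry          = c(5, 7, 9),   # variables tried at each split
    splitrule     = "variance",   # standard rule for regression
    min.node.size = c(5, 10)      # controls leaf size / tree depth
  ),
  importance = "impurity"         # passed to ranger for variable importance
)

rf_fit$bestTune  # tuning combination with the lowest CV RMSE
```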
23 | 24 | After successfully completing [`randomforest_airbnb.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture24-random-forest/codes/randomforest_airbnb.R), students should be able to: 25 | 26 | - Estimate a random forest via `ranger` 27 | - understand the `mtry` parameter and other setup 28 | - use the autotune option 29 | - Understand the random forest's output 30 | - variable importance plots: all, top 10, and grouped variables (typically factors) 31 | - partial dependence plots 32 | - sub-sample analysis for understanding prediction performance across groups 33 | - Run a 'Horse-Race' prediction competition with: 34 | - Linear regression (OLS) 35 | - LASSO 36 | - Regression tree with CART 37 | - Random forest 38 | - GBM model 39 | 40 | ## Dataset used 41 | 42 | - [airbnb](https://gabors-data-analysis.com/datasets/#airbnb) 43 | 44 | ## Lecture Time 45 | 46 | Ideal overall time: **100 mins**. 47 | 48 | 49 | ## Further material 50 | 51 | - This lecture is a modified version of [Ch16-airbnb-random-forest.R](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch16-airbnb-random-forest/Ch16-airbnb-random-forest.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 52 | 53 | -------------------------------------------------------------------------------- /lecture24-random-forest/data/gbm_model.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture24-random-forest/data/gbm_model.RData -------------------------------------------------------------------------------- /lecture24-random-forest/data/rf_model_1.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture24-random-forest/data/rf_model_1.RData -------------------------------------------------------------------------------- /lecture24-random-forest/data/rf_model_2.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture24-random-forest/data/rf_model_2.RData -------------------------------------------------------------------------------- /lecture25-classification-wML/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 25: Prediction and classification of binary outcome with ML tools 2 | 3 | ## Motivation 4 | 5 | Predicting whether people will repay their loans or default on them is important to a bank that sells such loans. Should the bank predict the default probability for applicants? Or, rather, should it classify applicants into prospective defaulters and prospective repayers? And how are the two kinds of predictions related? In particular, can the bank use probability predictions to classify applicants into defaulters and repayers, in a way that takes into account the bank’s costs when a default happens and its costs when it forgoes a good applicant? 6 | 7 | Many companies have relationships with other companies, as suppliers or clients. Whether those other companies stay in business in the future is an important question for them. 
You have rich data on many companies across the years that allows you to see which companies stayed in business and which companies exited, and relate that to various features of the companies. How should you use that data to predict the probability of exit for each company? How should you predict which companies will exit and which will stay in business in the future? 8 | 9 | In the previous seminars we covered the logic of predictive analytics and its most important steps, and we introduced specific methods to predict a quantitative y variable. But sometimes our y variable is not quantitative. The most important case is when y is binary: y = 1 or y = 0. How can we predict such a variable? 10 | 11 | ## This lecture 12 | 13 | This lecture introduces the framework and methods of probability prediction and classification analysis for binary y variables. Probability prediction means predicting the probability that y = 1, with the help of the predictor variables. Classification means predicting the binary y variable itself, with the help of the predictor variables: putting each observation in one of the y categories, also called classes. We build on what we know about probability models and the basics of probability prediction from [lecture16-binary-models](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture16-binary-models). In this seminar, we put that into the framework of predictive analytics to arrive at the best probability model for prediction purposes and to evaluate its performance. We then discuss how we can turn probability predictions into classification with the help of a classification threshold and how we should use a loss function to find the optimal threshold. We discuss how to evaluate a classification by making use of a confusion table and expected loss. We introduce the ROC curve, which illustrates the trade-off of selecting different classification threshold values. We discuss how we can use random forests based on classification trees. 14 | 15 | Case study: 16 | - [Chapter 17, A: Predicting firm exit: probability and classification](https://gabors-data-analysis.com/casestudies/#ch17a-predicting-firm-exit-probability-and-classification) 17 | 18 | ## Learning outcomes 19 | 20 | The lecturer/students should be aware that there is a separate file at the official case studies repository for this seminar: [`ch17-firm-exit-data-prep.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch17-predicting-firm-exit/ch17-firm-exit-data-prep.R), overviewing only the data cleaning and feature engineering process for binary outcomes. Understanding how to prepare the data for these methods is extremely important and powerful, as without it data analysts do garbage-in garbage-out analysis... Usually, due to time constraints, this part is not covered in the seminar; students are asked to cover it beforehand.
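As a compact illustration of thresholds, expected loss, and the ROC curve discussed above, here is a minimal hedged sketch (the observed 0/1 outcome `y`, the predicted probabilities `p`, and the assumed 10:1 cost ratio of false negatives to false positives are hypothetical):

```r
library(pROC)

roc_obj <- roc(response = y, predictor = p, quiet = TRUE)
auc(roc_obj)  # area under the ROC curve

# Search for the classification threshold minimizing expected loss,
# assuming a false negative costs 10 units and a false positive 1 unit
thresholds <- seq(0.05, 0.95, by = 0.01)
exp_loss <- sapply(thresholds, function(t) {
  fp <- sum(p >  t & y == 0)  # false positives at threshold t
  fn <- sum(p <= t & y == 1)  # false negatives at threshold t
  (1 * fp + 10 * fn) / length(y)
})
thresholds[which.min(exp_loss)]  # loss-minimizing threshold
```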
21 | 22 | After successfully completing [`classification_wML.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture25-classification-wML/codes/classification_wML.R), students should be able to: 23 | 24 | - Understand what winsorizing is and how it helps 25 | - Estimate basic linear models for predicting probabilities 26 | - simple linear probability model (review) 27 | - simple logistic model (logit, review) 28 | - Carry out cross-validation with a logit model (via `caret`) 29 | - Estimate LASSO with a logit model (via `glmnet` and `caret`) 30 | - Evaluate model predictions 31 | - calibration curve (review) 32 | - confusion matrix 33 | - ROC curve and AUC (Area Under the Curve) 34 | - model comparison based on RMSE and AUC 35 | - Use a user-defined loss function 36 | - find the optimal threshold based on a self-defined loss function 37 | - show the ROC curve and the optimal point 38 | - show loss-function values for different points on the ROC 39 | - Use CART and Random Forest 40 | - modelling probabilities 41 | - Random Forest with majority voting as an often misunderstood method, especially with a user-defined loss function 42 | 43 | ## Dataset used 44 | 45 | - [bisnode-firms](https://gabors-data-analysis.com/datasets/#bisnode-firms) 46 | 47 | ## Lecture Time 48 | 49 | Ideal overall time: **100 mins**. 50 | 51 | 52 | ## Further material 53 | 54 | - This lecture is a modified version of [`ch17-predicting-firm-exit.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch17-predicting-firm-exit/ch17-predicting-firm-exit.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 55 | 56 | -------------------------------------------------------------------------------- /lecture25-classification-wML/codes/auxfuncs_binarywML.R: -------------------------------------------------------------------------------- 1 | ############ 2 | # Helper functions for Bisnode analysis 3 | 4 | twoClassSummaryExtended <- function(data, lev = NULL, model = NULL) 5 | { 6 | lvls <- levels(data$obs) 7 | rmse <- sqrt(mean((data[, lvls[1]] - ifelse(data$obs == lev[2], 0, 1))^2)) 8 | c(defaultSummary(data, lev, model), "RMSE" = rmse) 9 | } 10 | 11 | 12 | createRocPlot <- function(r, file_name, myheight_small = 5.625, mywidth_small = 7.5) { 13 | all_coords <- coords(r, x="all", ret="all", transpose = FALSE) 14 | 15 | roc_plot <- ggplot(data = all_coords, aes(x = fpr, y = tpr)) + 16 | geom_line(color='red', size = 0.7) + 17 | geom_area(aes(fill = 'green', alpha=0.4), alpha = 0.3, position = 'identity', color = 'red') + 18 | scale_fill_viridis(discrete = TRUE, begin=0.6, alpha=0.5, guide = "none") + 19 | xlab("False Positive Rate (1-Specificity)") + 20 | ylab("True Positive Rate (Sensitivity)") + 21 | geom_abline(intercept = 0, slope = 1, linetype = "dotted", col = "black") + 22 | scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, .1), expand = c(0, 0.01)) + 23 | scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1), expand = c(0.01, 0)) + 24 | theme_bw() 25 | 26 | roc_plot 27 | } 28 | 29 | 30 | create_calibration_plot <- function(data, prob_var, actual_var, y_lab = "Actual event probability", n_bins = 10, breaks = NULL) { 31 | 32 | if (is.null(breaks)) { 33 | breaks <- seq(0,1,length.out = n_bins + 1) 34 | } 35 | 36 | binned_data <- data %>% 37 | mutate( 38 | prob_bin = cut(!!as.name(prob_var), 39 | breaks = breaks, 40 | include.lowest = TRUE) 41 | ) %>% 42 | group_by(prob_bin, .drop=FALSE) %>% 43 | summarise(mean_prob = mean(!!as.name(prob_var)), mean_actual = mean(!!as.name(actual_var)), n = n()) 44 | 45 | p <- 
# Plots the expected loss across classification thresholds; note that it
# relies on FP and FN (the unit costs of a false positive and a false
# negative) being defined in the calling environment
createLossPlot <- function(r, best_coords, myheight_small = 5.625, mywidth_small = 7.5) {
  t <- best_coords$threshold[1]
  sp <- best_coords$specificity[1]
  se <- best_coords$sensitivity[1]
  n <- rowSums(best_coords[c("tn", "tp", "fn", "fp")])[1]

  all_coords <- coords(r, x = "all", ret = "all", transpose = FALSE)
  all_coords <- all_coords %>%
    mutate(loss = (fp * FP + fn * FN) / n)
  l <- all_coords[all_coords$threshold == t, "loss"]

  loss_plot <- ggplot(data = all_coords, aes(x = threshold, y = loss)) +
    geom_line(color = 'red', size = 0.7) +
    scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
    geom_vline(xintercept = t, color = 'blue') +
    annotate(geom = "text", x = t, y = min(all_coords$loss),
             label = paste0("best threshold: ", round(t, 2)),
             colour = 'blue', angle = 90, vjust = -1, hjust = -0.5, size = 7) +
    annotate(geom = "text", x = t, y = l,
             label = round(l, 2), hjust = -0.3, size = 7) +
    theme_bw()

  loss_plot
}


# Draws the ROC curve (sensitivity against reversed specificity) and marks
# the optimal point implied by the loss function; `file_name` and the size
# arguments are kept for interface compatibility but are not used here
createRocPlotWithOptimal <- function(r, best_coords, file_name, myheight_small = 5.625, mywidth_small = 7.5) {

  all_coords <- coords(r, x = "all", ret = "all", transpose = FALSE)
  t <- best_coords$threshold[1]
  sp <- best_coords$specificity[1]
  se <- best_coords$sensitivity[1]

  roc_plot <- ggplot(data = all_coords, aes(x = specificity, y = sensitivity)) +
    geom_line(color = 'red', size = 0.7) +
    scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
    scale_x_reverse(breaks = seq(0, 1, by = 0.1)) +
    geom_point(aes(x = sp, y = se)) +
    annotate(geom = "text", x = sp, y = se,
             label = paste(round(sp, 2), round(se, 2), sep = ", "),
             hjust = 1, vjust = -1, size = 7) +
    xlab("False Positive Rate (1-Specificity)") +
    ylab("True Positive Rate (Sensitivity)") +
    theme_bw()

  roc_plot
}

--------------------------------------------------------------------------------
/lecture25-classification-wML/data/bisnode_firms_clean.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture25-classification-wML/data/bisnode_firms_clean.RData

--------------------------------------------------------------------------------
/lecture26-long-term-time-series-wML/README.md:
--------------------------------------------------------------------------------
# Lecture 26: Forecasting from Time Series Data I - ML methods for simple models

## Motivation

Your task is to predict the number of daily tickets sold for next year in a swimming pool in a large city. The swimming pool sells tickets through its sales terminal, which records all transactions. You aggregate that data to daily frequency. How should you use the information on daily sales to produce your forecast? In particular, how should you model trends, and how should you model seasonality by months of the year and days of the week, to produce the best prediction?


## This lecture

This lecture discusses forecasting: prediction from time series data for one or more time periods in the future. The focus of this chapter is forecasting future values of one variable by making use of past values of the same variable, and possibly other variables, too. We build on what we learned about time series regressions in [lecture18-timeseries-regression](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture18-timeseries-regression). We start with forecasts with a long horizon, which means many time periods into the future. Such forecasts use the information on trends, seasonality, and other long-term features of the time series.

Case study:
- [Chapter 18, A: Forecasting daily ticket sales for a swimming pool](https://gabors-data-analysis.com/casestudies/#ch18a-forecasting-daily-ticket-sales-for-a-swimming-pool)

## Learning outcomes
After successfully completing [`long_term_swimming.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture26-long-term-time-series-wML/long_term_swimming.R), students should be able to:

- Carry out data munging with time series (review)
- Add deterministic variables such as trends and yearly/monthly/weekly seasonality (a minimal sketch follows this list)
- Add deterministic variables with the `timeDate` package, such as holidays, weekdays, etc.
- Split samples with time series data
- Estimate simple linear models:
  - with a deterministic trend/seasonality and/or other deterministic variables (holidays, etc.)
- Cross-validate with time series data
- Use the `prophet` package
- Produce forecasts
- Compare models based on forecasting performance (RMSE)
- Represent model fit and forecasts graphically
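
As a taste of the deterministic-features approach, here is a minimal, self-contained sketch. It is an illustration, not the lecture code: the `daily` tibble, its column names, and the Poisson placeholder outcome are all made up.

```r
library(dplyr)
library(lubridate)

# Toy daily data; in the lecture the outcome is daily ticket sales
set.seed(2022)
daily <- tibble(
  date    = seq(as.Date("2010-01-01"), as.Date("2014-12-31"), by = "day"),
  tickets = rpois(length(date), lambda = 100)
)

daily <- daily %>%
  mutate(
    trend   = row_number(),                       # linear time trend
    month   = factor(month(date)),                # yearly seasonality
    weekday = factor(wday(date, week_start = 1))  # weekly seasonality
  )

# OLS with a trend and seasonal dummies -- the simplest long-horizon model
reg <- lm(tickets ~ trend + month + weekday, data = daily)
summary(reg)
```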
## Dataset used

- [swim-transactions](https://gabors-data-analysis.com/datasets/#swim-transactions)

## Lecture Time

Ideal overall time: **50-60 mins**.


## Further material

- This lecture is a modified version of [ch18-swimmingpool-predict.R](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch18-swimmingpool/ch18-swimmingpool-predict.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies).

--------------------------------------------------------------------------------
/lecture27-short-term-time-series-ARIMA-VAR/README.md:
--------------------------------------------------------------------------------
# Lecture 27: Forecasting from Time Series Data II - ARIMA and VAR models

## Motivation

Your task is to predict how house prices will move in a particular city over the next few months. You have monthly data on the house price index of the city, and you can collect monthly data on other variables that may be correlated with how house prices move. How should you use that data to forecast changes in house prices for the next few months? In particular, how should you use those other variables to help that forecast even though you don’t know their future values?

## This lecture

This lecture discusses forecasting: prediction from time series data for one or more time periods in the future. The focus of this chapter is forecasting future values of one variable by making use of past values of the same variable, and possibly other variables, too. We build on what we learned about time series regressions in [lecture18-timeseries-regression](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture18-timeseries-regression). We now turn to short-horizon forecasts that predict y for a few time periods ahead. These forecasts exploit the serial correlation of the time series of y, in addition to its long-term features. We introduce autoregression (AR) and ARIMA models via the `fpp3` package; these capture the patterns of serial correlation and can be used for short-horizon forecasting. We then turn to using other variables in forecasting and introduce vector autoregression (VAR) models, which help forecast the future values of the x variables that we use to forecast y. Finally, we discuss how to carry out cross-validation in forecasting, and the specific challenges and opportunities that the time series nature of our data creates for assessing external validity.

Case study:
- [Chapter 18, B: Forecasting a house price index](https://gabors-data-analysis.com/casestudies/#ch18b-forecasting-a-house-price-index)

## Learning outcomes
After successfully completing [`short_term_priceindex.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture27-short-term-time-series-ARIMA-VAR/short_term_priceindex.R), students should be able to:

- Decide whether the data needs to be transformed to stationarity
- Estimate ARIMA models with the `fpp3` package (a minimal sketch follows this list):
  - self-specified lags for the AR, I, and MA components
  - automatic lag selection
  - handling trend and seasonality within ARIMA
  - understanding the 'S' in SARIMA and why we do not use it in this course
- Cross-validate ARIMA models
- Work with vector autoregressive (VAR) models:
  - estimation and cross-validation
- Produce forecasts:
  - compare models based on forecast performance
  - check external validity on a longer horizon
- Use fan charts for assessing risks
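
As a taste of the `fpp3` workflow referenced above, here is a minimal sketch on a simulated monthly series; the toy data and model labels are illustrative, not the case-study code. VAR models follow the same `model()`/`forecast()` pattern (via `fable`'s `VAR()`).

```r
library(fpp3)  # meta-package: loads tsibble, fable, feasts, dplyr, ggplot2, ...

# Simulated monthly random walk standing in for a price index
set.seed(2022)
y <- tsibble(
  month = yearmonth("2000 Jan") + 0:119,
  value = cumsum(rnorm(120)),
  index = month
)

fit <- y %>%
  model(
    arima_auto = ARIMA(value),                # automatic (p, d, q) selection
    arima_011  = ARIMA(value ~ pdq(0, 1, 1))  # self-specified lags
  )

# 12-month-ahead forecasts with interval fans, plotted over the history
fit %>%
  forecast(h = 12) %>%
  autoplot(y)
```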
## Lecture Time

Ideal overall time: **50-80 mins**.


## Further material

- This lecture is a modified version of [`ch18-ts-pred-homeprices.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch18-case-shiller-la/ch18-ts-pred-homeprices.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies).

--------------------------------------------------------------------------------