├── .gitignore ├── LICENSE ├── README.md ├── common_issues ├── README.md ├── help_github_n_Rstudio.md └── help_rmarkdown.md ├── lecture00-intro ├── README.md ├── getting_started.Rmd ├── getting_started.md ├── getting_started_files │ └── figure-gfm │ │ └── unnamed-chunk-2-1.png ├── intro_to_R.R └── test_Rmarkdown.Rmd ├── lecture01-coding-basics ├── README.md ├── assignment_1.R ├── coding_basics.Rmd ├── coding_basics.md └── first_script.R ├── lecture02-data-imp-n-exp ├── README.md ├── complete_codes │ └── dataset_handling_fin.R ├── data │ └── hotels_vienna │ │ ├── clean │ │ ├── README.md │ │ ├── VARIABLES.xlsx │ │ └── hotels-vienna.csv │ │ ├── export │ │ └── export_here │ │ └── raw │ │ ├── hotelbookingdata.csv │ │ └── show_folder └── raw_codes │ └── dataset_handling.R ├── lecture03-tibbles ├── README.md ├── complete_codes │ └── intro_to_tibbles_fin.R ├── data │ ├── games.csv │ └── points.csv └── raw_codes │ └── intro_to_tibbles.R ├── lecture04-data-munging ├── README.md ├── complete_codes │ └── data_munging_fin.R └── raw_codes │ └── data_munging.R ├── lecture05-data-exploration ├── README.md ├── complete_codes │ └── data_exploration_fin.R └── raw_codes │ └── data_exploration.R ├── lecture06-rmarkdown101 ├── README.md ├── complete_codes │ ├── report_bpp_fin.Rmd │ ├── report_bpp_fin.html │ └── report_bpp_fin.pdf └── raw_codes │ └── report_bpp.Rmd ├── lecture07-ggplot-indepth ├── README.md ├── complete_codes │ └── ggplot_indepth_fin.R └── raw_codes │ ├── ggplot_indepth.R │ ├── homework_ggpplot_runfile.R │ ├── theme_RENAMEME.R │ └── theme_bluewhite.R ├── lecture08-conditionals ├── README.md ├── conditionals.R ├── conditionals.Rmd └── conditionals.md ├── lecture09-loops ├── README.md ├── loops.R ├── loops.Rmd └── loops.md ├── lecture10-random-numbers ├── README.md ├── random_numbers.R ├── random_numbers.Rmd ├── random_numbers.md └── random_numbers_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-3-1.png │ ├── unnamed-chunk-4-.gif │ ├── unnamed-chunk-4-1.gif │ ├── unnamed-chunk-6-1.png │ ├── unnamed-chunk-7-1.png │ └── unnamed-chunk-9-1.png ├── lecture11-functions ├── README.md ├── functions.R ├── functions.Rmd ├── functions.md └── functions_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-10-2.png │ ├── unnamed-chunk-10-3.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-11-2.png │ ├── unnamed-chunk-11-3.png │ ├── unnamed-chunk-9-1.png │ ├── unnamed-chunk-9-2.png │ └── unnamed-chunk-9-3.png ├── lecture12-intro-to-regression ├── README.md ├── complete_codes │ ├── hotels_intro_to_regression_fin.R │ └── hotels_vienna_regression_fin_w_logs.R └── raw_codes │ └── hotels_intro_to_regression.R ├── lecture13-feature-engineering ├── README.md ├── complete_codes │ ├── feature_engineering_part_II_fin.R │ └── feature_engineering_part_I_fin.R └── raw_codes │ ├── feature_engineering_part_I.R │ └── feature_engineering_part_II.R ├── lecture14-simple-regression ├── README.md ├── complete_codes │ ├── life_exp_analysis_fin.R │ ├── life_exp_clean.R │ └── life_exp_getdata_fin.R ├── data │ ├── clean │ │ └── WDI_lifeexp_clean.csv │ └── raw │ │ └── WDI_lifeexp_raw.csv └── raw_codes │ ├── life_exp_analysis.R │ └── life_exp_getdata.R ├── lecture15-advanced-linear-regression ├── README.md ├── complete_codes │ └── hotels_advanced_regression_fin.R └── raw_codes │ └── hotels_advanced_regression.R ├── lecture16-binary-models ├── README.md ├── complete_codes │ └── binary_models_fin.R └── raw_codes │ └── binary_models.R ├── lecture17-dates-n-times ├── README.md ├── complete_codes │ 
└── date_time_manipulations_fin.R └── raw_codes │ └── date_time_manipulations.R ├── lecture18-timeseries-regression ├── README.md ├── complete_codes │ └── intro_time_series_fin.R └── raw_codes │ ├── ggplotacorr.R │ └── intro_time_series.R ├── lecture19-advaced-rmarkdown ├── README.md ├── complete_codes │ ├── advanced_rmarkdown_fin.Rmd │ ├── advanced_rmarkdown_fin.log │ ├── advanced_rmarkdown_fin.pdf │ └── advanced_rmarkdown_fin_files │ │ └── figure-latex │ │ ├── create figure wi label-1.pdf │ │ ├── plot pred graph-1.pdf │ │ ├── setup-1.pdf │ │ └── show two graphs-1.pdf ├── extra │ ├── maschools_prep.R │ ├── maschools_report.Rmd │ └── maschools_report.pdf ├── hotels_analysis.pdf └── raw_codes │ ├── advanced_rmarkdown.Rmd │ └── advanced_rmarkdown_prep.R ├── lecture20-basic-spatial-vizz ├── README.md ├── complete_codes │ └── visualize_spatial_fin.R ├── data_map │ ├── BEZIRKSGRENZEOGDPolygon.dbf │ ├── BEZIRKSGRENZEOGDPolygon.shp │ ├── BEZIRKSGRENZEOGDPolygon.shx │ ├── London_Borough_Excluding_MHW.dbf │ ├── London_Borough_Excluding_MHW.shp │ └── London_Borough_Excluding_MHW.shx ├── output │ ├── heu_prices.png │ └── lifeexpectancy_world.png └── raw_codes │ └── visualize_spatial.R ├── lecture21-cross-validation ├── README.md └── crossvalidation_usedcars.R ├── lecture22-lasso ├── README.md ├── codes │ ├── ch14_aux_fncs.R │ └── lasso_aribnb.R └── data │ └── airbnb_hackney_workfile_adj_book1.csv ├── lecture23-regression-tree ├── README.md └── cart_usedcars.R ├── lecture24-random-forest ├── README.md ├── codes │ ├── airbnb_prepare.R │ └── randomforest_airbnb.R └── data │ ├── airbnb_london_workfile_adj_book.csv │ ├── gbm_model.RData │ ├── rf_model_1.RData │ └── rf_model_2.RData ├── lecture25-classification-wML ├── README.md ├── codes │ ├── auxfuncs_binarywML.R │ └── classification_wML.R └── data │ └── bisnode_firms_clean.RData ├── lecture26-long-term-time-series-wML ├── README.md └── long_term_swimming.R └── lecture27-short-term-time-series-ARIMA-VAR ├── README.md └── short_term_priceindex.R /.gitignore: -------------------------------------------------------------------------------- 1 | # History files 2 | .Rhistory 3 | .Rapp.history 4 | 5 | # Session Data files 6 | .RData 7 | 8 | # User-specific files 9 | .Ruserdata 10 | 11 | # Example code in package build process 12 | *-Ex.R 13 | 14 | # Output files from R CMD build 15 | /*.tar.gz 16 | 17 | # Output files from R CMD check 18 | /*.Rcheck/ 19 | 20 | # RStudio files 21 | .Rproj.user/ 22 | 23 | # produced vignettes 24 | vignettes/*.html 25 | vignettes/*.pdf 26 | 27 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 28 | .httr-oauth 29 | 30 | # knitr and R markdown default cache directories 31 | *_cache/ 32 | /cache/ 33 | 34 | # Temporary files created by R markdown 35 | *.utf8.md 36 | *.knit.md 37 | 38 | # R Environment Variables 39 | .Renviron 40 | .DS_Store 41 | lecture02-data-imp_n_exp/data/hotels_vienna/raw/hotelbookingdata.csv 42 | lecture02-data-imp_n_exp/data/hotels_vienna/export/my_csvfile.csv 43 | lecture02-data-imp_n_exp/data/hotels_vienna/export/my_csvfile.xlsx 44 | lecture02-data-imp_n_exp/data/hotels_vienna/export/hotelbookingdata.xlsx 45 | lecture02-data-imp_n_exp/data/hotels_vienna/export/hotelbookingdata.RData 46 | lecture02-data-imp_n_exp/data/hotels_vienna/export/my_rfile.RData 47 | lecture00-intro/test_Rmarkdown.html 48 | lecture00-intro/test_Rmarkdown.pdf 49 | TO DOs 50 | lecture03-data-imp_n_exp/data/hotels_vienna/export/my_csvfile.csv 51 | lecture03-data-imp_n_exp/data/hotels_vienna/export/my_csvfile.xlsx 52 | 
lecture03-data-imp_n_exp/data/hotels_vienna/export/my_rfile.RData
53 | lecture12_intro_to_regression/complete_codes/hotels_vienna_regression_fin_w_logs.R
54 | lecture19-advaced_rmarkdown/complete_codes/advanced_rmarkdown_fin.log
55 | lecture20-basic-spatial-vizz/visualize_spatial_old.R
56 | partIII-case-studies/seminar04-random-forest-airbnb/data/rf_model_2auto.RData
57 |
-------------------------------------------------------------------------------- /LICENSE: --------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Gabors Data Analysis
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
-------------------------------------------------------------------------------- /common_issues/README.md: --------------------------------------------------------------------------------
1 | # Common issues to help you navigate this course
2 |
3 | In this course, we focus on R and its IDE: RStudio. This means you won’t learn anything about Python, Julia, or any other programming language useful for data science. They’re also excellent choices, and in practice, most data science teams use a mix of languages, often at least R and Python.
4 |
5 | Here we collect some of the common issues that we have experienced over the years of teaching coding with R. These issues are not specific to any of the lectures but relate to techniques (such as Git or GitHub) or troubleshooting (e.g. RMarkdown) that tend to be specific to individual students. We give some general advice on how to start tackling these topics.
6 |
7 | This folder is meant to be dynamic in the sense that it adapts to new problems and shows general guidance on how to solve them.
8 |
9 | ## Current issues
10 |
11 | - Troubles with knitting an RMarkdown document: [help_rmarkdown.md](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues/help_rmarkdown.md).
12 | - Help for Git and GitHub: [help_github_n_Rstudio.md](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues/help_github_n_Rstudio.md).
13 | - In general, we have found the [cheatsheets by RStudio](https://www.rstudio.com/resources/cheatsheets/) useful for several topics.
14 |
15 | ## Archived - not relevant anymore
16 |
17 | No archived issue at the moment.
18 |
-------------------------------------------------------------------------------- /common_issues/help_github_n_Rstudio.md: --------------------------------------------------------------------------------
1 | # Help to connect RStudio to your GitHub account
2 |
3 | In this course, we not only learn how to use R and RStudio but also want to build good programming habits.
4 | A key part of this is actively using **version control**, and Git and GitHub are a great example. RStudio has a built-in direct connection to GitHub.
5 |
6 | In the following, we give help on how to establish this direct connection.
7 |
8 | ## 1) Create a **Personal Access Token** at *GitHub*
9 | To give RStudio access to your GitHub account, you need to create a token.
10 |
11 | 1. Sign in to GitHub
12 | 2. Click on your personal icon button and select [**Settings**](https://github.com/settings/profile) (Note: not the Settings button for one of your repos)
13 | 3. In the left options panel, select [**Developer Settings**](https://github.com/settings/apps)
14 | 4. Select [**Personal Access Tokens**](https://github.com/settings/tokens)
15 | 5. Select **Generate New Token**. This will ask for your GitHub password.
16 |     - You should add a note on what this token is for, such as *RStudio access* or something similar
17 |     - Set the **Expiration** to **No Expiration**.
18 |     - You can safely check all **Scopes** boxes, but at a minimum you need to check: *repo, workflow, write:package, delete:package, notification, write:discussion*
19 | 6. Click on **Generate Token** at the bottom of the page.
20 | 7. You get your **key**, which you **NEED TO SAVE to a temporary file or note**.
21 |     - In case you have not saved the key, you can regenerate it by clicking on your already existing token, but then you will need to update all the apps using this key
22 |
23 | Some of these steps are nicely summarized and shown in [Ginny Fahs's blog](https://ginnyfahs.medium.com/github-error-authentication-failed-from-command-line-3a545bfd0ca8).
24 |
25 | ## 2) Create a new repo on your GitHub
26 |
27 | It is a good idea to create a new repo for this course. You can use this repo throughout the course.
28 |
29 | ## 3) Creating a version-controlled project in RStudio
30 |
31 | 1. Open RStudio.
32 | 2. Create a new project (File/New project)
33 | 3. Select **Version Control**
34 | 4. Select **Git**
35 | 5. Add your **repo's URL** so RStudio can clone it, and select the **path** on your computer where your repo will live. Click **create project**.
36 | 6. A window will pop up and ask for your **GitHub account** and then for the **token key**.
37 | 7. You have your first GitHub-controlled RStudio project!
38 |
39 | ## 4) Working with a version-controlled project in RStudio
40 |
41 | 1. Open and work on your created version-controlled project.
42 | 2. You can commit your work by:
43 |     - Going to Tools/Version Control/Commit; a window comes up:
44 |       - In the upper-left panel you can select files to commit.
45 |       - In the upper-right panel you must specify your commit message.
46 |       - The lower panel shows the changes in your files.
47 |     - You can Push/Pull with the arrows in the upper-right section of the window.
48 |     - You can follow your history by switching from *Changes* to *History* at the top.
49 |     - You can also switch branches from *master* at the top.
50 | 3. You can Push/Pull within the commit window or by clicking Tools/Version Control/Push or Pull
51 | 4. Now your work has been updated.
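
If you prefer the console to clicking, the same commit workflow can be scripted from R. Below is a minimal sketch, assuming the `gert` package is installed (`install.packages('gert')`) and your working directory is the project folder; the file name and message are only examples:

```r
library(gert)

git_add('README.md')                  # stage a file (upper-left panel)
git_commit('update lecture notes')    # commit with a message (upper-right panel)
git_push()                            # push to GitHub (up arrow)
```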
52 |
53 | ## 5) Additional comments
54 |
55 | - Try to make version control a habit: do not only save your work to your computer but also push it to your repo.
56 | - It is a good practice to save, commit and push your project after each working session or even more frequently.
57 | - Pay attention to your folder structure.
58 | - Always add `Readme.md` files to your folders, you can add them by *File/New File/Text File* and save them as `Readme.md`.
59 | - You do not need to use RStudio for version control, you can use the *Shell/Terminal* or specific programs such as *GitHub Desktop*. RStudio does the same as these tools, just in a compact way.
60 | - If you are interested in more on this topic, I highly recommend checking out https://happygitwithr.com/.
61 |
-------------------------------------------------------------------------------- /common_issues/help_rmarkdown.md: --------------------------------------------------------------------------------
1 | # Help to knit your first document in *RMarkdown*
2 |
3 | In many cases knitting a document, especially to pdf, results in an error. Here, I provide some possible solutions for these problems.
4 | You may iterate through these possible solutions and check whether you get a proper output after each fix.
5 |
6 | 1. Your working directory contains invalid characters
7 |     - In general, you should avoid paths that contain non-machine-readable characters, such as non-English characters *á,ë,ö* or characters that have their own purpose in coding, such as *.,;\*\\\[\]\(\)\#* or *space*.
8 |     - You should use **'_'** or **'-'** to separate words in your folder/file names instead, which is machine-readable.
9 |     - **Solution:** rename the folders in the path that contains the *.Rmd* file, and the *.Rmd* file itself.
10 |
11 | 2. Try to re-install or update your RStudio and R
12 |
13 |    2.1. You can update both RStudio and R internally
14 |     - RStudio update: click on Help/Check for Updates
15 |     - R update:
16 |       * Windows users: use the `installr` package.
17 |
18 | ```r
19 | install.packages("installr")
20 | library(installr)
21 | updateR()
22 | ```
23 |
24 |       * Mac update: substitute your password at the last line.
25 |
26 | ```r
27 | install.packages('devtools') # assuming it is not already installed
28 | library(devtools)
29 | install_github('andreacirilloac/updateR')
30 | library(updateR)
31 | updateR(admin_password = 'Admin user password')
32 | ```
33 |
34 |
35 |    2.2. You can download both of them from their websites; see links in the [Readme.md](https://github.com/regulyagoston/BA21_Coding/blob/main/README.md)
36 |
37 | 3. Error encountered while knitting a **pdf** file.
38 |
39 |    3.1 You do not have a **tex/latex** engine installed.
40 |     - Install tinytex from R:
41 | ```r
42 | install.packages('tinytex')
tinytex::install_tinytex() # the package alone is not enough; this installs the TinyTeX distribution itself
43 | ```
44 |
45 |    3.2. Your **tex/latex** engine is out-of-date.
46 |     - You have to update/re-install the latex engine.
47 |
48 | ```r
49 | tinytex::reinstall_tinytex()
50 | ```
51 |
52 |     - Restart your RStudio and try to knit your document.
53 |
54 |    3.3. Install another **tex/latex** engine if *tinytex* does not work...
55 |     - An alternative **tex/latex** engine is *MiKTeX*.
56 |       * Follow the steps written in [Søren L Kristiansen's blog](https://medium.com/@sorenlind/create-pdf-reports-using-r-r-markdown-latex-and-knitr-on-windows-10-952b0c48bfa9)
57 |       * Alternatively you can watch this old [video](https://www.youtube.com/watch?v=k-xSGZ-RLBU&ab_channel=OutLieer) on how to install it in RStudio.
58 |     - Try to stick with *tinytex* as much as possible; these alternatives are not as stable. However, sometimes I have found it is the only solution...
59 |
-------------------------------------------------------------------------------- /lecture00-intro/README.md: --------------------------------------------------------------------------------
1 | # Lecture 00: Introduction to R and RStudio
2 |
3 | ## Motivation
4 |
5 | In this course, we focus on R and its IDE: RStudio. This means you won’t learn anything about Python, Julia, or any other programming language useful for data science. They’re also excellent choices, and in practice, most data science teams use a mix of languages, often at least R and Python.
6 |
7 | We also believe that R is a great place to start your data science career as it is an environment designed from the ground up to support data science. R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammar for specific parts of the data science process. These mini languages help you think about problems as a data scientist while supporting fluent interaction between your brain and the computer. ([Hadley Wickham and Garrett Grolemund R for Data Science](https://r4ds.had.co.nz/introduction.html))
8 |
9 | ## This lecture
10 |
11 | This is the starting lecture, which introduces students to R and RStudio (download and install), runs a pre-written script, asks them to knit a pdf/HTML document, and highlights the importance of version control.
12 |
13 | The aim of this class is not to teach coding, but to make sure that everybody has R and RStudio on their laptop, installs the `tidyverse` package, and (tries to) knit an RMarkdown document. The main aim of these steps is to reveal possible OS mismatches or other problems with R and RStudio.
14 | The material and steps are detailed in [`getting_started.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/getting_started.md).
15 |
16 |
17 | ## Learning outcomes
18 | After successfully teaching the material (see: [`getting_started.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/getting_started.md)), students will have
19 |
20 | - R and RStudio on their laptops/computers
21 |
22 | and understand,
23 |
24 | - What RStudio looks like and which window is which.
25 | - How to run a command via the console.
26 | - What libraries (packages) are, and how to install and load them.
27 | - Why version control is important and what the main possibilities are with Git and GitHub.
28 |
29 | Furthermore, students will,
30 |
31 | - knit an RMarkdown document in both *pdf* and *HTML*, without any deeper knowledge of the details.
32 |
33 | These steps are extremely important, as fixing installation and knitting problems may take days to weeks.
34 |
35 | ## Datasets used
36 | * No dataset is used in this lecture
37 |
38 | ## Lecture Time
39 |
40 | Ideal overall time: **20-30 mins**.
41 |
42 | It can differ substantially from this if the teacher decides to do a live coding session with students and fix the emerging problems during the class (up to ~90 mins).
43 |
44 | ## Homework
45 |
46 | No homework, apart from fixing possible issues with R, RStudio, and compiling a '.Rmd' in HTML and pdf.
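
A related tip: if the **Knit** button itself misbehaves, knitting from the console can help isolate the problem, since the full error message shows up there. A minimal sketch, assuming `test_Rmarkdown.Rmd` is in the working directory:

```r
# the Knit button calls rmarkdown::render() under the hood
rmarkdown::render('test_Rmarkdown.Rmd', output_format = 'pdf_document')
```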
47 |
48 | ## Further material
49 |
50 | - Hadley Wickham and Garrett Grolemund R for Data Science [Chapter 1](https://r4ds.had.co.nz/introduction.html) on introduction, [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html) on libraries, [Chapter 6](https://r4ds.had.co.nz/workflow-scripts.html) on windows and workflow.
51 | - Kieran H. (2019): Data Visualization [Chapter 2.2](https://socviz.co/gettingstarted.html#use-r-with-rstudio) introduces the window structure in RStudio pretty well.
52 | - Andrew Heiss: Data Visualization with R, [Lesson 1](https://datavizs21.classes.andrewheiss.com/lesson/01-lesson/) provides some great videos and an introduction to R and Rmarkdown.
53 | - Git references:
54 |   - [Technical foundations of informatics book](https://info201.github.io/git-basics.html)
55 |   - [Software carpentry course](https://swcarpentry.github.io/git-novice/) (Strongly recommended)
56 |   - [Github Learning Lab](https://lab.github.com/)
57 |   - [If you are really committed](https://git-scm.com/book/en/v2) (pun intended)
58 |
59 |
60 | ## File structure
61 |
62 | - [`getting_started.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/getting_started.md) provides material on the installation of R, RStudio, and tidyverse, and shows some cool stuff with R.
63 | - [`getting_started.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/getting_started.Rmd) is the generating Rmarkdown file for `getting_started.md`
64 | - [`intro_to_R.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/intro_to_R.R) includes code to introduce scripts, install `tidyverse`, and show how cool R is.
65 | - [`test_Rmarkdown.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture00-intro/test_Rmarkdown.Rmd) is a test file to reveal possible issues with knitting an RMarkdown document. During the course, students need to be able to compile their work into pdf and/or HTML. This test is super important to do as quickly as possible, since some fixes take a while...
66 |
67 | ## Help with RMarkdown and RStudio with Git
68 |
69 | In case you have trouble with knitting an RMarkdown document or connecting your GitHub to RStudio, I have collected the major solutions, which may help, in the [**common_issues**](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues) folder.
70 |
71 | - For RMarkdown: [help_rmarkdown.md](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues/help_rmarkdown.md) file.
72 | - For GitHub: [help_github_n_Rstudio.md](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/common_issues/help_github_n_Rstudio.md) file.
73 |
-------------------------------------------------------------------------------- /lecture00-intro/getting_started_files/figure-gfm/unnamed-chunk-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture00-intro/getting_started_files/figure-gfm/unnamed-chunk-2-1.png -------------------------------------------------------------------------------- /lecture00-intro/intro_to_R.R: --------------------------------------------------------------------------------
1 | #########################
2 | ##                     ##
3 | ## WELCOME TO R-STUDIO ##
4 | ##                     ##
5 | ##  THIS IS A SCRIPT   ##
6 | ##                     ##
7 | ##     Lecture 00      ##
8 | ##                     ##
9 | #########################
10 |
11 | # Cleaning the environment
12 | # just to be sure we are in the same setup
13 | rm(list = ls())
14 |
15 | # Install a package
16 | install.packages('tidyverse')
17 | # load a package for the work
18 | library(tidyverse)
19 |
20 |
21 | # There are built-in data:
22 | # mpg is a dataset for cars with different characteristics:
23 | mpg
24 |
25 |
26 | # It is easy to create a plot to compare:
27 | # engine size (displ) vs. fuel efficiency (hwy)
28 | ggplot(data = mpg) +
29 |   geom_point(mapping = aes(x = displ, y = hwy)) +
30 |   labs(y = 'fuel efficiency', x = 'engine size')
31 |
32 |
33 | # You may say that there are specific groups
34 | # which are not highlighted by this simple graph
35 | # it is easy to plot some further patterns...
36 | ggplot(data = mpg) +
37 |   geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
38 |   labs(y = 'fuel efficiency', x = 'engine size')
39 |
40 |
41 | # You may want to quantify these relations
42 | # First, start with the overall pattern
43 | ggplot(data = mpg) +
44 |   geom_point(mapping = aes(x = displ, y = hwy)) +
45 |   geom_smooth(mapping = aes(x = displ, y = hwy)) +
46 |   labs(y = 'fuel efficiency', x = 'engine size')
47 |
48 | # But it is as easy to refine the graph for more complex patterns:
49 | ggplot(data = mpg) +
50 |   geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
51 |   geom_smooth(mapping = aes(x = displ, y = hwy, color = class)) +
52 |   labs(y = 'fuel efficiency', x = 'engine size')
53 |
54 | # But it may be overcrowded...
55 | # No worries, one can easily make multiple graphs as well!
56 | ggplot(data = mpg) +
57 |   geom_point(mapping = aes(x = displ, y = hwy)) +
58 |   geom_smooth(mapping = aes(x = displ, y = hwy)) +
59 |   facet_wrap(~ class, nrow = 3) +
60 |   labs(y = 'fuel efficiency', x = 'engine size')
61 |
62 |
63 | ##
64 | # With R, we can get maps as well:
65 | install.packages('maps')
66 | ggplot(map_data('world'), aes(long, lat, group = group)) +
67 |   geom_polygon(fill = 'white', colour = 'black') +
68 |   coord_quickmap()
69 |
70 | ##
71 | # Or other pretty cool stuff that we are going to learn through the course!
72 |
73 |
74 |
-------------------------------------------------------------------------------- /lecture00-intro/test_Rmarkdown.Rmd: --------------------------------------------------------------------------------
1 | ---
2 | title: "Tryout"
3 | output: pdf_document
4 | ---
5 |
6 | ```{r setup, include=FALSE}
7 | knitr::opts_chunk$set(echo = TRUE)
8 | ```
9 |
10 | ## R Markdown
11 |
12 | This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
13 |
14 | When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
15 |
16 | ```{r cars}
17 | summary(cars)
18 | ```
19 |
20 | ## Including Plots
21 |
22 | You can also embed plots, for example:
23 |
24 | ```{r pressure, echo=FALSE}
25 | plot(pressure)
26 | ```
27 |
28 | Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
29 |
-------------------------------------------------------------------------------- /lecture01-coding-basics/README.md: --------------------------------------------------------------------------------
1 | # Lecture 01: Coding basics
2 |
3 | ## Motivation
4 |
5 | Coding has its charm: you can do anything if you know the basics. We take the traditional programming approach and first introduce the building blocks of coding. This may seem cumbersome at first sight (e.g. in contrast to Hadley Wickham and Garrett Grolemund's R for Data Science), but it leads to understanding the basic principles of coding. It is also a great help when searching for solutions on the web, as most of the solutions you find are built from these basic blocks.
6 |
7 | ## This lecture
8 |
9 | This is the first coding lecture, which introduces students to coding in R.
10 | It is a **live coding class**, where the teaching material is detailed in [`coding_basics.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.md).
11 |
12 |
13 | ## Learning outcomes
14 | After successfully live-coding the material ([`coding_basics.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.md)), students will have knowledge of
15 |
16 | - What RStudio is.
17 | - How to run a command via the console.
18 | - What a script is and how to create one.
19 | - The difference between a script and a project.
20 | - The basics of coding:
21 |   - how to name your variables, how to use comments, and how to format your code
22 |   - what an R-object is and what operations one can do with it (numeric, logical, characters, and factors)
23 |   - what a built-in function is and how it works
24 |   - how to determine the type of an R-object via functions
25 |   - how to create a vector, what the vector operations are, and how to get its elements via indexing.
26 |   - what the special variables/values/issues are (empty variable, NA, Inf, precision)
27 |   - what the different variable types are
28 |   - how to create different vectors
29 |   - how to create lists and index into them
30 |
31 | ## Datasets used
32 |
33 | - No dataset is used in this lecture
34 |
35 | ## Lecture Time
36 |
37 | Ideal overall time: **100 mins**.
38 |
39 | This lecture time is one of the hardest to predict, as it solely depends on the background of the students. This is an introductory class showing how coding works in general. Always aim for the students with the least knowledge. Note, however, that there are extra 'good-to-know' parts that can be skipped.
40 |
41 | This lecture assumes that R and RStudio already work on their laptops.
42 |
43 | ## Homework
44 |
45 | *Type*: quick practice, approx 15 mins, 7+1 lines of code. See [`assignment_1.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/assignment_1.R).
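
For a quick taste of the building blocks listed above, here is a minimal sketch (see [`first_script.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/first_script.R) for the full session):

```r
v <- c(2, 5, 10)                 # a numeric vector
v[2:3]                           # indexing: elements 2 and 3
my_list <- list('a', v, 2 == 3)  # a list can mix types
my_list[[2]][1]                  # [[ ]] unwraps an element, [ ] indexes into it
```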
46 |
47 | ## Further material
48 |
49 | - Hadley Wickham and Garrett Grolemund R for Data Science [Chapter 4](https://r4ds.had.co.nz/workflow-basics.html) provides some basic principles and console usage. [Chapter 6](https://r4ds.had.co.nz/workflow-scripts.html) deals with scripts and some error handling. [Chapter 8](https://r4ds.had.co.nz/workflow-projects.html) shows how to work with projects, along with useful setup options and working directory settings. [Chapter 20](https://r4ds.had.co.nz/vectors.html) provides a similar but more detailed discussion.
50 | - Kieran H. (2019): Data Visualization [Chapter 2.2-2.3](https://socviz.co/gettingstarted.html#use-r-with-rstudio) introduces the window structure in RStudio pretty well (Chapter 2.2) and basic syntax, objects, libraries (Chapter 2.3)
51 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 02](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/02_code_style.Rmd) provides useful guidelines on how to write and format code. [Lecture 03](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/03_1d_data.Rmd) provides some additional exercises and insights into R-objects, variable types, and some basic functions.
52 | - <https://style.tidyverse.org/> provides a great overview of good coding style with the `tidyverse` approach.
53 |
54 |
55 | ## File structure
56 |
57 | - [`coding_basics.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.md) provides material for the live coding session with explanations.
58 | - [`coding_basics.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.Rmd) is the generating Rmarkdown file for [`coding_basics.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/coding_basics.md)
59 | - [**`first_script.R`**](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/first_script.R) is a possible realization of the live coding session
60 | - [`assignment_1.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture01-coding-basics/assignment_1.R) is the assignment after the first lecture.
61 |
-------------------------------------------------------------------------------- /lecture01-coding-basics/assignment_1.R: --------------------------------------------------------------------------------
1 | #########################
2 | ##                     ##
3 | ##    Assignment 1     ##
4 | ##                     ##
5 | ##   Coding basics     ##
6 | ##                     ##
7 | ##    Deadline:        ##
8 | ##                     ##
9 | ##                     ##
10 | #########################
11 |
12 | ##
13 | # Task: fill out the blank parts with the needed commands
14 |
15 | # 1) Add the command which clears the environment
16 |
17 |
18 | # 2) Create a string variable, which states: 'This is my first assignment in R!'
19 | str_var <-
20 |
21 | # 3) Decide with the proper command whether this is truly a string variable or not:
22 |
23 |
24 | # 4) Create vector 'v', which contains the values of: 3, 363, 777, 2021, -987 and Inf
25 | v <-
26 |
27 | # 5) Multiply this vector by 10 and name it v_10
28 |
29 |
30 | # 6) Create a list, which contains the 'str_var' variable and the 'v' vector
31 | mL <-
32 |
33 | # 7) Get the value of 'Inf' out of this 'mL' variable with indexing
34 |
35 |
36 | # +1) decide whether the previously extracted value is infinite or not.
37 | # The result should be a logical value.
38 |
39 |
40 |
41 |
-------------------------------------------------------------------------------- /lecture01-coding-basics/first_script.R: --------------------------------------------------------------------------------
1 | ##################################
2 | #                                #
3 | #          Lecture 01            #
4 | #                                #
5 | #    Introduction to coding      #
6 | #                                #
7 | #    - R-objects                 #
8 | #    - Variables                 #
9 | #    - Built in functions        #
10 | #    - Vectors                   #
11 | #    - Indexing                  #
12 | #    - Special values            #
13 | #    - Lists                     #
14 | #                                #
15 | #                                #
16 | ##################################
17 |
18 |
19 | ##
20 | # R-objects:
21 |
22 | # Character
23 | myString <- 'Hello world!'
24 |
25 | # Convention to name your variables
26 | my_fav_var <- 'bla'
27 | myFavVar <- 'bla'
28 | # Rarely use long names such as
29 | my_favourite_variable <- 'bla'
30 |
31 |
32 | # We can define numeric R-objects:
33 | a <- 2
34 | b <- 3
35 |
36 | # And do mathematical operations:
37 | a+b-(a*b)^a
38 |
39 | c <- a + b
40 | d <- a*c/b*c
41 |
42 | # Or create logical R-object:
43 | a == b
44 | 2 == 3
45 | (a + 1) == b
46 | # negation:
47 | a != b
48 |
49 | # other logical operators for multiple statements
50 | 2 == 2 & 3 == 2
51 | 2 == 2 | 3 == 2
52 |
53 | ##
54 | # Functions:
55 |
56 | # Remove variables from workspace
57 | rm(d)
58 | # or calculate square root:
59 | sqrt(4)
60 | # if not sure what a function does:
61 | ?sqrt
62 |
63 | ##
64 | # Type of R-objects:
65 | typeof(myString)
66 | typeof(a)
67 |
68 | # Numeric values: integer and double
69 | num_val <- as.numeric(1.2)
70 | doub_val <- as.double(1.2)
71 | int_val <- as.integer(1.2)
72 | typeof(num_val)
73 |
74 | # Decide what type a variable has:
75 | is.character(myString)
76 | is.logical(2==3)
77 | is.double(doub_val)
78 | is.integer(int_val)
79 | is.numeric(doub_val)
80 | is.numeric(int_val)
81 | is.integer(doub_val)
82 | is.double(int_val)
83 |
84 | ##
85 | # Create vectors
86 | v <- c(2,5,10)
87 | # Operations with vectors
88 | z <- c(3,4,7)
89 |
90 | v+z
91 | v*z
92 | a+v
93 |
94 | # Number of elements
95 | num_v <- length(v)
96 | num_v
97 |
98 | # Create vector from vectors
99 | w <- c(v,z)
100 | w
101 | length(w)
102 | # R is case-sensitive: gives an error
103 | length(W)
104 |
105 | # Note: be careful with operations on vectors of different lengths
106 | q <- c(2,3)
107 | v+q
108 | v+c(2,3,2)
109 |
110 | # Indexing with vectors: goes with []
111 | v[1]
112 | v[2:3]
113 | v[c(1,3)]
114 |
115 | # Fix the addition of v+q
116 | v[1:2] + q
117 |
118 |
119 | ## Special variables/values/issues:
120 | null_vector <- c()
121 | # NaN and NA values
122 | nan_vec <- c(NaN,1,2,3,4)
123 | na_vec <- c(NA,1,2,3,4)
124 | nan_vec + 3
125 | # Inf values
126 | inf_val <- Inf
127 | 5/0
128 | # Rounding issues
129 | sqrt(2)^2 == 2
130 | # and fix it:
131 | round(sqrt(2)^2) == 2
132 |
133 |
134 | ####
135 | # Lists
136 | my_list <- list('a',2,0==1)
137 | my_list2 <- list(c('a','b'),c(1,2,3),sqrt(2)^2==2)
138 |
139 | # indexing with lists:
140 | # you get the list's value - still a list (typeof(my_list2[1]))
141 | my_list2[1]
142 | # you get the vector's value - it is a character (typeof(my_list2[[1]]))
143 | my_list2[[1]]
144 | # you get the second element from the vector
145 | my_list2[[1]][2]
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/README.md: --------------------------------------------------------------------------------
1 | # Lecture 02: Import and Export data to R
2 |
3 | ## Motivation
4 |
5 | Data doesn’t grow on trees but needs to be collected with a lot of effort, and it’s essential
to have high-quality data to get meaningful answers to our questions. In the end, data quality is determined by how the data was collected. Thus, it’s fundamental for data analysts to understand various data collection methods, how they affect data quality in general, and what the details of the actual collection of their data imply for its quality. The most important methods of data collection used in business, economics, and policy analysis, such as web scraping, using administrative sources, and conducting surveys, all produce data that needs to be imported into R.
6 |
7 | ## This lecture
8 |
9 | This lecture introduces students to importing and exporting data in R with `readr` from `tidyverse`. Various import techniques and formats are discussed, along with several options for exporting data to the local computer.
10 |
11 |
12 | ## Learning outcomes
13 | After successfully completing the code in *raw_codes* students should be able to:
14 |
15 | [`dataset_handling.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes/dataset_handling.R)
16 | - Import data in *csv* or other formats via
17 |   - clicking through the built-in options
18 |   - using a local path
19 |   - downloading directly via a URL
20 |   - using an API, namely the `tidyquant` and `WDI` packages.
21 | - Export data in *csv*, *xlsx* or *RData* format to the local computer
22 |
23 | ## Datasets used
24 |
25 | * [Hotels Vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna)
26 | * [Football](https://gabors-data-analysis.com/datasets/#football) as homework.
27 |
28 |
29 | ## Lecture Time
30 |
31 | Ideal overall time: **10-20 mins**.
32 |
33 | Showing [`dataset_handling.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes/dataset_handling.R) takes around *10 minutes*, while doing the tasks takes the rest.
34 |
35 |
36 | ## Homework
37 |
38 | *Type*: quick practice, approx 10 mins
39 |
40 | Import the [football](https://osf.io/zqm6c/) data tables from OSF. To be more precise, you should import the table containing the managers' characteristics (`football_managers.csv`) and the teams' football performance (`football_managers_workfile.csv`). Make sure to use a tidy folder structure: create a data folder with raw and clean subfolders. For this time only, export the same data tables into an export folder as `xlsx` and `.RData` files.
41 |
42 | ## Further material
43 |
44 | - Hadley Wickham and Garrett Grolemund R for Data Science: [Chapter 11](https://r4ds.had.co.nz/data-import.html) provides an overview of data import and export along with a detailed discussion of how these methods work and how the tidyverse approaches them.
45 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 02](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/02_computational_reproducibility.Rmd) provides useful further guidelines on how to organize the folder structure and how to export and import data/figures/etc.
46 |
47 |
48 | ## Folder structure
49 |
50 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes) includes one script, which is ready to use during the course but requires some live coding in class.
- [`dataset_handling.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes/dataset_handling.R)
52 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/complete_codes) includes one script with the solutions:
53 |   - [`dataset_handling_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/complete_codes/dataset_handling_fin.R) is the solution for [`dataset_handling.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture02-data-imp-n-exp/raw_codes/dataset_handling.R)
54 | - [data/hotels_vienna](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture02-data-imp-n-exp/data/hotels_vienna) provides a folder structure for the class. It contains data that will be used during the lecture as well as folders for the outputs.
55 |   - [clean](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture02-data-imp-n-exp/data/hotels_vienna/clean) - this is a great example of how to organize a project's cleaned data folder.
56 |   - [raw](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture02-data-imp-n-exp/data/hotels_vienna/raw) - includes the raw files. During the lecture, you should save the hotel bookings data as `hotelbookingdata.csv` into this folder.
57 |   - [export](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture02-data-imp-n-exp/data/hotels_vienna/export) - is the folder where you should export all the files during the course.
58 |
59 |
60 |
61 |
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/complete_codes/dataset_handling_fin.R: --------------------------------------------------------------------------------
1 | ##################################
2 | #                                #
3 | #          Lecture 02            #
4 | #                                #
5 | #    Import and Export Data      #
6 | #         to R with              #
7 | #                                #
8 | #  - Importing with clicking     #
9 | #  - read_csv():                 #
10 | #    - local and url             #
11 | #    - working directory         #
12 | #  - Export                      #
13 | #    - write_csv                 #
14 | #    - xlsx package              #
15 | #    - save to RData             #
16 | #  - API:                        #
17 | #    - tidyquant and WDI         #
18 | #                                #
19 | #                                #
20 | ##################################
21 |
22 | rm(list = ls())
23 | # Tidyverse includes the readr package
24 | # which we use for importing data!
25 | library(tidyverse)
26 |
27 |
28 |
29 | ####################
30 | ## Importing data:
31 | # 3 options to import data:
32 |
33 |
34 | #####
35 | # 1) Import by clicking: File -> Import Dataset ->
36 | #     -> From Text (readr) / this is for csv. You may use other options to import other specific formats
37 | #
38 | # Notes:
39 | #   - Do this exercise to find your data and notice that the import command shows up in the console.
40 | #       If the second option does not work, check the path in the console!
41 | #   - Check the library that the import command used: it is called 'readr' which is part of 'tidyverse'!
42 | #       You should avoid calling libraries multiple times, thus if tidyverse is already imported,
43 | #       there is no need to import readr again.
44 | #       (But in this case it will not cause any problem. It may be a problem if you call different versions!)
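#
# A minimal sketch (hypothetical relative path) of what the import wizard
#   prints to the console after you click through it -- copy that command
#   into your script so the import is reproducible:
# library(readr)
# hotels_vienna <- read_csv('data/hotels_vienna/clean/hotels-vienna.csv')
# View(hotels_vienna)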
45 |
46 | #######
47 | # 2) Import by defining your path:
48 | # a) use an absolute path (you have to know the path of your csv from the root folder)
49 |
50 | data_in <- '~/Documents/Egyetem/Bekes_Kezdi_Textbook/da-coding-rstats/lecture02-data-imp_n_exp/data/hotels_vienna/'
51 | df_0 <- read_csv(paste0(data_in,'clean/hotels-vienna.csv'))
52 |
53 | # b) use relative path:
54 | # R works in a specific folder called the `working directory`, which you can check by:
55 | getwd()
56 |
57 | # after that, you can set your working directory by:
58 | setwd(data_in)
59 | # and simply call the data
60 | df_1 <- read_csv('clean/hotels-vienna.csv')
61 |
62 |
63 | # delete your data
64 | rm(hotels_vienna, df_0, df_1)
65 |
66 |
67 | ########
68 | # 3) Import by using url - this is going to be our preferred method in this course!
69 | # Note: importing from the web is generally inferior to using your local disk,
70 | #   but there are some exceptions:
71 | #   a) The data is considerably large (>1GB)
72 | #   b) It is important that there is no `refresh` or change in the data
73 | #   in these cases it is good practice to download the data to your computer
74 |
75 | # You can access (almost) all the data from 'OSF'
76 | # the hotels vienna dataset has the following url:
77 | df <- read_csv(url('https://osf.io/y6jvb/download'))
78 |
79 |
80 | ###
81 | # Quick check on the data:
82 |
83 | # glimpse on data
84 | glimpse(df)
85 |
86 | # Check some of the first observations
87 | head(df)
88 |
89 | # Have a built-in summary for the variables
90 | summary(df)
91 |
92 |
93 | ###########################
94 | # Exporting your data:
95 | #
96 | # This is a special case: data_out is now the same as data_in (no cleaning...)
97 | data_out <- paste0(data_in, '/export/')
98 | write_csv(df, paste0(data_out, 'my_csvfile.csv'))
99 |
100 | # If for some reason you would like to export as xls(x)
101 | install.packages('writexl')
102 | library(writexl)
103 | write_xlsx(df, paste0(data_out, 'my_csvfile.xlsx'))
104 |
105 | # Third option is to save as an R object
106 | save(df, file = paste0(data_out, 'my_rfile.RData'))
107 |
108 | ######
109 | # Extra: using API
110 | #   - tq_get - get stock prices from Yahoo/Google/FRED/Quandl, etc.
111 | #   - WDI - get various data from the World Bank's site
112 | #
113 |
114 | # tidyquant
115 | install.packages('tidyquant')
116 | library(tidyquant)
117 | # Apple stock prices from Yahoo
118 | aapl <- tq_get('AAPL',
119 |                from = '2020-01-01',
120 |                to = '2021-10-01',
121 |                get = 'stock.prices')
122 |
123 | glimpse(aapl)
124 |
125 | # World Bank
126 | install.packages('WDI')
127 | library(WDI)
128 | # How WDI works - it is an API
129 | # Search for variables which contain GDP
130 | a <- WDIsearch('gdp')
131 | # Narrow down the search for: GDP + something + capita + something + constant
132 | a <- WDIsearch('gdp.*capita.*constant')
133 | # Get data
134 | gdp_data <- WDI(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2019, end=2019)
135 |
136 | glimpse(gdp_data)
137 |
138 | ##
139 | # Tasks:
140 | #
141 | # 1) Go to the webpage: https://gabors-data-analysis.com/ and find the OSF database under `Data and Code`
142 | # 2) Go to Gabor's OSF database and manually download
143 | #     the `hotelbookingdata.csv` from the `hotels-europe` dataset to your computer and save it to the 'raw' folder.
144 | # 3) load the data from this path
145 | # 4) also load the data directly from the web (note you need to add `/download` to the url)
146 | # 5) write out this file as xlsx and as an .RData next to the original data.
147 |
148 | # Load from path
149 | df_t0 <- read_csv(paste0(data_in,'raw/hotelbookingdata.csv'))
150 | # Load from web
151 | df_t1 <- read_csv('https://osf.io/yzntm/download')
152 | # Write as xlsx
153 | write_xlsx(df_t1, paste0(data_out, 'hotelbookingdata.xlsx'))
154 | # Write as .RData
155 | save(df_t1, file = paste0(data_out, 'hotelbookingdata.RData'))
156 |
157 |
158 |
159 |
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/data/hotels_vienna/clean/README.md: --------------------------------------------------------------------------------
1 | ****************************************************************
2 | Prepared for Gabor's Data Analysis
3 |
4 | Data Analysis for Business, Economics, and Policy
5 | by Gabor Bekes and Gabor Kezdi
6 | Cambridge University Press 2021
7 | gabors-data-analysis.com
8 |
9 | Description of the
10 | hotels-vienna dataset
11 |
12 | used in case studies
13 | 2A Finding a good deal among hotels: data preparation
14 | 3A Finding a good deal among hotels: data exploration
15 | 7A Finding a good deal among hotels with simple regression
16 | 8A Finding a good deal among hotels with nonlinear function
17 | 8C Hotel ratings and measurement error
18 | 10B Finding a good deal among hotels with multiple regression
19 |
20 | ****************************************************************
21 |
22 | [see it on the website](https://gabors-data-analysis.com/dat_hotels-vienna)
23 |
24 |
25 | This is a README file for the `hotels-vienna` dataset that includes information on price and features of hotels in Vienna for one date.
26 |
27 |
28 | ## Data source
29 |
30 | Scraped from a price comparison website.
31 | It was anonymized and slightly altered to ensure confidentiality. It contains all the necessary information about location and rating that helps to distinguish the hotels.
32 |
33 | ## Data access and copyright
34 |
35 | The data was collected by the authors and may be used for educational purposes only.
36 |
37 | ## About the data
38 |
39 | ### Raw data tables
40 |
41 | `hotelbookingdata-vienna.csv`
42 |
43 | The file contains data about hotel prices and features from a price comparison website.
44 | * for Vienna, Austria,
45 | * for a single weekday night in November 2017
46 | * The dataset has N=430 observations.
47 | * ID variable: hotel_id
48 |
49 |
50 | ### Tidy data table
51 |
52 | `hotels-vienna` is just a slightly cleaned version of the raw data excluding duplicates.
53 |
54 | * The dataset has N=428 observations.
55 | * ID variable: hotel_id
56 |
57 |
58 | | variable name       | info                                            | type      |
59 | |-------------------- |------------------------------------------------ |--------- |
60 | | hotel_id            | Hotel ID                                        | numeric  |
61 | | accommodation_type  | Type of accommodation                           | factor   |
62 | | country             | Country                                         | string   |
63 | | city                | City based on search                            | string   |
64 | | city_actual         | City actual of hotel                            | string   |
65 | | neighbourhood       | Neighbourhood                                   | string   |
66 | | center1label        | Centre 1 - name of location for distance        | string   |
67 | | distance            | Distance - from main city center                | numeric  |
68 | | center2label        | Centre 2 - name of location for distance_alter  | string   |
69 | | distance_alter      | Distance - alternative - from Centre 2          | numeric  |
70 | | stars               | Number of stars                                 | numeric  |
71 | | rating              | User rating average                             | numeric  |
72 | | rating_count        | Number of user ratings                          | numeric  |
73 | | ratingta            | User rating average (tripadvisor)               | numeric  |
74 | | ratingta_count      | Number of user ratings (tripadvisor)            | numeric  |
75 | | hotel_id            | Hotel ID                                        | numeric  |
76 | | year                | Year (YYYY)                                     | numeric  |
77 | | month               | Month (MM)                                      | numeric  |
78 | | weekend             | Flag, if day is a weekend                       | binary   |
79 | | holiday             | Flag, if day is a public holiday                | binary   |
80 | | nnights             | Number of nights                                | factor   |
81 | | price               | Price in EUR                                    | numeric  |
82 | | scarce_room         | Flag, if room was noted as scarce               | binary   |
83 | | offer               | Flag, if there was an offer available           | binary   |
84 | | offer_cat           | Type of offer                                   | factor   |
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/data/hotels_vienna/clean/VARIABLES.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture02-data-imp-n-exp/data/hotels_vienna/clean/VARIABLES.xlsx -------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/data/hotels_vienna/export/export_here: --------------------------------------------------------------------------------
1 | Here you should export the files you have created during the lecture.
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/data/hotels_vienna/raw/show_folder: --------------------------------------------------------------------------------
1 | To show raw folder on GitHub
-------------------------------------------------------------------------------- /lecture02-data-imp-n-exp/raw_codes/dataset_handling.R: --------------------------------------------------------------------------------
1 | ##################################
2 | #                                #
3 | #          Lecture 02            #
4 | #                                #
5 | #    Import and Export Data      #
6 | #         to R with              #
7 | #                                #
8 | #  - Importing with clicking     #
9 | #  - read_csv():                 #
10 | #    - local and url             #
11 | #    - working directory         #
12 | #  - Export                      #
13 | #    - write_csv                 #
14 | #    - xlsx package              #
15 | #    - save to RData             #
16 | #  - API:                        #
17 | #    - tidyquant and WDI         #
18 | #                                #
19 | #                                #
20 | ##################################
21 |
22 | rm(list = ls())
23 | # Tidyverse includes the readr package
24 | # which we use for importing data!
25 | library(tidyverse)
26 |
27 |
28 |
29 | ####################
30 | ## Importing data:
31 | # 3 options to import data:
32 |
33 |
34 | #####
35 | # 1) Import by clicking: File -> Import Dataset ->
36 | #     -> From Text (readr) / this is for csv.
You may use other options to import other specific formats
37 | #
38 | # Notes:
39 | #   - Do this exercise to find your data and notice that the import command shows up in the console.
40 | #       If the second option does not work, check the path in the console!
41 | #   - Check the library that the import command used: it is called 'readr' which is part of 'tidyverse'!
42 | #       You should avoid calling libraries multiple times, thus if tidyverse is already imported,
43 | #       there is no need to import readr again.
44 | #       (But in this case it will not cause any problem. It may be a problem if you call different versions!)
45 |
46 | #######
47 | # 2) Import by defining your path:
48 | # a) use an absolute path (you have to know the path of your csv from the root folder)
49 |
50 | data_in <- '~/Documents/Egyetem/Bekes_Kezdi_Textbook/da-coding-rstats/lecture02-data-imp_n_exp/data/hotels_vienna/'
51 | df_0 <- read_csv(paste0(data_in,'clean/hotels-vienna.csv'))
52 |
53 | # b) use relative path:
54 | # R works in a specific folder called the `working directory`, which you can check by:
55 | getwd()
56 |
57 | # after that, you can set your working directory by:
58 | setwd(data_in)
59 | # and simply call the data
60 | df_1 <- read_csv('clean/hotels-vienna.csv')
61 |
62 |
63 | # delete your data
64 | rm(hotels_vienna, df_0, df_1)
65 |
66 |
67 | ########
68 | # 3) Import by using url - this is going to be our preferred method in this course!
69 | # Note: importing from the web is generally inferior to using your local disk,
70 | #   but there are some exceptions:
71 | #   a) The data is considerably large (>1GB)
72 | #   b) It is important that there is no `refresh` or change in the data
73 | #   in these cases it is good practice to download the data to your computer
74 |
75 | # You can access (almost) all the data from 'OSF'
76 | # the hotels vienna dataset has the following url:
77 | df <- read_csv(url('https://osf.io/y6jvb/download'))
78 |
79 |
80 | ###
81 | # Quick check on the data:
82 |
83 | # glimpse on data
84 | glimpse(df)
85 |
86 | # Check some of the first observations
87 | head(df)
88 |
89 | # Have a built-in summary for the variables
90 | summary(df)
91 |
92 |
93 | ###########################
94 | # Exporting your data:
95 | #
96 | # This is a special case: data_out is now the same as data_in (no cleaning...)
97 | data_out <- paste0(data_in, '/export/')
98 | write_csv(df, paste0(data_out, 'my_csvfile.csv'))
99 |
100 | # If for some reason you would like to export as xls(x)
101 | install.packages('writexl')
102 | library(writexl)
103 | write_xlsx(df, paste0(data_out, 'my_csvfile.xlsx'))
104 |
105 | # Third option is to save as an R object
106 | save(df, file = paste0(data_out, 'my_rfile.RData'))
107 |
108 | ######
109 | # Extra: using API
110 | #   - tq_get - get stock prices from Yahoo/Google/FRED/Quandl, etc.
111 | #   - WDI - get various data from the World Bank's site
112 | #
113 |
114 | # tidyquant
115 | install.packages('tidyquant')
116 | library(tidyquant)
117 | # Apple stock prices from Yahoo
118 | aapl <- tq_get('AAPL',
119 |                from = '2020-01-01',
120 |                to = '2021-10-01',
121 |                get = 'stock.prices')
122 |
123 | glimpse(aapl)
124 |
125 | # World Bank
126 | install.packages('WDI')
127 | library(WDI)
128 | # How WDI works - it is an API
129 | # Search for variables which contain GDP
130 | a <- WDIsearch('gdp')
131 | # Narrow down the search for: GDP + something + capita + something + constant
132 | a <- WDIsearch('gdp.*capita.*constant')
133 | # Get data
134 | gdp_data <- WDI(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2019, end=2019)
135 |
136 | glimpse(gdp_data)
137 |
138 | ##
139 | # Tasks:
140 | #
141 | # 1) Go to the webpage: https://gabors-data-analysis.com/ and find the OSF database under `Data and Code`
142 | # 2) Go to Gabor's OSF database and manually download
143 | #     the `hotelbookingdata.csv` from the `hotels-europe` dataset to your computer.
144 | # 3) load the data from this path
145 | # 4) also load the data directly from the web
146 | # 5) write out this file as xlsx and as an .RData next to the original data.
147 |
148 |
149 |
-------------------------------------------------------------------------------- /lecture03-tibbles/README.md: --------------------------------------------------------------------------------
1 | # Lecture 03: Tibbles
2 |
3 | ## Motivation
4 |
5 | How to start working with data? Clarifying the concept of tidy data helps us to carry out analysis in a tractable way. Tidy data tables have the same structure for storing observations and variables. We discuss potential issues with storing observations and variables, and how to deal with those issues. We describe good practices for the process of converting non-tidy data into a tidy data frame.
6 |
7 | ## This lecture
8 |
9 | This lecture introduces `tibble`-s as the 'Data' type of variable in `tidyverse`. It shows multiple column and row manipulations with one `tibble`, as well as how to merge two `tibble`s. It uses pre-written code with tasks during the class.
10 |
11 | Data merging is based on [Chapter 02, C: Identifying successful football managers](https://gabors-data-analysis.com/casestudies/#ch02c-identifying-successful-football-managers).
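
As a flavor of the merging covered in this lecture, here is a minimal sketch with made-up toy tibbles (not the football data):

```r
library(tidyverse)

teams   <- tibble(id = 1:3, team = c('A', 'B', 'C'))
coaches <- tibble(id = c(1, 2, 4), coach = c('X', 'Y', 'Z'))

left_join(teams, coaches, by = 'id')   # keeps all rows of teams; coach is NA for id 3
full_join(teams, coaches, by = 'id')   # keeps all ids from both tibbles
```

Which join to use depends on which table's identifiers you want to keep; the lecture walks through each variant.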
12 | 13 | 14 | ## Learning outcomes 15 | After successfully completing codes in [`intro_to_tibbles.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes/intro_to_tibbles.R) students should be able to: 16 | 17 | - understand what a 'Data' variable is, why to use `tibble`, and how it relates to `data_frame` and `data.frame` 18 | - How to do indexing with a tibble 19 | - indexing with integer numbers 20 | - indexing with logicals 21 | - when to use which and what are the connections 22 | - How to use simple functions with tibbles 23 | - `sum`, `mean`, `sd`, `add_column`, `select`, `add_row` 24 | - How to: 25 | - reset a cell's value in a tibble 26 | - add or remove a column (or variable) 27 | - add or remove a row (or an observation) 28 | - Wide vs long format and how to convert one to the other 29 | - `pivot_wider` and `pivot_longer` functions 30 | - Merging - different ways to merge two tibbles: 31 | - new/other rows/observations are in the new tibble 32 | - new/other columns/variables are in the new tibble 33 | - difference between: `left_join`, `right_join`, `full_join` and `inner_join` 34 | - importance of the identifier variables and cases of non-unique identification 35 | - `all_equal` to compare tibbles 36 | - extra: `semi_join` and `anti_join` 37 | 38 | ## Datasets used 39 | 40 | - [Football](https://gabors-data-analysis.com/datasets/#football) 41 | 42 | ## Lecture Time 43 | 44 | Ideal overall time: **30-40 mins**. 45 | 46 | Showing [`intro_to_tibbles.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes/intro_to_tibbles.R) takes around *20-25 minutes*, while doing the tasks would take the rest. 47 | 48 | 49 | ## Homework 50 | 51 | *Type*: quick practice, approx 15 mins 52 | 53 | Use the created tibble from class (called `df`) and create two new tibbles -- called `df_2` and `df_3` -- with the following values: 54 | 55 | `df_2`: 56 | 57 | | id | age | grade | gender | 58 | | -- | --- | ----- | ------ | 59 | | 10 | 16 | C | F | 60 | | 11 | 40 | A | F | 61 | | 12 | 52 | B- | M | 62 | | 13 | 24 | C+ | M | 63 | | 14 | 28 | B+ | M | 64 | | 15 | 26 | A- | F | 65 | 66 | `df_3`: 67 | 68 | | id | height | 69 | | -- | ------ | 70 | | 1 | 165 | 71 | | 3 | 200 | 72 | | 5 | 187 | 73 | | 10 | 175 | 74 | | 12 | 170 | 75 | 76 | Do the following manipulations: 77 | 78 | - add `df_2` to `df` and call the newly merged tibble `df_m` 79 | - merge `df_3` to `df_m` such that you keep *all* id values (adding missing values where needed), call it `df_m2` 80 | - merge `df_3` to `df_m` such that you keep only ids with no missing values, call it `df_m3` 81 | - create a wide format from `df_m2` with names from grades and values from age (not really meaningful, but good practice) 82 | 83 | 84 | ## Further material 85 | 86 | - Hadley Wickham and Garrett Grolemund: R for Data Science: [Chapter 12](https://r4ds.had.co.nz/tidy-data.html) introduces the tidy approach and works with tibbles. [Chapter 13](https://r4ds.had.co.nz/relational-data.html) provides a detailed discussion on merging. 87 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 05](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/05_tidy_data.Rmd) provides further useful guidelines on the tidy approach and merging.
88 | - Another interesting material on this topic is by [Hansjörg Neth: Data Science for Psychologists](https://bookdown.org/hneth/ds4psy/), especially [Chapter 7.2](https://bookdown.org/hneth/ds4psy/7-2-tidy-essentials.html) on wide vs long format and [Chapter 8](https://bookdown.org/hneth/ds4psy/8-join.html) on merging. 89 | 90 | 91 | ## Folder structure 92 | 93 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes) includes one code, which is ready to use during the course but requires some live coding in class. 94 | - [`intro_to_tibbles.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes/intro_to_tibbles.R) 95 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/complete_codes) includes one code with solutions for 96 | - [`intro_to_tibbles_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/complete_codes/intro_to_tibbles_fin.R) solution for: [`intro_to_tibbles.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture03-tibbles/raw_codes/intro_to_tibbles.R) 97 | 98 | 99 | 100 | -------------------------------------------------------------------------------- /lecture03-tibbles/data/games.csv: -------------------------------------------------------------------------------- 1 | team,manager_id,manager_name,manager_games 2 | Arsenal,19,Arsène Wenger,380 3 | Arsenal,238,Unai Emery,38 4 | Aston Villa,14,Alex McLeish,38 5 | Aston Villa,71,Eric Black,7 6 | Aston Villa,80,Gary McAllister,5 7 | Aston Villa,96,Gérard Houllier,30 8 | Aston Villa,131,Kevin MacDonald,3 9 | Aston Villa,146,Martin O'Neill,76 10 | Aston Villa,173,Paul Lambert,101 11 | Aston Villa,204,Rémi Garde,21 12 | Aston Villa,229,Tim Sherwood,23 13 | Birmingham,14,Alex McLeish,76 14 | Blackburn,171,Paul Ince,17 15 | Blackburn,205,Sam Allardyce,76 16 | Blackburn,216,Steve Kean,59 17 | Blackpool,102,Ian Holloway,38 18 | Bolton,81,Gary Megson,56 19 | Bolton,165,Owen Coyle,96 20 | Bournemouth,68,Eddie Howe,152 21 | Brighton,40,Chris Hughton,76 22 | Burnley,32,Brian Laws,18 23 | Burnley,165,Owen Coyle,20 24 | Burnley,209,Sean Dyche,152 25 | Cardiff,57,David Kerslake,2 26 | Cardiff,138,Malky Mackay,18 27 | Cardiff,158,Neil Warnock,38 28 | Cardiff,163,Ole Gunnar Solskjær,18 29 | Chelsea,16,André Villas-Boas,27 30 | Chelsea,17,Antonio Conte,76 31 | Chelsea,37,Carlo Ancelotti,76 32 | Chelsea,95,Guus Hiddink,35 33 | Chelsea,120,José Mourinho,92 34 | Chelsea,137,Luiz Felipe Scolari,25 35 | Chelsea,149,Maurizio Sarri,38 36 | Chelsea,184,Rafael Benítez,26 37 | Chelsea,192,Roberto Di Matteo,23 38 | Crystal Palace,9,Alan Pardew,73 39 | Crystal Palace,77,Frank de Boer,4 40 | Crystal Palace,102,Ian Holloway,8 41 | Crystal Palace,124,Keith Millen,7 42 | Crystal Palace,158,Neil Warnock,16 43 | Crystal Palace,199,Roy Hodgson,72 44 | Crystal Palace,205,Sam Allardyce,21 45 | Crystal Palace,235,Tony Pulis,27 46 | Everton,58,David Moyes,190 47 | Everton,61,David Unsworth,6 48 | Everton,140,Marco Silva,38 49 | Everton,194,Roberto Martínez,113 50 | Everton,197,Ronald Koeman,47 51 | Everton,205,Sam Allardyce,24 52 | Fulham,46,Claudio Ranieri,16 53 | Fulham,72,Felix Magath,12 54 | Fulham,143,Mark Hughes,38 55 | Fulham,145,Martin Jol,89 56 | Fulham,189,René Meulensteen,13 57 | Fulham,199,Roy Hodgson,76 58 | Fulham,208,Scott Parker,10 59 | Fulham,211,Slaviša Jokanović,12 60 | Huddersfield,62,David Wagner,60 61 | Huddersfield,105,Jan Siewert,15 
62 | Huddersfield,142,Mark Hudson,1 63 | Hull,100,Iain Dowie,9 64 | Hull,140,Marco Silva,18 65 | Hull,154,Mike Phelan,20 66 | Hull,213,Steve Bruce,76 67 | Leicester,28,Brendan Rodgers,11 68 | Leicester,45,Claude Puel,56 69 | Leicester,46,Claudio Ranieri,63 70 | Leicester,49,Craig Shakespeare,21 71 | Leicester,150,Michael Appleton,1 72 | Leicester,160,Nigel Pearson,38 73 | Liverpool,28,Brendan Rodgers,122 74 | Liverpool,126,Kenny Dalglish,56 75 | Liverpool,184,Rafael Benítez,76 76 | Liverpool,199,Roy Hodgson,20 77 | Man City,31,Brian Kidd,2 78 | Man City,139,Manuel Pellegrini,114 79 | Man City,143,Mark Hughes,54 80 | Man City,175,Pep Guardiola,114 81 | Man City,193,Roberto Mancini,134 82 | Man United,12,Alex Ferguson,190 83 | Man United,58,David Moyes,34 84 | Man United,120,José Mourinho,93 85 | Man United,136,Louis van Gaal,76 86 | Man United,163,Ole Gunnar Solskjær,21 87 | Man United,203,Ryan Giggs,4 88 | Middlesbrough,3,Aitor Karanka,27 89 | Middlesbrough,78,Gareth Southgate,38 90 | Middlesbrough,212,Steve Agnew,11 91 | Newcastle,9,Alan Pardew,156 92 | Newcastle,10,Alan Shearer,8 93 | Newcastle,40,Chris Hughton,19 94 | Newcastle,112,Joe Kinnear,24 95 | Newcastle,114,John Carver,18 96 | Newcastle,129,Kevin Keegan,3 97 | Newcastle,184,Rafael Benítez,86 98 | Newcastle,217,Steve McClaren,28 99 | Norwich,15,Alex Neil,38 100 | Norwich,40,Chris Hughton,71 101 | Norwich,157,Neil Adams,5 102 | Norwich,173,Paul Lambert,38 103 | Portsmouth,21,Avram Grant,25 104 | Portsmouth,97,Harry Redknapp,8 105 | Portsmouth,170,Paul Hart,27 106 | Portsmouth,231,Tony Adams,16 107 | QPR,42,Chris Ramsey,15 108 | QPR,97,Harry Redknapp,49 109 | QPR,143,Mark Hughes,30 110 | QPR,158,Neil Warnock,20 111 | Reading,34,Brian McDermott,29 112 | Reading,159,Nigel Adkins,8 113 | Reading,262,Eamonn Dolan,1 114 | Southampton,45,Claude Puel,38 115 | Southampton,143,Mark Hughes,22 116 | Southampton,147,Mauricio Pellegrino,30 117 | Southampton,148,Mauricio Pochettino,54 118 | Southampton,185,Ralph Hasenhüttl,24 119 | Southampton,197,Ronald Koeman,76 120 | Stoke,143,Mark Hughes,174 121 | Stoke,173,Paul Lambert,16 122 | Stoke,235,Tony Pulis,190 123 | Sunderland,58,David Moyes,38 124 | Sunderland,65,Dick Advocaat,17 125 | Sunderland,94,Gus Poyet,60 126 | Sunderland,127,Kevin Ball,2 127 | Sunderland,146,Martin O'Neill,56 128 | Sunderland,166,Paolo Di Canio,12 129 | Sunderland,190,Ricky Sbragia,23 130 | Sunderland,205,Sam Allardyce,30 131 | Sunderland,213,Steve Bruce,89 132 | Swansea,7,Alan Curtis,7 133 | Swansea,25,Bob Bradley,11 134 | Swansea,28,Brendan Rodgers,38 135 | Swansea,38,Carlos Carvalhal,18 136 | Swansea,73,Francesco Guidolin,24 137 | Swansea,79,Garry Monk,67 138 | Swansea,134,Leon Britton,2 139 | Swansea,151,Michael Laudrup,62 140 | Swansea,168,Paul Clement,37 141 | Tottenham,16,André Villas-Boas,54 142 | Tottenham,97,Harry Redknapp,144 143 | Tottenham,121,Juande Ramos,8 144 | Tottenham,148,Mauricio Pochettino,190 145 | Tottenham,229,Tim Sherwood,22 146 | Watford,106,Javi Gracia,52 147 | Watford,140,Marco Silva,24 148 | Watford,183,Quique Sánchez Flores,38 149 | Watford,240,Walter Mazzarri,38 150 | West Brom,8,Alan Irvine,19 151 | West Brom,9,Alan Pardew,18 152 | West Brom,52,Darren Moore,6 153 | West Brom,81,Gary Megson,2 154 | West Brom,123,Keith Downing,5 155 | West Brom,150,Michael Appleton,1 156 | West Brom,176,Pepe Mel,18 157 | West Brom,192,Roberto Di Matteo,25 158 | West Brom,199,Roy Hodgson,50 159 | West Brom,214,Steve Clarke,53 160 | West Brom,233,Tony Mowbray,38 161 | West Brom,235,Tony Pulis,107 162 | West 
Ham,6,Alan Curbishley,3 163 | West Ham,21,Avram Grant,36 164 | West Ham,58,David Moyes,27 165 | West Ham,85,Gianfranco Zola,72 166 | West Ham,130,Kevin Keen,3 167 | West Ham,139,Manuel Pellegrini,38 168 | West Ham,205,Sam Allardyce,114 169 | West Ham,210,Slaven Bilić,87 170 | Wigan,194,Roberto Martínez,152 171 | Wigan,213,Steve Bruce,38 172 | Wolves,152,Mick McCarthy,101 173 | Wolves,162,Nuno Espírito Santo,38 174 | Wolves,226,Terry Connor,13 175 | -------------------------------------------------------------------------------- /lecture03-tibbles/data/points.csv: -------------------------------------------------------------------------------- 1 | team,manager_id,manager_name,manager_points 2 | Arsenal,19,Arsène Wenger,721 3 | Arsenal,238,Unai Emery,70 4 | Aston Villa,71,Eric Black,1 5 | Aston Villa,80,Gary McAllister,8 6 | Aston Villa,96,Gérard Houllier,34 7 | Aston Villa,131,Kevin MacDonald,6 8 | Aston Villa,146,Martin O'Neill,126 9 | Aston Villa,173,Paul Lambert,101 10 | Aston Villa,204,Rémi Garde,12 11 | Aston Villa,229,Tim Sherwood,20 12 | Birmingham,14,Alex McLeish,89 13 | Blackburn,171,Paul Ince,13 14 | Blackburn,205,Sam Allardyce,99 15 | Blackburn,216,Steve Kean,53 16 | Blackpool,102,Ian Holloway,39 17 | Bolton,81,Gary Megson,59 18 | Bolton,165,Owen Coyle,103 19 | Bournemouth,68,Eddie Howe,177 20 | Brighton,40,Chris Hughton,76 21 | Burnley,165,Owen Coyle,20 22 | Burnley,209,Sean Dyche,167 23 | Cardiff,57,David Kerslake,1 24 | Cardiff,138,Malky Mackay,17 25 | Cardiff,158,Neil Warnock,34 26 | Cardiff,163,Ole Gunnar Solskjær,12 27 | Chelsea,16,André Villas-Boas,46 28 | Chelsea,17,Antonio Conte,163 29 | Chelsea,37,Carlo Ancelotti,157 30 | Chelsea,95,Guus Hiddink,69 31 | Chelsea,120,José Mourinho,184 32 | Chelsea,137,Luiz Felipe Scolari,49 33 | Chelsea,149,Maurizio Sarri,72 34 | Chelsea,184,Rafael Benítez,51 35 | Chelsea,192,Roberto Di Matteo,42 36 | Crystal Palace,9,Alan Pardew,88 37 | Crystal Palace,77,Frank de Boer,0 38 | Crystal Palace,102,Ian Holloway,3 39 | Crystal Palace,124,Keith Millen,3 40 | Crystal Palace,158,Neil Warnock,15 41 | Crystal Palace,199,Roy Hodgson,93 42 | Crystal Palace,205,Sam Allardyce,26 43 | Crystal Palace,235,Tony Pulis,41 44 | Everton,58,David Moyes,297 45 | Everton,61,David Unsworth,10 46 | Everton,140,Marco Silva,54 47 | Everton,194,Roberto Martínez,163 48 | Everton,197,Ronald Koeman,69 49 | Everton,205,Sam Allardyce,34 50 | Fulham,46,Claudio Ranieri,12 51 | Fulham,72,Felix Magath,12 52 | Fulham,143,Mark Hughes,49 53 | Fulham,145,Martin Jol,105 54 | Fulham,189,René Meulensteen,10 55 | Fulham,199,Roy Hodgson,99 56 | Fulham,208,Scott Parker,9 57 | Fulham,211,Slaviša Jokanović,5 58 | Huddersfield,62,David Wagner,48 59 | Huddersfield,105,Jan Siewert,5 60 | Huddersfield,142,Mark Hudson,0 61 | Hull,100,Iain Dowie,6 62 | Hull,140,Marco Silva,21 63 | Hull,154,Mike Phelan,13 64 | Hull,180,Phil Brown,59 65 | Hull,213,Steve Bruce,72 66 | Leicester,28,Brendan Rodgers,20 67 | Leicester,45,Claude Puel,70 68 | Leicester,46,Claudio Ranieri,102 69 | Leicester,49,Craig Shakespeare,29 70 | Leicester,150,Michael Appleton,3 71 | Leicester,160,Nigel Pearson,41 72 | Liverpool,28,Brendan Rodgers,219 73 | Liverpool,122,Jürgen Klopp,296 74 | Liverpool,126,Kenny Dalglish,85 75 | Liverpool,184,Rafael Benítez,149 76 | Liverpool,199,Roy Hodgson,25 77 | Man City,31,Brian Kidd,3 78 | Man City,139,Manuel Pellegrini,231 79 | Man City,143,Mark Hughes,76 80 | Man City,175,Pep Guardiola,276 81 | Man City,193,Roberto Mancini,276 82 | Man United,12,Alex Ferguson,433 83 | Man United,58,David 
Moyes,57 84 | Man United,120,José Mourinho,176 85 | Man United,136,Louis van Gaal,136 86 | Man United,163,Ole Gunnar Solskjær,40 87 | Man United,203,Ryan Giggs,7 88 | Middlesbrough,3,Aitor Karanka,22 89 | Middlesbrough,78,Gareth Southgate,32 90 | Middlesbrough,212,Steve Agnew,6 91 | Newcastle,9,Alan Pardew,209 92 | Newcastle,10,Alan Shearer,5 93 | Newcastle,40,Chris Hughton,19 94 | Newcastle,112,Joe Kinnear,25 95 | Newcastle,114,John Carver,12 96 | Newcastle,129,Kevin Keegan,4 97 | Newcastle,184,Rafael Benítez,102 98 | Newcastle,217,Steve McClaren,24 99 | Norwich,15,Alex Neil,34 100 | Norwich,40,Chris Hughton,76 101 | Norwich,157,Neil Adams,1 102 | Norwich,173,Paul Lambert,47 103 | Portsmouth,21,Avram Grant,21 104 | Portsmouth,97,Harry Redknapp,13 105 | Portsmouth,170,Paul Hart,24 106 | Portsmouth,231,Tony Adams,11 107 | QPR,42,Chris Ramsey,11 108 | QPR,97,Harry Redknapp,40 109 | QPR,143,Mark Hughes,24 110 | QPR,158,Neil Warnock,17 111 | Reading,34,Brian McDermott,23 112 | Reading,159,Nigel Adkins,5 113 | Reading,262,Eamonn Dolan,0 114 | Southampton,45,Claude Puel,46 115 | Southampton,143,Mark Hughes,17 116 | Southampton,147,Mauricio Pellegrino,28 117 | Southampton,148,Mauricio Pochettino,75 118 | Southampton,159,Nigel Adkins,22 119 | Southampton,185,Ralph Hasenhüttl,30 120 | Southampton,197,Ronald Koeman,123 121 | Stoke,143,Mark Hughes,219 122 | Stoke,173,Paul Lambert,13 123 | Stoke,235,Tony Pulis,225 124 | Sunderland,58,David Moyes,24 125 | Sunderland,65,Dick Advocaat,15 126 | Sunderland,94,Gus Poyet,63 127 | Sunderland,127,Kevin Ball,0 128 | Sunderland,146,Martin O'Neill,65 129 | Sunderland,166,Paolo Di Canio,9 130 | Sunderland,190,Ricky Sbragia,21 131 | Sunderland,200,Roy Keane,15 132 | Sunderland,205,Sam Allardyce,36 133 | Sunderland,213,Steve Bruce,102 134 | Swansea,7,Alan Curtis,5 135 | Swansea,25,Bob Bradley,8 136 | Swansea,28,Brendan Rodgers,47 137 | Swansea,38,Carlos Carvalhal,20 138 | Swansea,73,Francesco Guidolin,32 139 | Swansea,79,Garry Monk,88 140 | Swansea,134,Leon Britton,1 141 | Swansea,151,Michael Laudrup,70 142 | Swansea,168,Paul Clement,41 143 | Tottenham,16,André Villas-Boas,99 144 | Tottenham,97,Harry Redknapp,250 145 | Tottenham,121,Juande Ramos,2 146 | Tottenham,229,Tim Sherwood,42 147 | Watford,106,Javi Gracia,65 148 | Watford,140,Marco Silva,26 149 | Watford,183,Quique Sánchez Flores,45 150 | Watford,240,Walter Mazzarri,40 151 | West Brom,8,Alan Irvine,17 152 | West Brom,9,Alan Pardew,8 153 | West Brom,52,Darren Moore,11 154 | West Brom,81,Gary Megson,2 155 | West Brom,123,Keith Downing,6 156 | West Brom,176,Pepe Mel,15 157 | West Brom,192,Roberto Di Matteo,26 158 | West Brom,199,Roy Hodgson,67 159 | West Brom,214,Steve Clarke,64 160 | West Brom,233,Tony Mowbray,32 161 | West Brom,235,Tony Pulis,125 162 | West Ham,6,Alan Curbishley,6 163 | West Ham,21,Avram Grant,33 164 | West Ham,58,David Moyes,33 165 | West Ham,85,Gianfranco Zola,80 166 | West Ham,130,Kevin Keen,0 167 | West Ham,139,Manuel Pellegrini,52 168 | West Ham,205,Sam Allardyce,133 169 | West Ham,210,Slaven Bilić,116 170 | Wigan,194,Roberto Martínez,157 171 | Wigan,213,Steve Bruce,45 172 | Wolves,152,Mick McCarthy,99 173 | Wolves,162,Nuno Espírito Santo,57 174 | -------------------------------------------------------------------------------- /lecture04-data-munging/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 04: Data Munging with dplyr 2 | 3 | ## Motivation 4 | 5 | Before analyzing the data, data analysts spend a lot of time organizing, 
managing, and cleaning it to prepare it for analysis. This is called data wrangling or data munging. It is often said that 80 percent of data analysis time is spent on these tasks. Data wrangling is an iterative process: we usually start by organizing and cleaning our data, then start doing the analysis, and then go back to the cleaning process as problems emerge during analysis. 6 | 7 | Here we introduce students to a (relatively) easy way of carrying out this task and use the case study of [finding a good deal among hotels](https://gabors-data-analysis.com/casestudies/#ch02a-finding-a-good-deal-among-hotels-data-preparation). After the initial data preparation, we continue working towards finding hotels that are underpriced relative to their location and quality. In this lecture, we illustrate how to find problems with observations and variables and how to solve those problems. 8 | 9 | ## This Lecture 10 | 11 | This lecture introduces students to how to manipulate raw data in various ways with `dplyr` from `tidyverse`. 12 | 13 | This lecture is based on [Chapter 02, A: Finding a good deal among hotels: data preparation](https://gabors-data-analysis.com/casestudies/#ch02a-finding-a-good-deal-among-hotels-data-preparation). 14 | 15 | 16 | ## Learning outcomes 17 | After successfully completing [`data_munging.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture04-data-munging/raw_codes/data_munging.R), students should be able to: 18 | 19 | - Add variables 20 | - Separate a character variable into two (or more) variables with `separate` 21 | - Convert different types of variables to specific types: 22 | - character to numeric 23 | - character to factor -> understanding the `factor` variable type 24 | - Further string manipulations (`gsub` and regular expressions) 25 | - Rename variables with `rename` 26 | - Filter out different observations with `filter` 27 | - select observations with specific values 28 | - tabulate different values of a variable with `table` 29 | - filter out missing values 30 | - replace specific values with others 31 | - handle duplicates with `duplicated` 32 | - use pipes `%>%` to do multiple manipulations at once 33 | - sort data in ascending or descending order according to a specific variable with `arrange` 34 | 35 | ## Datasets used 36 | * [Hotels Europe](https://gabors-data-analysis.com/datasets/#hotels-europe) 37 | 38 | 39 | ## Lecture Time 40 | 41 | Ideal overall time: **40-60 mins**. 42 | 43 | Showing [`data_munging.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture04-data-munging/raw_codes/data_munging.R) takes around *30 minutes* while doing the tasks would take the rest. 44 | 45 | 46 | ## Homework 47 | 48 | *Type*: quick practice, approx 10 mins 49 | 50 | Use the same [hotel-europe data from OSF](https://osf.io/r6uqb/), but now 51 | - Download both `hotels-europe_price.csv` and `hotels-europe_features.csv` 52 | - `left_join` them in this order by `hotel_id` 53 | - filter for: 54 | - time: 2018/01 and weekend == 1 55 | - city: Vienna or London.
Hint: for multiple matches, use something like: 56 | ```r 57 | city %in% c('City_A','City_B') 58 | ``` 59 | - accommodation should be Apartment, 3-4 stars (only) with more than 10 reviews 60 | - price is less than $600 61 | - arrange the data in ascending order by price 62 | 63 | ## Further material 64 | 65 | - More materials on the case study can be found in Gabor's [da_case_studies repository](https://github.com/gabors-data-analysis/da_case_studies): [ch02-hotels-data-prep](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch02-hotels-data-prep/ch02-hotels-data-prep.R) 66 | - Hadley Wickham and Garrett Grolemund: R for Data Science: [Chapter 5](https://r4ds.had.co.nz/transform.html) provides an overview of the types of variables, selecting, filtering, and arranging, along with others. [Chapter 15](https://r4ds.had.co.nz/factors.html) provides further material on factors. [Chapter 18](https://r4ds.had.co.nz/pipes.html) discusses pipes in various applications. 67 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 3](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/03_1d_data.Rmd) is relevant for factors, but includes much more. [Lecture 6](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/06_slicing_dicing.Rmd) introduces similar manipulations with tibble. 68 | - Grant McDermott: Data Science for Economists, Course material, [Lecture 5](https://github.com/uo-ec607/lectures/blob/master/05-tidyverse/05-tidyverse.pdf) is a nice overview of tidyverse with easy data manipulations. 69 | 70 | 71 | ## Folder structure 72 | 73 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture04-data-munging/raw_codes) includes one code, which is ready to use during the course but requires some live coding in class. 74 | - [`data_munging.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture04-data-munging/raw_codes/data_munging.R) 75 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture04-data-munging/complete_codes) includes one code with solutions: 76 | - [`data_munging_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture04-data-munging/complete_codes/data_munging_fin.R) solution for: [`data_munging.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture04-data-munging/raw_codes/data_munging.R) 77 | -------------------------------------------------------------------------------- /lecture06-rmarkdown101/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 06: Introduction to RMarkdown 2 | 3 | ## Motivation 4 | 5 | You want to know and articulate whether online and offline prices differ in your country for products that are sold in both ways. You have access to data on a sample of products with their online and offline prices. How would you use this data to establish whether prices tend to be different or the same for all products? 6 | 7 | We introduce RMarkdown, a powerful tool to create R-based reports in many formats that can be easily updated. By the end of this lecture, students should know how to put together a descriptive report with hypothesis testing and simple formatting. 8 | 9 | 10 | ## This lecture 11 | 12 | This lecture introduces students to *RMarkdown*, which is a great tool to create reports in pdf or Html.
The aim of this session is to prepare students to create a simple report in pdf or Html on a descriptive analysis. This lecture uses the exploratory analysis of [lecture05-data-exploration](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture05-data-exploration) and puts it into an RMarkdown document. 13 | 14 | Case studies connected to this lecture are similar to [lecture05-data-exploration](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture05-data-exploration), but this lecture focuses on how to create a report and does not cover patterns of associations. 15 | - [Chapter 03, A: Finding a good deal among hotels: data exploration](https://gabors-data-analysis.com/casestudies/#ch03a-finding-a-good-deal-among-hotels-data-exploration) - emphasis on one-variable descriptive analysis, different data 16 | - [Chapter 06, A: Comparing online and offline prices: testing the difference](https://gabors-data-analysis.com/casestudies/#ch06a-comparing-online-and-offline-prices-testing-the-difference) - focuses on hypothesis testing; associations and one-variable descriptives are not emphasized. 17 | 18 | 19 | ## Learning outcomes 20 | After completing [`report_bpp.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/raw_codes/report_bpp.Rmd) students should be able to: 21 | 22 | - Knit Rmd documents into Html and pdf 23 | - Understanding the structure of an RMarkdown file: 24 | - YAML header, chunks of (R) code surrounded by ``` and text mixed with simple formatting 25 | - Headers of R code chunks: 26 | - Use general commands, such as `include`, `echo`, `warning` or `eval` 27 | - Text formatting: 28 | - sections and sub-sections 29 | - bulleted, numbered and nested lists 30 | - bold and italic 31 | - add plain and embedded url 32 | - in-line reported code values 33 | - simple Greek letters 34 | - color text (in pdf) 35 | - Reporting descriptive statistics with `modelsummary` and `kableExtra` packages 36 | - rename the reported variable names 37 | - add caption and notes 38 | - set the position of the table with `kable_styling()` 39 | - `kable` to report a `tibble` 40 | - add column (or row) names 41 | - add caption 42 | - report in pdf, setting the position and converting the format theme 43 | - report in Html, setting the position and changing the format theme 44 | - Report a `ggplot2` object 45 | - set the size of the plot with `fig.width` and `fig.height` 46 | - align the plot with `fig.align` and `fig.pos` with the `float` package in YAML 47 | - add caption 48 | - set plot labels, theme, etc. to fit the formatting 49 | 50 | ## Datasets used 51 | * [Billion Prices](https://gabors-data-analysis.com/datasets/#billion-prices) 52 | 53 | 54 | ## Lecture Time 55 | 56 | Ideal overall time: **20-40 mins**. 57 | 58 | Showing [`report_bpp.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/raw_codes/report_bpp.Rmd) takes around *20-30 minutes* while doing the tasks would take the rest. 59 | 60 | Issues with RMarkdown knitting should be resolved by now.
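As a minimal sketch of the chunk-header options and in-line reporting listed above (the chunk label, object name, and values are illustrative):

````
```{r price-summary, echo=FALSE, warning=FALSE}
# the chunk is evaluated and its output is kept in the report,
# but echo=FALSE hides the code and warning=FALSE hides warnings
avg_price <- mean(c(90, 110, 130))
```

The average price is `r avg_price` euros.
````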
61 | 62 | 63 | ## Homework 64 | 65 | *Type*: quick practice, approx 15 mins 66 | 67 | Use the [hotel-europe data from OSF](https://osf.io/r6uqb/) and filter to have: 68 | - Time: year 2017, November and `weekday = 0` 69 | - Cities: London and Vienna 70 | - Accommodation: 3-4 star hotels 71 | 72 | Create a max 2-page report in pdf **and** Html, where you 73 | - describe the data filtering you have done with a list 74 | - show a histogram of the prices 75 | - report a descriptive table for the prices grouped by city 76 | - and carry out a simple t-test to decide if the mean prices in the two cities are the same. Hint: `t.test(price ~ city, data)` would compare the prices in the two cities. 77 | - draw a conclusion in text with Greek letters and in-line code. 78 | 79 | Note: there is no need for a comprehensive argument here; focus rather on the coding and pretty-reporting part. 80 | 81 | ## Further material 82 | 83 | - Hadley Wickham and Garrett Grolemund: R for Data Science: [Chapter 27](https://r4ds.had.co.nz/r-markdown.html) reviews the basics of RMarkdown such as chunks, general setup, problem-solving, and citation. [Chapter 29](https://r4ds.had.co.nz/r-markdown-formats.html) shows different types of outputs that are not covered in this lecture but can be handy. 84 | - [Yihui Xie, Christophe Dervieux, Emily Riederer: R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/) is a detailed book on all RMarkdown topics and issues. 85 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch06-online-offline-price-test](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch06-online-offline-price-test) 86 | 87 | ## Folder structure 88 | 89 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture06-rmarkdown101/raw_codes) includes one RMarkdown file, which is ready to use during the course but requires some live coding in class.
90 | - [`report_bpp.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/raw_codes/report_bpp.Rmd) 91 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture06-rmarkdown101/complete_codes) includes 92 | - [`report_bpp_fin.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.Rmd) RMarkdown file with the solution for: [`report_bpp.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/raw_codes/report_bpp.Rmd) 93 | - [`report_bpp_fin.pdf`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.pdf) is the generated pdf from [`report_bpp_fin.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.Rmd) 94 | - [`report_bpp_fin.html`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.html) is the generated Html from [`report_bpp_fin.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture06-rmarkdown101/complete_codes/report_bpp_fin.Rmd) 95 | -------------------------------------------------------------------------------- /lecture06-rmarkdown101/complete_codes/report_bpp_fin.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Report on Billion Price Project" 3 | author: 'Name of author' 4 | output: 5 | # html_document 6 | pdf_document: 7 | extra_dependencies: ["float"] 8 | 9 | --- 10 | 11 | ```{r setup, include=FALSE} 12 | rm(list = ls()) 13 | # Here come the packages 14 | library(tidyverse) 15 | library(modelsummary) 16 | library(kableExtra) 17 | ``` 18 | 19 | ## Introduction 20 | 21 | This is a report on *The Billion Price Project*. 22 | 23 | HERE COMES THE MOTIVATION WHY THIS IS A MEANINGFUL PROJECT AND WHAT IS THE MAIN GOAL! 24 | 25 | For more details on the project see: <http://www.thebillionpricesproject.com/> or [this embedded link](http://www.thebillionpricesproject.com/). 26 | 27 | ## Data 28 | 29 | ```{r data import, include=FALSE} 30 | # Import data and basic data munging 31 | bpp_orig <- read_csv('https://osf.io/yhbr5/download') 32 | 33 | bpp_orig <- bpp_orig %>% mutate(p_diff = price_online - price) 34 | 35 | bpp <- bpp_orig %>% 36 | filter(is.na(sale_online)) %>% 37 | filter(!is.na(price)) %>% 38 | filter(!is.na(price_online)) %>% 39 | filter(PRICETYPE == 'Regular Price') %>% 40 | filter(price < 1000) %>% 41 | filter(p_diff < 500 & p_diff > -500) 42 | 43 | ``` 44 | 45 | HERE COMES A DETAILED EXPLANATION ABOUT WHERE THE DATA COMES FROM AND IF IT IS REPRESENTATIVE OR NOT. 46 | 47 | Our main interest is whether online prices are lower or higher than simple retail store prices. We investigated the data on the collected prices and we have the following descriptive statistics on online prices, in-store prices, and their differences.
48 | 49 | ```{r data descriptive, echo=FALSE} 50 | 51 | # Descriptive statistics with pretty names 52 | P95 <- function(x){ quantile(x, 0.95, na.rm=T) } 53 | P05 <- function(x){ quantile(x, 0.05, na.rm=T) } 54 | 55 | # Use datasummary: 56 | # - rewrite names to human readable 57 | # - add title and notes 58 | # - fix position with kable_styling() 59 | datasummary((`Retail` = price) + (`Online` = price_online) + (`Price difference` = p_diff) ~ 60 | Mean + Median + SD + Min + Max + P05 + P95, 61 | data = bpp, 62 | title = 'Descriptive statistics of prices', 63 | notes = 'Data are available from: https://osf.io/yhbr5/') %>% 64 | kableExtra::kable_styling(latex_options = 'hold_position') 65 | ``` 66 | 67 | The number of observations is `r sum(!is.na(bpp$price))` for all of our key variables. 68 | 69 | DESCRIPTION OF THE SUMMARY STATS: WHAT CAN WE LEARN FROM THEM? 70 | 71 | As the focus is the price difference, the next Figure shows the distribution of this variable. 72 | 73 | ```{r data hist, echo=FALSE, warning=FALSE, fig.width=3, fig.height = 2, fig.align="center", fig.cap='Distribution of price differences', fig.pos = 'H' } 74 | 75 | # Add plot: in header, specify the figure size and alignment 76 | # 77 | # add simple plot: take care of labels and limits (and theme) 78 | ggplot(data = bpp) + 79 | geom_density(aes(x = p_diff), fill = 'navyblue') + # note: geom_density has no 'bins' argument 80 | labs(x = 'Price differences', 81 | y = 'Relative Frequency') + 82 | # following commands will be covered in more detail in lecture-07-ggplot-indepth 83 | xlim(-4,4) + # limits for x-axis 84 | theme_bw() + # add a built-in theme 85 | theme(axis.text = element_text(size = 8), # change the font size of axis text/numbers 86 | axis.title = element_text(size = 8)) # change the font size of axis titles 87 | ``` 88 | 89 | DESCRIPTION OF THE FIGURE. WHAT DOES IT TELL US? 90 | 91 | (You may change the order of the descriptive stats and the graph.) 92 | 93 | ## Testing Price Differences 94 | 95 | ```{r test, echo = FALSE } 96 | 97 | test_out <- t.test(bpp$p_diff, mu = 0) 98 | 99 | ``` 100 | 101 | We test the hypothesis that the price difference is zero, i.e. that there is no difference between retail and online prices: 102 | 103 | $$H_0:=\text{price online} - \text{price retail} = 0$$ $$H_A:=\text{price online} - \text{price retail} \neq 0$$ Running a two-sided t-test, we have the t-statistic as `r round(test_out$statistic, 2)` and the p-value as `r round(test_out$p.value, 2)`. The 95% confidence intervals are: `r round(test_out$conf.int[1], 2)` and `r round(test_out$conf.int[2], 2)`. **Based on these results, with 95% confidence we can reject the hypothesis that the two prices would be the same in this particular sample.** 104 | 105 | ## Robustness check / 'Heterogeneity analysis' 106 | 107 | Task: 108 | 109 | - calculate and report t-tests for each country. 110 | - You should report: 111 | - country, 112 | - mean of price differences 113 | - standard errors of the mean for price differences 114 | - number of observations in each country 115 | - t-statistic 116 | - p-value. 117 | 118 | Hints: 119 | 120 | 1. use 'kable()' and, to hold the table position, define the following argument: 'position = "H"' 121 | 2. Take care of the caption, the number of digits you use, and the names of the variables you report! 122 | 3. You may check how the output changes if you use the 'booktabs = TRUE' input for kable! 123 | 4.
In case of html output use something like: 124 | 125 | 126 | ```{r, eval=FALSE} 127 | kable(..., 128 | 'html', booktabs = F, position = 'H') %>% 129 | kable_classic(full_width = F, html_font = 'Cambria') 130 | ``` 131 | 132 | ```{r test countries, echo = FALSE } 133 | 134 | # When creating a factor, it will use the sorted values: 135 | levels_fac <- sort(unique(bpp$COUNTRY)) 136 | # If you want to have pretty variable names when grouping, you need to use the 'labels' input! 137 | # create a new variable for that: 138 | labs <- c('Brazil','China','Germany','Japan','South Africa','USA') 139 | # Note: the order must be the same as in `levels_fac`! 140 | bpp$country <- factor(bpp$COUNTRY, labels = labs) 141 | # Multiple hypothesis testing 142 | testing <- bpp %>% 143 | select(country, p_diff) %>% 144 | group_by(country) %>% 145 | summarise(mean_pdiff = mean(p_diff), 146 | se_pdiff = 1/sqrt(n())*sd(p_diff), 147 | num_obs = n()) 148 | 149 | # Testing in R is easy if one understands the theory! 150 | testing <- mutate(testing, t_stat = mean_pdiff / se_pdiff) 151 | # two-sided p-value from the t distribution 152 | testing <- mutate(testing, p_val = 2 * pt(-abs(t_stat), df = num_obs - 1)) 153 | testing <- mutate(testing, p_val = round(p_val, digits = 4)) 154 | 155 | # Report a tibble or any other dataframe/matrix variable 156 | kable(testing, digits = 4, 157 | caption = 'Online and retail price differences by countries and t-tests', 158 | col.names = c('Country','Mean','SE','Num.Obs','t-stat','p-val'), 159 | 'latex', booktabs = TRUE, position = 'H') # Comment out if html 160 | #'html', booktabs = F, position = 'H') %>% # Uncomment if html 161 | # kable_classic(full_width = F, html_font = 'Cambria') 162 | 163 | ``` 164 | 165 | Extra: In words, select those countries where you can reject the null hypothesis that the prices are the same. With the command '\textcolor{red}{this is red}' you can highlight these countries! 166 | 167 | Countries where, with 95% confidence (or at a 5% significance level), we can reject the null hypothesis that the prices are the same, hence retail and online prices might differ: \textcolor{red}{`r testing$country[ testing$p_val < 0.05]`} 168 | 169 | ## Conclusion 170 | 171 | HERE COMES WHAT WE HAVE LEARNED AND WHAT WOULD STRENGTHEN AND WEAKEN OUR ANALYSIS. 172 | -------------------------------------------------------------------------------- /lecture06-rmarkdown101/complete_codes/report_bpp_fin.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture06-rmarkdown101/complete_codes/report_bpp_fin.pdf -------------------------------------------------------------------------------- /lecture06-rmarkdown101/raw_codes/report_bpp.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Report on Billion Price Project" 3 | author: 'Name of author' 4 | output: 5 | pdf_document: 6 | extra_dependencies: ["float"] 7 | # 8 | # html_document 9 | --- 10 | 11 | ```{r setup, include=FALSE} 12 | rm(list = ls()) 13 | # Here come the packages 14 | library(tidyverse) 15 | library(modelsummary) 16 | library(kableExtra) 17 | ``` 18 | 19 | ## Introduction 20 | 21 | This is a report on *The Billion Price Project*. 22 | 23 | HERE COMES THE MOTIVATION WHY THIS IS A MEANINGFUL PROJECT AND WHAT IS THE MAIN GOAL! 24 | 25 | For more details on the project see: <http://www.thebillionpricesproject.com/> or [this embedded link](http://www.thebillionpricesproject.com/).
26 | 27 | ## Data 28 | 29 | ```{r data import, include=FALSE} 30 | # Import data and basic data munging 31 | bpp_orig <- read_csv('https://osf.io/yhbr5/download') 32 | 33 | bpp_orig <- bpp_orig %>% mutate(p_diff = price_online - price) 34 | 35 | bpp <- bpp_orig %>% 36 | filter(is.na(sale_online)) %>% 37 | filter(!is.na(price)) %>% 38 | filter(!is.na(price_online)) %>% 39 | filter(PRICETYPE == 'Regular Price') %>% 40 | filter(price < 1000) %>% 41 | filter(p_diff < 500 & p_diff > -500) 42 | 43 | ``` 44 | 45 | HERE COMES A DETAILED EXPLANATION ABOUT WHERE THE DATA COMES FROM AND IF IT IS REPRESENTATIVE OR NOT. 46 | 47 | Our main interest is whether online prices are lower or higher than simple retail store prices. We investigated the data on the collected prices and we have the following descriptive statistics on online prices, in-store prices, and their differences. 48 | 49 | ```{r data descriptive, echo=FALSE} 50 | 51 | # Descriptive statistics with pretty names 52 | P95 <- function(x){ quantile(x, 0.95, na.rm=T) } 53 | P05 <- function(x){ quantile(x, 0.05, na.rm=T) } 54 | 55 | # Use datasummary: 56 | # - rewrite names to human readable 57 | # - add title and notes 58 | # - fix position with kable_styling() 59 | datasummary((`Retail` = price) + (`Online` = price_online) + (`Price difference` = p_diff) ~ 60 | Mean + Median + SD + Min + Max + P05 + P95, 61 | data = bpp, 62 | title = 'Descriptive statistics of prices', 63 | notes = 'Data are available from: https://osf.io/yhbr5/') %>% 64 | kableExtra::kable_styling(latex_options = 'hold_position') 65 | ``` 66 | 67 | The number of observations is `r sum(!is.na(bpp$price))` for all of our key variables. 68 | 69 | DESCRIPTION OF THE SUMMARY STATS: WHAT CAN WE LEARN FROM THEM? 70 | 71 | As the focus is the price difference, the next Figure shows the distribution of this variable. 72 | 73 | ```{r data hist, echo=FALSE, warning=FALSE, fig.width=3, fig.height = 2, fig.align="center", fig.cap='Distribution of price differences', fig.pos = 'H' } 74 | 75 | # Add plot: in header, specify the figure size and alignment 76 | # 77 | # add simple plot: take care of labels and limits (and theme) 78 | ggplot(data = bpp) + 79 | geom_density(aes(x = p_diff), fill = 'navyblue') + # note: geom_density has no 'bins' argument 80 | labs(x = 'Price differences', 81 | y = 'Relative Frequency') + 82 | # following commands will be covered in more detail in lecture-07-ggplot-indepth 83 | xlim(-4,4) + # limits for x-axis 84 | theme_bw() + # add a built-in theme 85 | theme(axis.text = element_text(size = 8), # change the font size of axis text/numbers 86 | axis.title = element_text(size = 8)) # change the font size of axis titles 87 | ``` 88 | 89 | DESCRIPTION OF THE FIGURE. WHAT DOES IT TELL US? 90 | 91 | (You may change the order of the descriptive stats and the graph.) 92 | 93 | ## Testing Price Differences 94 | 95 | ```{r test, echo = FALSE } 96 | 97 | test_out <- t.test(bpp$p_diff, mu = 0) 98 | 99 | ``` 100 | 101 | We test the hypothesis that the price difference is zero, i.e. that there is no difference between retail and online prices: 102 | 103 | $$H_0:=\text{price online} - \text{price retail} = 0$$ $$H_A:=\text{price online} - \text{price retail} \neq 0$$ Running a two-sided t-test, we have the t-statistic as `r round(test_out$statistic, 2)` and the p-value as `r round(test_out$p.value, 2)`. The 95% confidence intervals are: `r round(test_out$conf.int[1], 2)` and `r round(test_out$conf.int[2], 2)`.
**Based on these results, with 95% confidence we can reject the hypothesis that the two prices would be the same in this particular sample.** 104 | 105 | ## Robustness check / 'Heterogeneity analysis' 106 | 107 | Task: 108 | 109 | - calculate and report t-tests for each country. 110 | - You should report: 111 | - country, 112 | - mean of price differences 113 | - standard errors of the mean for price differences 114 | - number of observations in each country 115 | - t-statistic 116 | - p-value. 117 | 118 | Hints: 119 | 120 | 1. use 'kable()' and, to hold the table position, define the following argument: 'position = "H"' 121 | 2. Take care of the caption, the number of digits you use, and the names of the variables you report! 122 | 3. You may check how the output changes if you use the 'booktabs = TRUE' input for kable! 123 | 4. In case of html output use something like: 124 | 125 | 126 | 127 | ```{r, eval=FALSE} 128 | kable(..., 129 | 'html', booktabs = F, position = 'H') %>% 130 | kable_classic(full_width = F, html_font = 'Cambria') 131 | ``` 132 | 133 | 134 | 135 | Extra: In words, select those countries where you can reject the null hypothesis that the prices are the same. With the command '\textcolor{red}{this is red}' you can highlight these countries! 136 | 137 | Countries where, with 95% confidence (or at a 5% significance level), we can reject the null hypothesis that the prices are the same, hence retail and online prices might differ: 138 | 139 | Task: put here the country names in red with p-values less than 5%. 140 | 141 | 142 | ## Conclusion 143 | 144 | HERE COMES WHAT WE HAVE LEARNED AND WHAT WOULD STRENGTHEN AND WEAKEN OUR ANALYSIS. 145 | -------------------------------------------------------------------------------- /lecture07-ggplot-indepth/raw_codes/homework_ggpplot_runfile.R: -------------------------------------------------------------------------------- 1 | ######################### 2 | # # 3 | # Lecture 07 # 4 | # # 5 | # Runner for # 6 | # assignment # 7 | # # 8 | # Deadline: # 9 | # # 10 | # # 11 | # # 12 | ######################### 13 | 14 | ## 15 | # Create your own theme for ggplot! 16 | # 17 | # 0) Clear your environment 18 | 19 | # 1) Load tidyverse 20 | 21 | # 2) use the same data with the same filter as in class! 22 | 23 | # 3) Call your personalized ggplot function 24 | 25 | # 4) Run the following command: 26 | ggplot(filter(df, city == 'Vienna'), aes(x = price)) + 27 | geom_histogram(alpha = 0.8, binwidth = 20) + 28 | labs(x = 'Hotel Prices in Vienna', y = 'Density') + 29 | theme_YOURFUNCTIONNAME() 30 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /lecture07-ggplot-indepth/raw_codes/theme_RENAMEME.R: -------------------------------------------------------------------------------- 1 | ######################### 2 | # # 3 | # Assignment # 4 | # Lecture 07 # 5 | # # 6 | # Deadline: # 7 | # # 8 | ######################### 9 | 10 | ## 11 | # Create your own theme for ggplot! 12 | # In principle you should use this ggplot theme in the remainder of the course for assignments etc. 13 | # Of course, you can change it along the way, but I would like to encourage all of you to use a personalized theme! 14 | # 15 | # !! Please RENAME this file and call it accordingly in the runfile !! 16 | # 17 | # To get 7 points you will need to modify at least 7 parameters of theme_classic or theme_bw!
18 | # 19 | # Useful resources you may want to check: 20 | # https://www.datanovia.com/en/blog/ggplot-themes-gallery/ 21 | # https://ggplot2.tidyverse.org/reference/theme.html 22 | # Or the book's theme: 23 | # https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch00-tech-prep/theme_bg.R 24 | # Some more advanced/elaborated examples: 25 | # https://bookdown.org/rdpeng/RProgDA/building-a-new-theme.html 26 | # https://towardsdatascience.com/5-steps-for-creating-your-own-ggplot-theme-656e79a96b9 27 | # 28 | # and many more.... -------------------------------------------------------------------------------- /lecture07-ggplot-indepth/raw_codes/theme_bluewhite.R: -------------------------------------------------------------------------------- 1 | ####################################################### 2 | # # 3 | # Lecture 07 # 4 | # # 5 | # ggplot in-depth # 6 | # `theme_bluewhite()` # 7 | # # 8 | # first external function: # 9 | # creating your own theme # 10 | # # 11 | # For complete list of theme options # 12 | # see: # 13 | # https://ggplot2.tidyverse.org/reference/theme.html # 14 | # # 15 | ####################################################### 16 | 17 | theme_bluewhite <- function(base_size = 11, base_family = '') { 18 | # Inherit the basic properties of theme_bw 19 | theme_bw() %+replace% 20 | # Replace the following items: 21 | theme( 22 | # The grids on the background 23 | panel.grid.major = element_line(color = 'white'), 24 | # The background color 25 | panel.background = element_rect(fill = 'lightblue'), 26 | # The axis line 27 | axis.line = element_line(color = 'red'), 28 | # Little lines called ticks on the axis 29 | axis.ticks = element_line(color = 'steelblue'), 30 | # Color and font size for the numbers on the axis 31 | axis.text = element_text(color = 'navyblue', size = 8) 32 | ) 33 | } 34 | -------------------------------------------------------------------------------- /lecture08-conditionals/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 08: Conditional Programming 2 | 3 | ## Motivation 4 | 5 | Deciding what to do on a case-by-case basis is common in decision making and also in programming. Conditional programming enables writing code with this in mind: if a certain condition holds, execute a command; otherwise, do something different. Conditional programming is a basic programming technique that emerges in many situations. Adding this technique to the programming toolbox is a must for data scientists. 6 | 7 | ## This lecture 8 | 9 | This lecture introduces students to conditional programming with `if-else` statements. It covers the essentials, as well as logical operations with vectors, creating new variables with conditionals, and some extra material.
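For a quick taste of the anatomy discussed below, a minimal sketch of a conditional on a single value and on a vector (the values are illustrative):

```r
x <- -5
if (x > 0) {
  print('positive number')
} else {
  print('non-positive number')
}

v <- c(-1, 0, 3)
any(v > 0)  # TRUE: at least one element is positive
all(v > 0)  # FALSE: not every element is positive
```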
10 | 11 | 12 | ## Learning outcomes 13 | After successfully live-coding the material (see: [`conditionals.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture08-conditionals/conditionals.md)), students will have knowledge of 14 | 15 | - How a conditional statement works 16 | - What the crucial elements of an `if-else` statement are 17 | - Good practices for writing a conditional 18 | - How multiple conditions work 19 | - single-valued variables with multiple conditions 20 | - vector variables with conditions 21 | - with vectors: 22 | - understanding the differences between `|`, `||`, `&`, `&&`, `any()` and `all()` 23 | - understanding pairwise comparison of vectors 24 | - understanding different levels of evaluation of logical operators. 25 | - creating a new variable with conditionals 26 | - base R method with logicals 27 | - `ifelse()` function with `tidyverse` 28 | - extra material 29 | - conditional installation of packages 30 | - spacing and formatting the `if-else` statements 31 | - `xor` function 32 | - `switch` statement 33 | 34 | ## Datasets used 35 | 36 | - [wms-management](https://gabors-data-analysis.com/datasets/#wms-management-survey) 37 | 38 | ## Lecture Time 39 | 40 | Ideal overall time: **10-20 mins**. 41 | 42 | This is a relatively short lecture, and it can be even shorter if logical operators with vectors are neglected, although a good understanding of the anatomy of an `if-else` statement is important. 43 | 44 | ## Homework 45 | 46 | *Type*: quick practice, approx 15 mins, together with [lecture09-loops](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture09-loops), [lecture10-random-numbers](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture10-random-numbers), and [lecture11-functions](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture11-functions). 47 | 48 | Check the common homework [here](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/README.md). 49 | 50 | ## Further material 51 | 52 | - Hadley Wickham and Garrett Grolemund R for Data Science [Chapter 19.4](https://r4ds.had.co.nz/functions.html) provides further material on conditionals. 53 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 10](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/10_functional_programming.Rmd) provides useful guidelines on conditionals along with other programming skills. 54 | 55 | 56 | ## File structure 57 | 58 | - [`conditionals.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture08-conditionals/conditionals.md) provides material for the live coding session with explanations.
59 | - [`conditionals.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture08-conditionals/conditionals.Rmd) is the generating Rmarkdown file for `conditionals.md` 60 | - [`conditionals.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture08-conditionals/conditionals.R) is a possible realization of the live coding session 61 | -------------------------------------------------------------------------------- /lecture08-conditionals/conditionals.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## ## 3 | ## Lecture 08 ## 4 | ## ## 5 | ## Conditional Programming ## 6 | ## ## 7 | ## ## 8 | ################################## 9 | 10 | 11 | # Simple if-statement 12 | x <- 5 13 | if (x == 5){ 14 | print('x is equal to 5') 15 | } 16 | 17 | # Create an if-else statement 18 | x2 <- 4 19 | if (x2 == 5){ 20 | print('x2 is equal to 5') 21 | } else{ 22 | print('x2 is not equal to 5') 23 | } 24 | 25 | # Multi-branch if-else statement 26 | # play around with the value of x 27 | x <- -5 28 | if (x > 0){ 29 | print('positive number') 30 | } else if(x == 0){ 31 | print('zero value') 32 | } else{ 33 | print('negative number') 34 | } 35 | 36 | 37 | ##### 38 | # Multiple conditions 39 | 40 | # Multiple logical statements 41 | y <- 10 42 | if (x > 0 && y > 0){ 43 | print('x and y are positive numbers') 44 | } else{ 45 | print('one of y or x is non-positive') 46 | } 47 | 48 | ### 49 | # Conditional with one vector 50 | v <- c(0, 1, 10) 51 | 52 | # First, let's check if the elements of v are larger than 0 53 | v > 0 54 | 55 | # any or all functions 56 | if (any(v > 0)){ 57 | print('We have at least one element in v, which is larger than zero!') 58 | } else { 59 | print('All elements in v are smaller than zero!') 60 | } 61 | 62 | 63 | ### 64 | # Conditional with two or more vectors 65 | q <- c(2, 0, 8) 66 | 67 | # use of single-operators 68 | # note: `>` binds tighter than `|` or `&`, so these are evaluated as 69 | # v | (q > 0) and v & (q > 0), with v coerced to logical (non-zero = TRUE) 70 | v | q > 0 71 | v & q > 0 72 | 73 | # At this point we can check the differences between single-operators and double-operators 74 | # (note: since R 4.3, `||` and `&&` throw an error for vectors longer than one; 75 | # the examples below assume an older R version) 76 | v | q > 0 77 | v || q > 0 78 | 79 | # Using double-operators will imply `any()` for `||` and `all()` for `&&`: 80 | (v || q > 0) == any(v | q > 0) 81 | (v && q > 0) == all(v & q > 0) 82 | 83 | # be careful when using these operators with vectors, 84 | # as the results can be different if you mix these up, e.g. 85 | v && q > 0 86 | any(v & q > 0) 87 | 88 | ##### 89 | # Using conditionals when creating new variables 90 | 91 | # Import wms-management data 92 | library(tidyverse) 93 | wms <- read_csv('https://osf.io/uzpce/download') 94 | 95 | # Method 1: use base-R commands 96 | wms$firm_size <- NA_character_ 97 | wms$firm_size[ wms$emp_firm >= 1000 ] <- 'large' 98 | wms$firm_size[ wms$emp_firm < 1000 & wms$emp_firm >= 200 ] <- 'medium' 99 | wms$firm_size[ wms$emp_firm < 200 ] <- 'small' 100 | 101 | # Method 2: ifelse function 102 | wms <- wms %>% mutate(firm_size2 = ifelse(emp_firm >= 1000, 'large', 103 | ifelse(emp_firm < 1000 & emp_firm >= 200, 'medium', 104 | ifelse(emp_firm < 200, 'small', NA_character_)))) 105 | 106 | # Task: check they are the same: 107 | all(wms$firm_size == wms$firm_size2, na.rm = T) 108 | 109 | ###### 110 | # Extra material 111 | 112 | 113 | # Spacing and formatting 114 | 115 | if (x > 5){ print(' x > 5') } else { print('x <= 5') } 116 | # However, it is not recommended as it makes reading the code much harder.
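# Extra: a sketch of the same variable creation with dplyr's case_when(),
# which often reads more clearly than nested ifelse() calls
# (assumes wms is loaded as above; firm_size3 is just an illustrative name)
wms <- wms %>%
  mutate(firm_size3 = case_when(
    emp_firm >= 1000 ~ 'large',
    emp_firm >= 200  ~ 'medium',
    emp_firm < 200   ~ 'small'
  ))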
114 | 115 | 116 | # The xor() operator 117 | # xor takes two logical values/vectors as inputs and returns the elementwise exclusive-or. 118 | xor(c(T,F,F,T),c(T,T,F,F)) 119 | 120 | 121 | # `switch` statement 122 | type <- 'apple' 123 | switch(type, 124 | apple = 'I love apple!', 125 | banana = 'I love banana!', 126 | orange = 'I love orange!', 127 | stop('type must be either \'apple\', \'banana\', or \'orange\'') # stop() throws the error; there is no error() function in base R 128 | ) 129 | 130 | # try different values of type which are not in the listed options! 131 | -------------------------------------------------------------------------------- /lecture09-loops/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 09: Programming loops 2 | 3 | ## Motivation 4 | 5 | There are many cases when one needs to do repetitive coding: carry out the same commands but on a different object/data. Writing loops is one of the best tools to carry out such repetition with only a few modifications to the code. It also reduces code duplication, which has three main benefits: 6 | 7 | 1. It’s easier to see the intent of your code because your eyes are drawn to what’s different, not what stays the same. 8 | 2. It’s easier to respond to changes in requirements. As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied and pasted the code. 9 | 3. You’re likely to have fewer bugs because each line of code is used in more places. 10 | 11 | *([Hadley Wickham and Garrett Grolemund R for Data Science Ch. 21.1](https://r4ds.had.co.nz/iteration.html))* 12 | 13 | 14 | ## This lecture 15 | 16 | This lecture introduces students to imperative programming with `for` and `while` loops. Furthermore, it provides an exercise with the [sp500](https://gabors-data-analysis.com/datasets/#sp500) dataset to calculate yearly and monthly returns. 17 | 18 | The [Chapter 05, A: What likelihood of loss to expect on a stock portfolio?](https://gabors-data-analysis.com/casestudies/#ch05a-what-likelihood-of-loss-to-expect-on-a-stock-portfolio) case study was the starting point for developing the exercise. 19 | 20 | 21 | ## Learning outcomes 22 | After successfully live-coding the material (see: [`loops.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture09-loops/loops.md)), students will know 23 | 24 | - What imperative programming and functional programming are for iterations 25 | - What a for loop is 26 | - what the possible inputs for an iteration vector are 27 | - how to measure CPU time 28 | - what the possible issues with the for loop are 29 | - What a while loop is 30 | - what the possible drawbacks of a while loop are 31 | - how to use a for loop instead 32 | - `break` command 33 | - Calculate returns over different time periods. 34 | 35 | ## Datasets used 36 | 37 | - [sp500](https://gabors-data-analysis.com/datasets/#sp500) 38 | 39 | ## Lecture Time 40 | 41 | Ideal overall time: **10-20 mins**. 42 | 43 | This is a relatively short lecture, and it can be even shorter if measuring CPU time and/or the exercise are neglected. 44 | 45 | ## Homework 46 | 47 | *Type*: quick practice, approx 15 mins, together with [lecture08-conditionals](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture08-conditionals), [lecture10-random-numbers](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture10-random-numbers), and [lecture11-functions](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture11-functions).
48 | 49 | Check the common homework [here](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/README.md). 50 | 51 | ## Further material 52 | 53 | - More materials on the case study can be found in Gabor's da_case_studies repository: [ch05-stock-market-loss-generalize](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch05-stock-market-loss-generalize) 54 | - Hadley Wickham and Garrett Grolemund R for Data Science [Chapter 21](https://r4ds.had.co.nz/iteration.html) provides further material on iterations, covering both imperative and functional programming. 55 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 10](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/10_functional_programming.Rmd) provides useful guidelines on iterations along with other programming skills. 56 | 57 | 58 | ## File structure 59 | 60 | - [`loops.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture09-loops/loops.md) provides material for the live coding session with explanations. 61 | - [`loops.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture09-loops/loops.Rmd) is the generating Rmarkdown file for `loops.md` 62 | - [`loops.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture09-loops/loops.R) is a possible realization of the live coding session 63 | -------------------------------------------------------------------------------- /lecture09-loops/loops.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## ## 3 | ## Imperative Programming ## 4 | ## for and while loops ## 5 | ## ## 6 | ################################## 7 | 8 | rm(list = ls()) 9 | 10 | # Case 1) purest form of a for loop 11 | for (i in 1 : 5){ 12 | print(i) 13 | } 14 | 15 | # Case 2) 16 | for (i in seq(50, 58)){ 17 | print(i) 18 | } 19 | 20 | # Case 3) 21 | for (i in c(10,9,-10,8)){ 22 | print(i) 23 | } 24 | 25 | # Play around with lists 26 | for (i in list(2, 'a', TRUE, sqrt(2))){ 27 | print(i) 28 | } 29 | 30 | # Create a loop which gives the cumulative sum: 31 | v <- c(10, 6, 5, 32, 45, 23) 32 | cs_v <- v 33 | for (i in 2 : length(v)){ 34 | cs_v[ i ] <- cs_v[ i - 1 ] + cs_v[ i ] 35 | } 36 | v 37 | cs_v 38 | cumsum(v) 39 | 40 | # Also good to know 41 | seq_along(v) 42 | 43 | cs_v2 <- 0 44 | for (i in seq_along(v)){ 45 | cs_v2 <- cs_v2 + v[ i ] 46 | } 47 | cs_v2 48 | 49 | # Task: check if all the elements in cs_v are the same as 50 | # the result of cumsum(v), and if it is true 51 | # print out 'Good job!', otherwise: 'There is a mistake!'
52 | 53 | if (all(cs_v == cumsum(v))){ 54 | print('Good job!') 55 | } else { 56 | print('There is a mistake!') 57 | } 58 | 59 | ## Measure CPU time 60 | if (!require(tictoc)){ 61 | install.packages('tictoc') 62 | library(tictoc) 63 | } 64 | 65 | 66 | iter_num <- 10000 67 | 68 | # Sloppy way to do loops: 69 | tic('Sloppy way') 70 | q <- c() 71 | for (i in 1 : iter_num){ 72 | q <- c(q, i) 73 | } 74 | toc() 75 | 76 | # Proper way 77 | tic('Good way') 78 | r <- double(length = iter_num) 79 | for (i in 1 : iter_num){ 80 | r[ i ] <- i 81 | } 82 | toc() 83 | 84 | ## 85 | # While loop 86 | x <- 0 87 | while (x < 10) { 88 | x <- x + 1 89 | print(x) 90 | } 91 | x 92 | 93 | # Instead use a for loop with break 94 | max_iter <- 10000 95 | x <- 0 96 | flag <- FALSE 97 | for (i in 1 : max_iter){ 98 | if (x < 10){ 99 | x <- x + 1 100 | } else{ 101 | flag <- TRUE 102 | break 103 | } 104 | } 105 | x 106 | if (flag) { 107 | print('Successful iteration!') 108 | }else{ 109 | print('Did not satisfy the stopping criterion!') 110 | } 111 | 112 | #### 113 | # Exercise sp500 114 | library(tidyverse) 115 | # Load data 116 | sp500 <- read_csv('https://osf.io/h64z2/download', na = c('', '#N/A')) 117 | # Filter out missing and create a year variable 118 | sp500 <- sp500 %>% filter(!is.na(VALUE)) %>% 119 | mutate(year = format(DATE, '%Y')) 120 | 121 | # Get unique years and create tibble 122 | years <- unique(sp500$year) 123 | return_yearly <- tibble(years = years, return = NA) 124 | 125 | # Initialize: the last price of the first year 126 | aux <- sp500$VALUE[ sp500$year == years[ 1 ] ] 127 | lyp <- aux[ length(aux) ] 128 | rm(aux) 129 | # start from the second year in the data 130 | for (i in 2 : length(years)){ 131 | # get the values for specific year 132 | value_year_i <- sp500$VALUE[ sp500$year == years[ i ] ] 133 | # last day's price 134 | ldp <- value_year_i[ length(value_year_i) ] 135 | # calculate the return 136 | return_yearly$return[ i ] <- (ldp - lyp) / lyp * 100 137 | # save this year's last value as last year's value 138 | lyp <- ldp 139 | } 140 | 141 | # Check results 142 | return_yearly 143 | 144 | 145 | -------------------------------------------------------------------------------- /lecture10-random-numbers/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 10: Random Numbers and Random Sampling 2 | 3 | ## Motivation 4 | 5 | While dealing with data, the use of random numbers is essential to understanding modern data analytics. In many cases, you will not use them directly, but many advanced models (e.g. Machine Learning techniques) use them. A general understanding of how these methods work and what their limitations are is beneficial. Some examples of random number usage: 6 | 7 | - get a random (sub)-sample (e.g. cross-validation techniques) 8 | - bootstrapping (e.g. calculate standard errors) 9 | - estimating models (e.g. random forest, Markov-Chain-Monte-Carlo or (quasi) maximum-likelihood methods) 10 | - ‘stochastic’ optimization methods (e.g. genetic algorithms) 11 | 12 | We cover the main properties of random number generators and how to use them for reproducible results. 13 | 14 | ## This lecture 15 | 16 | This lecture introduces students to how to generate random numbers and deal with random sampling in R. 17 | 18 | Relates to case studies: 19 | - [Chapter 03, D: Distributions of body height and income](https://gabors-data-analysis.com/casestudies/#ch03d-distributions-of-body-height-and-income) to show random numbers generated from theoretical distributions vs empirical distributions.
20 | - [Chapter 05, A: What likelihood of loss to expect on a stock portfolio?](https://gabors-data-analysis.com/casestudies/#ch05a-what-likelihood-of-loss-to-expect-on-a-stock-portfolio) to show random sampling. 21 | 22 | 23 | ## Learning outcomes 24 | After successfully live-coding the material (see: [`random_numbers.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture10-random-numbers/random_numbers.md)), students will know 25 | 26 | - When and why random numbers are used in R 27 | - Different `distributions` and their properties available in R 28 | - Specifically get familiar with `runif`, `rnorm`, and `rlnorm` 29 | - How to control for randomness via `set.seed` 30 | - How the number of observations generated by `rnorm` is associated with the theoretical normal distribution 31 | - How random sampling works: 32 | - `sample_n` function 33 | - other alternatives such as `slice_sample`, `sample` and `sample.int` 34 | 35 | ## Datasets used 36 | 37 | - [height-income-distributions](https://gabors-data-analysis.com/datasets/#height-income-distributions) 38 | - [sp500](https://gabors-data-analysis.com/datasets/#sp500) 39 | 40 | ## Lecture Time 41 | 42 | Ideal overall time: **10-20 mins**. 43 | 44 | This is a relatively short lecture with a little coding, but much background knowledge is needed on how random number generation works. 45 | If you want to shorten the lecture, skip the exercise with the height and income distributions. 46 | 47 | ## Homework 48 | 49 | *Type*: quick practice, approx 15 mins, together with [lecture08-conditionals](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture08-conditionals), [lecture09-loops](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture09-loops), and [lecture11-functions](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture11-functions) 50 | 51 | Check the common homework [here](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/README.md). 52 | 53 | ## Further material 54 | 55 | - Jae Yeon Kim: R Fundamentals for Public Policy, Course material, [Lecture 10](https://github.com/KDIS-DSPPM/r-fundamentals/blob/main/lecture_notes/10_functional_programming.Rmd) touches the topic, but not too deeply. 56 | - There are some useful bookdown materials by [Ko Chiu Yu: Technical Analysis with R, Chapter 4.2](https://bookdown.org/kochiuyu/Technical-Analysis-with-R/random-number.html) and [Nathaniel D. Phillips: YaRrr! The Pirate’s Guide to R, Chapter 5.3](https://bookdown.org/ndphillips/YaRrr/generating-random-data.html) 57 | 58 | 59 | ## File structure 60 | 61 | - [`random_numbers.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture10-random-numbers/random_numbers.md) provides material for the live coding session with explanations.
62 | - [`random_numbers.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture10-random-numbers/random_numbers.Rmd) is the generating Rmarkdown file for `random_numbers.md` 63 | - [`random_numbers.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture10-random-numbers/random_numbers.R) is a possible realization of the live coding session 64 | -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## ## 3 | ## Random Numbers and ## 4 | ## Random Sampling in R ## 5 | ## ## 6 | ################################## 7 | 8 | # Clear memory and load packages 9 | rm(list=ls()) 10 | library(tidyverse) 11 | 12 | 13 | # 1) case uniform distribution random sampling 14 | n <- 10 15 | x <- runif(n, min = 0, max = 10) 16 | x 17 | 18 | # 2) Set the seed of the random number generator for reproducibility 19 | set.seed(123) 20 | x <- runif(n, min = 0, max = 10) 21 | 22 | rm(x, n) 23 | 24 | 25 | # Play around with n 26 | n <- 10000 27 | y <- rnorm(n, mean = 1, sd = 2) 28 | df <- tibble(var1 = y) 29 | ggplot(df, aes(x = var1)) + 30 | geom_histogram(aes(y = ..density..), fill = 'navyblue') + 31 | stat_function(fun = dnorm, args = list(mean = 1, sd = 2), 32 | color = 'red', size = 1.5) 33 | 34 | # There are some other types of distributions: 35 | # rbinom, rexp, rlnorm, etc. 36 | 37 | ### 38 | # Exercise with height-income distributions 39 | 40 | # Get data from OSF 41 | df <- read_csv('https://osf.io/rnuh2/download') 42 | # set height as numeric 43 | df <- df %>% mutate(height = as.numeric(height)) 44 | 45 | # Create an empirical histogram of height with the theoretical normal 46 | emp_height <- ggplot(df, aes(x = height)) + 47 | geom_histogram(aes(y = ..density..), binwidth = 0.03, 48 | fill = 'navyblue', alpha = 0.6) + 49 | stat_function(fun = dnorm, color = 'red', 50 | args = with(df, c(mean = mean(height, na.rm = T), sd = sd(height, na.rm = T)))) + 51 | labs(x='Height (meters)', y='Density') + 52 | theme_bw() 53 | 54 | emp_height 55 | 56 | # Calculate the log-normal parameters (meanlog, sdlog) from the empirical moments 57 | mu <- with(filter(df, hhincome < 1000), 58 | log(mean(hhincome)^2 / sqrt(var(hhincome) + mean(hhincome)^2))) 59 | sigma <- with(filter(df, hhincome < 1000), 60 | sqrt(log(var(hhincome) / mean(hhincome)^2 + 1))) 61 | 62 | emp_inc <- ggplot(filter(df, hhincome < 1000), aes(x = hhincome)) + 63 | geom_histogram(aes(y = ..density..), binwidth = 10, 64 | fill = 'navyblue', alpha = 0.6) + 65 | stat_function(fun = dlnorm, colour= 'red', 66 | args = c(meanlog = mu, sdlog = sigma)) + 67 | labs(x='Income (thousand $)', y='Density') + 68 | theme_bw() 69 | 70 | emp_inc 71 | 72 | # Generate artificial data 73 | set.seed(123) 74 | artif <- tibble(height_art = rnorm(nrow(df), mean(df$height, na.rm = T), 75 | sd = sd(df$height, na.rm = T)), 76 | inc_art = rlnorm(nrow(df), meanlog = mu, sdlog = sigma)) 77 | 78 | # Compare height 79 | emp_height + geom_histogram(data = artif, aes(x = height_art, y = ..density..), 80 | binwidth = 0.03, boundary = 1.3, 81 | fill = 'orange', alpha = 0.3) 82 | 83 | # Compare income 84 | emp_inc + geom_histogram(data = artif, aes(x = inc_art, y = ..density..), binwidth = 10, 85 | fill = 'orange', alpha = 0.3) + 86 | xlim(0,500) 87 | 88 | # Task: log-income 89 | 90 | # Create log income and its artificial counterpart as well 91 | set.seed(123) 92 | df <- df %>% mutate(lninc = ifelse(hhincome > 0, log(hhincome), 0), 93 |
lninc_art = rnorm(nrow(df), mean = mean(lninc, na.rm = T), 94 | sd = sd(lninc, na.rm = T))) 95 | 96 | ggplot(df) + 97 | geom_histogram(aes(x = lninc, y = ..density..), binwidth = 0.3, 98 | fill = 'navyblue', alpha = 0.6) + 99 | geom_histogram(aes(x = lninc_art, y = ..density..), binwidth = 0.3, 100 | fill = 'orange', alpha = 0.3) + 101 | stat_function(fun = dnorm, colour= 'red', 102 | args = with(df, c(mean = mean(lninc, na.rm = T), 103 | sd = sd(lninc, na.rm = T)))) + 104 | labs(x='Log-Income (thousand $)', y='Density') + 105 | theme_bw() 106 | 107 | 108 | 109 | ##### 110 | # Random sampling from a dataset/variable: 111 | 112 | sp500 <- read_csv('https://osf.io/h64z2/download') 113 | head(sp500) 114 | 115 | 116 | # Sample_1 is without replacement 117 | set.seed(123) 118 | sample_1 <- slice_sample(sp500, n = 1000, replace = F) 119 | head(sample_1) 120 | # Sample_2 with replacement -> useful for bootstrapping 121 | sample_2 <- slice_sample(sp500, n = 1000, replace = T) 122 | 123 | # alternatively: 124 | set.seed(123) 125 | sample_1a <- sample_n(sp500, 1000, replace = FALSE) 126 | set.seed(123) 127 | sample_1b <- tibble(VALUE = sample(sp500$VALUE, 1000, replace = FALSE)) 128 | set.seed(123) 129 | sample_1c <- sp500[sample.int(nrow(sp500), 1000, replace = FALSE),] 130 | # Note: sample.int() needs the total number of rows as its first argument; 131 | # sample.int(1000, replace = FALSE) alone would just permute the first 1000 rows of sp500. 132 | 133 | 134 | -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-3-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-3-1.png -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-4-.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-4-.gif -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-4-1.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-4-1.gif -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-6-1.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture10-random-numbers/random_numbers_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /lecture11-functions/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 11: Writing Functions 2 | 3 | ## Motivation 4 | 5 | One of the best ways to improve your reach as a data scientist is to write functions. Functions allow automating common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste: 6 | 7 | 1. You can give a function an evocative name that makes your code easier to understand. 8 | 2. As requirements change, you only need to update code in one place, instead of many. 9 | 3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another). 10 | 11 | Writing good functions is a lifetime journey. Even after using R for many years, one can still learn new techniques and better ways of approaching old problems. The goal is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately. ([Hadley Wickham and Garrett Grolemund: R for Data Science, Ch. 19](https://r4ds.had.co.nz/functions.html)) 12 | 13 | ## This lecture 14 | 15 | This lecture introduces functions, how they are structured and how to write them. Students will know how to write basic functions, control for input(s) and output(s), and error-handling. 16 | 17 | Case studies related to lecture: 18 | - [Chapter 05, A: What likelihood of loss to expect on a stock portfolio?](https://gabors-data-analysis.com/casestudies/#ch05a-what-likelihood-of-loss-to-expect-on-a-stock-portfolio) as homework to calculate bootstrap standard errors and calculate confidence intervals. 19 | - [Chapter 06, A: Comparing online and offline prices: testing the difference](https://gabors-data-analysis.com/casestudies/#ch06a-comparing-online-and-offline-prices-testing-the-difference) and [Chapter 06, B: Testing the likelihood of loss on a stock portfolio](https://gabors-data-analysis.com/casestudies/#ch06b-testing-the-likelihood-of-loss-on-a-stock-portfolio) as at the end of the lecture we build a function to show the distribution of t-statistics. 
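As a quick preview of the structure discussed in this lecture, a minimal sketch of a function with an input check, a pre-set input, and multiple outputs (hypothetical example, not from the lecture files; the live-coding session builds up each element step by step):

```r
my_stats <- function(x, level = 0.95){    # `level` is a pre-set input
  stopifnot(is.numeric(x))                # input check with an informative error
  mean_x <- mean(x, na.rm = TRUE)
  sd_x   <- sd(x, na.rm = TRUE)
  return(list(mean = mean_x, sd = sd_x))  # multiple outputs via a list
}
```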
20 | 21 | In addition to writing functions, it uses data from the case study [Chapter 04, A: Management quality and firm size: describing patterns of association](https://gabors-data-analysis.com/casestudies/#ch04a-management-quality-and-firm-size-describing-patterns-of-association). 22 | 23 | 24 | ## Learning outcomes 25 | After successfully live-coding the material (see: [`functions.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.md)), students will know 26 | 27 | - What is the structure of a function 28 | - Output of a function 29 | - simple output 30 | - controlling for the output with `return` 31 | - multiple outputs with lists 32 | - Controlling for the input 33 | - `stopifnot` function 34 | - other methods and error-handling in general 35 | - pre-set inputs 36 | - Exercise for the sampling distribution of the t-statistics, to use: 37 | - conditionals 38 | - loops 39 | - random numbers and random sampling 40 | - writing a function 41 | 42 | ## Datasets used 43 | 44 | - [wms-management-survey](https://gabors-data-analysis.com/datasets/#wms-management-survey) 45 | - [sp500](https://gabors-data-analysis.com/datasets/#sp500) as homework. 46 | 47 | ## Lecture Time 48 | 49 | Ideal overall time: **20-30 mins**. 50 | 51 | This is a relatively short lecture, and it can be even shorter if less emphasis is put on output and input controlling and error-handling. 52 | 53 | ## Homework 54 | 55 | *Type*: quick practice, approx 15 mins, together with [lecture08-conditionals](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture08-conditionals), [lecture09-loops](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture09-loops), and [lecture10-random-numbers](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture10-random-numbers) 56 | 57 | Bootstrapping - using the [`sp500`](https://gabors-data-analysis.com/datasets/#sp500) data 58 | 59 | - download the cleaned data for `sp500` from [OSF](https://osf.io/h64z2/) 60 | - write a function, which calculates the bootstrap standard errors and confidence intervals based on these standard errors. 61 | - the function should have inputs for a) a vector of prices, b) the number of bootstraps, c) the level for the confidence interval 62 | - create a new variable for `sp500`: `daily_return`, which is the difference in the prices from one day to the next day. 63 | - use this `daily_return` variable and calculate the 80% confidence interval based on bootstrap standard errors along with the mean. 64 | 65 | 66 | ## Further material 67 | 68 | - Case study materials from Gabor's da_case_studies repository: on generalization (with bootstrapping), [ch05-stock-market-loss-generalize](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch05-stock-market-loss-generalize); on testing, [ch06-online-offline-price-test](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch06-online-offline-price-test) and [ch06-stock-market-loss-test](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch06-stock-market-loss-test) 69 | - Hadley Wickham and Garrett Grolemund: R for Data Science [Chapter 19](https://r4ds.had.co.nz/functions.html) provides further material on functions with exercises. 70 | - Grant McDermott: Data Science for Economists - [Lecture 10](https://github.com/uo-ec607/lectures/blob/master/10-funcs-intro/10-funcs-intro.md) is a great alternative to introduce functions. 71 | - Roger D.
Peng, Sean Kross, and Brooke Anderson: Mastering Software Development in R, [Chapter 2](https://bookdown.org/rdpeng/RProgDA/advanced-r-programming.html) is a great place to start deepening programming skills. 72 | - Hadley Wickham: [Advanced R](http://adv-r.had.co.nz/Introduction.html) is also a great place to start hard-core programming in R. 73 | 74 | 75 | ## File structure 76 | 77 | - [`functions.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.md) provides material for the live coding session with explanations. 78 | - [`functions.Rmd`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.Rmd) is the generating Rmarkdown file for [`functions.md`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.md) 79 | - [`functions.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture11-functions/functions.R) is a possible realization of the live coding session 80 | -------------------------------------------------------------------------------- /lecture11-functions/functions.R: -------------------------------------------------------------------------------- 1 | ######################### 2 | ## ## 3 | ## Functions ## 4 | ## ## 5 | ######################### 6 | rm(list=ls()) 7 | 8 | # 1) simplest case - calculate the mean 9 | my_avg <- function(x){ 10 | sum_x <- sum(x) 11 | sum_x / length(x) 12 | } 13 | 14 | # Import wms-management data 15 | library(tidyverse) 16 | wms <- read_csv('https://osf.io/uzpce/download') 17 | # save management score as x1 18 | x1 <- wms$management 19 | # Remove wms to keep environment tidy 20 | rm(wms) 21 | # Print out 22 | my_avg(x1) 23 | 24 | # or save it as a variable 25 | avg_x <- my_avg(x1) 26 | avg_x 27 | 28 | # 2) Calculate the standard deviation (the mean is an intermediate step; no input checks yet) 29 | my_fun1 <- function(x){ 30 | sum_x <- sum(x) 31 | # number of observations 32 | N <- length(x) 33 | # Mean of x 34 | mean_x <- sum_x / N 35 | # Variance of x (population variance, dividing by N) 36 | var_x <- sum((x - mean_x)^2 / N) 37 | # Standard deviation of x 38 | sqrt(var_x) 39 | } 40 | 41 | # Get the standard deviation for x1 42 | my_fun1(x1) 43 | 44 | # 3) Control for output: return() decides what the function gives back 45 | my_fun2 <- function(x){ 46 | sum_x <- sum(x) 47 | # number of observations 48 | N <- length(x) 49 | # Mean of x 50 | mean_x <- sum_x / N 51 | # Variance of x (sample variance, dividing by N - 1) 52 | var_x <- sum((x - mean_x)^2) / (N - 1) 53 | # Standard deviation of x (note: this value is discarded, return() below determines the output) 54 | sqrt(var_x) 55 | return(mean_x) 56 | } 57 | 58 | # Get the mean for x1 59 | my_fun2(x1) 60 | 61 | # 4) Multiple outputs 62 | my_fun3 <- function(x){ 63 | sum_x <- sum(x) 64 | # number of observations 65 | N <- length(x) 66 | # Mean of x 67 | mean_x <- sum_x / N 68 | # Variance of x 69 | var_x <- sum((x - mean_x)^2) / (N - 1) 70 | # Standard deviation of x 71 | sd_x <- sqrt(var_x) 72 | out <- list('sum' = sum_x, 'mean' = mean_x, 'var' = var_x ,'sd' = sd_x) 73 | return(out) 74 | } 75 | 76 | # Check the output 77 | out3 <- my_fun3(x1) 78 | # get all the output as list 79 | out3 80 | # get e.g.
the mean 81 | out3$mean 82 | 83 | # 5) Controlling for input 84 | my_avg_chck <- function(x){ 85 | stopifnot(is.numeric(x)) 86 | sum_x <- sum(x) 87 | sum_x / length(x) 88 | } 89 | 90 | # Good input 91 | my_avg_chck(x1) 92 | # Bad input (throws an error) 93 | my_avg_chck('Hello world') 94 | 95 | # 6) Multiple inputs 96 | conf_interval <- function(x, level = 0.95){ 97 | # mean of x 98 | mean_x <- mean(x, na.rm = TRUE) 99 | # standard deviation 100 | sd_x <- sd(x, na.rm = TRUE) 101 | # number of observations in x 102 | n_x <- sum(!is.na(x)) 103 | # Calculate the theoretical SE for mean of x 104 | se_mean_x <- sd_x / sqrt(n_x) 105 | # Calculate the CI 106 | if (level == 0.95){ 107 | CI_mean <- c(mean_x - 1.96*se_mean_x, mean_x + 1.96*se_mean_x) 108 | } else if (level == 0.99){ 109 | CI_mean <- c(mean_x - 2.58*se_mean_x, mean_x + 2.58*se_mean_x) 110 | } else { 111 | stop('No such level implemented for confidence interval, use 0.95 or 0.99') 112 | } 113 | out <- list('mean'=mean_x,'CI_mean' = CI_mean) 114 | return(out) 115 | } 116 | # Get some CI values 117 | conf_interval(x1, level = 0.95) 118 | conf_interval(x1) 119 | conf_interval(x1, level = 0.99) 120 | conf_interval(x1, level = 0.98) # stops with an error by design 121 | 122 | # Task - flexible level 123 | conf_interval2 <- function(x, level = 0.95){ 124 | # mean of x 125 | mean_x <- mean(x, na.rm = TRUE) 126 | # standard deviation 127 | sd_x <- sd(x, na.rm = TRUE) 128 | # number of observations in x 129 | n_x <- sum(!is.na(x)) 130 | # Calculate the theoretical SE for mean of x 131 | se_mean_x <- sd_x / sqrt(n_x) 132 | # Calculate the CI (note: the check needs &&; with | the condition would always be TRUE) 133 | if (level > 0 && level < 1){ 134 | crit_val <- qnorm(level + (1 - level)/2) 135 | CI_mean <- c(mean_x - crit_val*se_mean_x, mean_x + crit_val*se_mean_x) 136 | } else { 137 | stop('level must be between 0 and 1') 138 | } 139 | out <- list('mean'=mean_x,'CI_mean' = CI_mean) 140 | return(out) 141 | } 142 | # Get some CI values 143 | conf_interval2(x1, level = 0.95) 144 | conf_interval2(x1) 145 | conf_interval2(x1, level = 0.99) 146 | conf_interval2(x1, level = 0.98) 147 | 148 | ########## 149 | # A solution for Exercise: sampling distribution 150 | library(tidyverse) 151 | 152 | # Function for sampling distribution 153 | get_sampling_dists <- function(y, rep_num = 1000, sample_size = 1000){ 154 | # Check inputs 155 | stopifnot(is.numeric(y)) 156 | stopifnot(is.numeric(rep_num), length(rep_num) == 1, rep_num > 0) 157 | stopifnot(is.numeric(sample_size), length(sample_size) == 1 , 158 | sample_size > 0, sample_size <= length(y)) 159 | # initialize the for loop 160 | set.seed(100) 161 | mean_stat <- double(rep_num) 162 | t_stat_A <- double(rep_num) 163 | t_stat_B <- double(rep_num) 164 | # Usual scaling factor for the SE 165 | sqrt_n <- sqrt(sample_size) 166 | for (i in 1:rep_num) { 167 | # Need a new sample 168 | y_i <- sample(y, sample_size, replace = FALSE) 169 | # Mean for sample_i 170 | mean_stat[ i ] <- mean(y_i) 171 | # SE for Mean 172 | se_mean <- sd(y_i) / sqrt_n 173 | # T-statistics for hypotheses 174 | t_stat_A[ i ] <- (mean_stat[ i ] - 1) / se_mean 175 | t_stat_B[ i ] <- mean_stat[ i ] / se_mean 176 | } 177 | out <- tibble(mean_stat = mean_stat, t_stat_A = t_stat_A, 178 | t_stat_B = t_stat_B) 179 | } 180 | 181 | # Create y 182 | set.seed(123) 183 | y <- runif(10000, min = 0, max = 2) 184 | # Get some sampling distribution 185 | sampling_y <- get_sampling_dists(y, rep_num = 1000, sample_size = 100) 186 | 187 | # Plot these distributions 188 | ggplot(sampling_y, aes(x = mean_stat)) + 189 | geom_histogram(aes(y = ..density..), bins = 60, color = 'navyblue',
fill = 'navyblue') + 190 | geom_vline(xintercept = 1, linetype = 'dashed', color = 'blue', size = 1)+ 191 | geom_vline(xintercept = mean(y), color = 'red', size = 1) + 192 | geom_vline(xintercept = mean(sampling_y$mean_stat), color = 'black', size = 1)+ 193 | stat_function(fun = dnorm, args = list(mean = mean(y), sd = sd(y) / sqrt(100)) , 194 | color = 'red', size = 1) + 195 | labs(x = 'Sampling distribution of the mean', y = 'Density') + 196 | theme_bw() 197 | 198 | 199 | # Plot distribution for t-stats - Hypothesis A 200 | ggplot(sampling_y, aes(x = t_stat_A)) + 201 | geom_histogram(aes(y = ..density..), bins = 60, fill = 'navyblue') + 202 | stat_function(fun = dnorm, args = list(mean = 0, sd = 1) , 203 | color = 'red', size = 1) + 204 | labs(x = 'Sampling distribution of t-stats: hypothesis A', y = 'Density') + 205 | theme_bw() 206 | 207 | 208 | # Plot distribution for t-stats - Hypothesis B 209 | ggplot(sampling_y, aes(x = t_stat_B)) + 210 | geom_histogram(aes(y = ..density..), bins = 60, fill = 'navyblue') + 211 | stat_function(fun = dnorm, args = list(mean = 0, sd = 1) , 212 | color = 'red', size = 1) + 213 | scale_x_continuous(limits = c(-4,30))+ 214 | labs(x = 'Sampling distribution of t-stats: hypothesis B', y = 'Density') + 215 | theme_bw() 216 | 217 | 218 | -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-2.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-10-3.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-2.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-3.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-11-3.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-2.png -------------------------------------------------------------------------------- /lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture11-functions/functions_files/figure-gfm/unnamed-chunk-9-3.png -------------------------------------------------------------------------------- /lecture12-intro-to-regression/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 12: Introduction to regression 2 | 3 | ## Motivation 4 | 5 | You want to identify hotels in a city that are good deals: underpriced for their location and quality. You have scraped the web for data on all hotels in the city, and you have cleaned the data. You have carried out exploratory data analysis that revealed that hotels closer to the city center tend to be more expensive, but there is a lot of variation in prices between hotels at the same distance. How should you identify hotels that are underpriced relative to their distance to the city center? In particular, how should you capture the average price–distance relationship that would provide you a benchmark, to which you can compare actual prices to find good deals? 6 | 7 | The analysis of hotel prices and distance to the city center reveals that hotels further away from the center are less expensive by a certain amount, on average. Can you use this result to estimate how much more revenue a hotel developer could expect if it were to build a hotel closer to the center rather than farther away? Regression is a model for the conditional mean: the mean of y for different values of one or more x variables. Regression is used to uncover patterns of association. That, in turn, is used in the causal analysis, to uncover the effect of x on y, and in predictions, to arrive at a good guess of what the value of y is if we don’t know it, but we know the value of x. 8 | 9 | In this lecture, we introduce simple non-parametric regression and simple linear regression, and we show how to visualize their results. We then discuss simple linear regression in detail. We introduce the regression equation, how its coefficients are uncovered (estimated) in actual data, and we emphasize how to interpret the coefficients. 
We introduce the concepts of predicted value and residual and goodness of fit, and we discuss the relationship between regression and correlation. 10 | 11 | ## This lecture 12 | 13 | This lecture introduces regressions via the [hotels-vienna dataset](https://gabors-data-analysis.com/datasets/#hotels-vienna). It overviews models based on simple binary means, binscatters, lowess nonparametric regression, and introduces simple linear regression techniques. The lecture illustrates the use of predicted values and regression residuals with linear regression, but as homework, the same exercise is repeated with a binscatter-based model. 14 | 15 | This lecture is based on [Chapter 07, A: *Finding a good deal among hotels with simple regression*](https://gabors-data-analysis.com/casestudies/#ch07a-finding-a-good-deal-among-hotels-with-simple-regression) 16 | 17 | ## Learning outcomes 18 | After successfully completing [`hotels_intro_to_regression.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/raw_codes/hotels_intro_to_regression.R) students should be able to: 19 | 20 | - Binary means: 21 | - Calculate predictions based on the means of two categories and create an annotated graph 22 | - Binscatter: 23 | - Create means based on differently defined bins for the X variable 24 | - Show two different graphs: simple mean predictions for each bin as a dot, and a scatter with step functions 25 | - Lowess nonparametric regression: 26 | - How to create a lowess (loess) graph 27 | - What is the output of a loess model? What are the main advantages and disadvantages? 28 | - Simple linear regression 29 | - How to create a simple linear regression line in a scatterplot 30 | - The classical `lm` command and its limitations 31 | - The `feols` function (from the `fixest` package): estimate two models with and without heteroskedasticity-robust SEs and compare the two models 32 | - Have an idea about the `estimatr` package and the `lm_robust` command 33 | - How to get predicted values and errors of predictions 34 | - Get the best and worst deals: identify hotels with the smallest/largest errors 35 | - Visualize the errors via a histogram and a scatterplot, annotating the best and worst 5 deals. 36 | 37 | ## Dataset used 38 | 39 | - [hotels-vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna) 40 | 41 | ## Lecture Time 42 | 43 | Ideal overall time: **60 mins**. 44 | 45 | Going through [`hotels_intro_to_regression.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/raw_codes/hotels_intro_to_regression.R) takes around *45-50 minutes*; the rest is for the tasks. It builds on [lecture07-ggplot-indepth](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture07-ggplot-indepth) as it requires building a boxplot; this part can be skipped. 46 | 47 | 48 | ## Homework 49 | 50 | *Type*: quick practice, approx 15 mins 51 | 52 | Use the binscatter model with 7 bins and save the predicted values and errors (true price minus the predicted value). Find the best and worst 10 deals and visualize them with a scatterplot, highlighting the under/overpriced hotels with these best/worst deals according to this model. Compare to the simple linear regression. Which model would you use? Argue!
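To get started on the binscatter part of the homework, a minimal sketch (the `price` and `distance` column names from the hotels-vienna data are assumed, and the path to the data is a placeholder; `cut_number()` ships with `ggplot2`, `slice_min()`/`slice_max()` with `dplyr`):

```r
library(tidyverse)

# assumed: the cleaned hotels-vienna data; adjust the path to your own copy
hotels <- read_csv('data/hotels-vienna.csv')

hotels <- hotels %>%
  mutate(dist_bin = cut_number(distance, n = 7)) %>%  # 7 equal-sized bins
  group_by(dist_bin) %>%
  mutate(pred_bin = mean(price)) %>%                  # binscatter prediction: bin mean
  ungroup() %>%
  mutate(error_bin = price - pred_bin)                # true price minus prediction

best10  <- hotels %>% slice_min(error_bin, n = 10)    # most underpriced
worst10 <- hotels %>% slice_max(error_bin, n = 10)    # most overpriced
```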
53 | 54 | 55 | ## Further material 56 | 57 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch07-hotels-simple-reg](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch07-hotels-simple-reg) 58 | - On ggplot, see Chapter 3.5-6 and Chapter 5.6 of [Kieran H. (2019): Data Visualization](https://socviz.co/makeplot.html#mapping-aesthetics-vs-setting-them) or [Winston C. (2022): R Graphics Cookbook, Chapter 5](https://r-graphics.org/chapter-scatter) 59 | - On regression, [Grant McDermott: Data Science for Economists, Course material Lecture 08](https://github.com/uo-ec607/lectures/tree/master/08-regression) provides a somewhat different approach, but can be a nice supplement 60 | 61 | 62 | ## Folder structure 63 | 64 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture12-intro-to-regression/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 65 | - [`hotels_intro_to_regression.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/raw_codes/hotels_intro_to_regression.R) is the main material for this lecture. 66 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture12-intro-to-regression/complete_codes) includes code with the solution for [`hotels_intro_to_regression.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/raw_codes/hotels_intro_to_regression.R) as [`hotels_intro_to_regression_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture12-intro-to-regression/complete_codes/hotels_intro_to_regression_fin.R) 67 | 68 | -------------------------------------------------------------------------------- /lecture14-simple-regression/complete_codes/life_exp_clean.R: -------------------------------------------------------------------------------- 1 | ############################################# 2 | # # 3 | # Lecture 14 # 4 | # # 5 | # Auxiliary file to clean data # 6 | # - can practice, but not recommended # 7 | # # 8 | # Case Study: # 9 | # Life-expectancy and income # 10 | # # 11 | ############################################# 12 | 13 | 14 | 15 | # Clear memory 16 | rm(list=ls()) 17 | 18 | library(tidyverse) 19 | library(modelsummary) 20 | 21 | # Call the data from github 22 | my_url <- 'https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/main/lecture14-simple-regression/data/raw/WDI_lifeexp_raw.csv' 23 | df <- read_csv(my_url) 24 | 25 | ## Check the observations: 26 | # There are a lot of grouping observations, 27 | # which usually contain a number in their iso2c code 28 | d1 <- df %>% filter(grepl('[[:digit:]]', df$iso2c)) 29 | d1 30 | # Filter these out 31 | df <- df %>% filter(!grepl('[[:digit:]]', df$iso2c)) 32 | 33 | # Some grouping observations are still there, check each of them 34 | # HK - Hong Kong, China 35 | # OE - OECD members 36 | # all with starting X, except XK which is Kosovo 37 | # all with starting Z, except ZA-South Africa, ZM-Zambia and ZW-Zimbabwe 38 | 39 | # 1st drop specific values 40 | drop_id <- c('EU','HK','OE') 41 | # Check the filtering 42 | df %>% filter(grepl(paste(drop_id, collapse='|'), df$iso2c)) 43 | # Save the opposite 44 | df <- df %>% filter(!grepl(paste(drop_id, collapse='|'), df$iso2c)) 45 | 46 | # 2nd drop values with certain starting characters 47 | # Get the first letter from iso2c 48 | fl_iso2c <- substr(df$iso2c, 1, 1) 49 | retain_id <- c('XK','ZA','ZM','ZW') 50 | # Check
51 | d1 <- df %>% filter(grepl('X', fl_iso2c) | grepl('Z', fl_iso2c) & 52 | !grepl(paste(retain_id, collapse='|'), df$iso2c)) 53 | # Save observations which are the opposite (use of !) 54 | df <- df %>% filter(!(grepl('X', fl_iso2c) | grepl('Z', fl_iso2c) & 55 | !grepl(paste(retain_id, collapse='|'), df$iso2c))) 56 | 57 | # Clear non-needed variables 58 | rm(d1, drop_id, fl_iso2c, retain_id) 59 | 60 | ### 61 | # Check for missing observations 62 | m <- df %>% filter(!complete.cases(df)) 63 | # Drop if life expectancy, GDP or total population is missing -> keep complete cases (a missing iso2c alone is allowed) 64 | df <- df %>% filter(complete.cases(df) | is.na(df$iso2c)) 65 | 66 | ### 67 | # CLEAN VARIABLES 68 | # 69 | # Recreate table: 70 | # Rename variables and scale them 71 | # Drop all the others !! in this case write into the readme that it refers to year 2019 !! 72 | df <-df %>% transmute(country = country, 73 | population=SP.POP.TOTL/1000000, 74 | gdppc=NY.GDP.PCAP.PP.KD/1000, 75 | lifeexp=SP.DYN.LE00.IN) 76 | 77 | ### 78 | # Check for extreme values 79 | # histograms for all numeric variables 80 | df %>% 81 | keep(is.numeric) %>% 82 | gather() %>% 83 | ggplot(aes(value)) + 84 | facet_wrap(~key, scales = 'free') + 85 | geom_histogram(bins=30) 86 | 87 | # It seems we have large values for population: 88 | df %>% filter(population > 500) 89 | # These are India and China... not extreme values 90 | 91 | # Check the summary statistics as well 92 | datasummary_skim(df) 93 | 94 | # Save the cleaned data file to your working directory 95 | my_path <- 'ENTER YOUR OWN PATH' 96 | write_csv(df, paste0(my_path,'data/clean/WDI_lifeexp_clean.csv')) 97 | 98 | # I have pushed it into github as well! 99 | 100 | 101 | 102 | -------------------------------------------------------------------------------- /lecture14-simple-regression/complete_codes/life_exp_getdata_fin.R: -------------------------------------------------------------------------------- 1 | ############################################# 2 | # # 3 | # Lecture 14 # 4 | # # 5 | # Getting the data for analysis # 6 | # - practice with WDI package # 7 | # # 8 | # Case Study: # 9 | # Life-expectancy and income # 10 | # # 11 | ############################################# 12 | 13 | 14 | # Clear memory 15 | rm(list=ls()) 16 | 17 | # Call packages 18 | if (!require(WDI)){ 19 | install.packages('WDI') 20 | library(WDI) 21 | } 22 | library(tidyverse) 23 | 24 | 25 | # Reminder on how WDI works - it is an API 26 | # Search for variables which contain GDP 27 | a <- WDIsearch('gdp') 28 | # Narrow down the search for: GDP + something + capita + something + constant 29 | a <- WDIsearch('gdp.*capita.*constant') 30 | 31 | # Get GDP data 32 | gdp_data <- WDI(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2019, end=2019) 33 | 34 | ## 35 | # Task: get the GDP data, along with `population, total' and `life expectancy at birth' 36 | # for year 2019 and save to your raw folder! 37 | # Note: I have pushed it to Github, we will use that later, just to be on the same page! 38 | a <- WDIsearch('population, total') 39 | b <- WDIsearch('life expectancy at birth') 40 | 41 | # Get all the data for year 2019 42 | data_raw <- WDI(indicator=c('NY.GDP.PCAP.PP.KD','SP.DYN.LE00.IN','SP.POP.TOTL'), 43 | country='all', start=2019, end=2019) 44 | 45 | # Save the raw data file to your working directory 46 | my_path <- 'ENTER YOUR OWN PATH' 47 | write_csv(data_raw, paste0(my_path,'data/raw/WDI_lifeexp_raw.csv')) 48 | 49 | # I have pushed it to Github, we will use that! 50 | # Note these are only the raw files!
They are cleaned in a separate file and the results are saved to the clean folder! 51 | 52 | 53 | -------------------------------------------------------------------------------- /lecture14-simple-regression/raw_codes/life_exp_getdata.R: -------------------------------------------------------------------------------- 1 | ############################################# 2 | # # 3 | # Lecture 14 # 4 | # # 5 | # Getting the data for analysis # 6 | # - practice with WDI package # 7 | # # 8 | # Case Study: # 9 | # Life-expectancy and income # 10 | # # 11 | ############################################# 12 | 13 | 14 | # Clear memory 15 | rm(list=ls()) 16 | 17 | # Call packages 18 | if (!require(WDI)){ 19 | install.packages('WDI') 20 | library(WDI) 21 | } 22 | library(tidyverse) 23 | 24 | 25 | # Reminder on how WDI works - it is an API 26 | # Search for variables which contain GDP 27 | a <- WDIsearch('gdp') 28 | # Narrow down the search for: GDP + something + capita + something + constant 29 | a <- WDIsearch('gdp.*capita.*constant') 30 | 31 | # Get GDP data 32 | gdp_data <- WDI(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2019, end=2019) 33 | 34 | ## 35 | # Task: get the GDP data, along with `population, total' and `life expectancy at birth' 36 | # for year 2019 and save to your raw folder! 37 | # Note: I have pushed it to Github, we will use that later, just to be on the same page! 38 | # Note these are only the raw files! They are cleaned in a separate file and the results are saved to the clean folder! 39 | 40 | 41 | -------------------------------------------------------------------------------- /lecture16-binary-models/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 16: Binary outcome - modeling probabilities 2 | 3 | ## Motivation 4 | 5 | Does smoking make you sick? And can smoking make you sick in late middle age even if you stopped years earlier? You have data on many healthy people in their fifties from various countries, and you know whether they stayed healthy four years later. You have variables on their smoking habits, age, income, and many other characteristics. How can you use this data to estimate how much more likely non-smokers are to stay healthy? How can you uncover if that depends on whether they never smoked or are former smokers? And how can you tell if that association is the result of smoking itself or, instead, underlying differences in smoking by education, income, and other factors? 6 | 7 | The lecture is related to the chapter that discusses probability models: regressions with binary y variables. In a sense, we can treat a binary y variable just like any other variable and use regression analysis as we would otherwise. With a binary y variable, we can estimate nonlinear probability models instead of the linear ones. Data analysts need to have a good understanding of when to use these different probability models, and how to interpret and evaluate their results. 8 | 9 | ## This lecture 10 | 11 | This lecture introduces binary outcome models with an analysis of health outcomes with multiple variables based on the [share-health](https://gabors-data-analysis.com/datasets/#share-health) dataset. First, we introduce saturated models (smoking on health) and linear probability models with multiple explanatory variables. We check the predicted outcome probabilities for certain groups. Then we focus on non-linear binary models: the logit and probit model.
We estimate marginal effects to interpret the average (marginal) effects of variables on the outcome probabilities. We overview goodness-of-fit statistics (R2, Pseudo-R2, Brier score, and Log-loss) along with visual and descriptive inspection of the predicted probabilities. Finally, we calculate the estimated bias and the calibration curve to better understand model performance. 12 | 13 | This lecture is based on [Chapter 11, A: Does smoking pose a health risk?](https://gabors-data-analysis.com/casestudies/#ch11a-does-smoking-pose-a-health-risk) 14 | 15 | ## Learning outcomes 16 | After successfully completing the codes in [`binary_models.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/raw_codes/binary_models.R), students should be able to: 17 | 18 | 19 | - Calculate by hand or estimate saturated models 20 | - Visualize and understand binary outcome scatterplots 21 | - Estimate Linear Probability Models (LPM) 22 | - Use `feols` to estimate regressions with multiple explanatory variables 23 | - Use `etable` to compare multiple candidate models and report model statistics such as R2 to evaluate models 24 | - Understand the limitations of LPM 25 | - Carry out sub-group analysis based on predicted probabilities 26 | - Estimate Non-Linear Probability Models 27 | - Use `feglm` with `link = 'logit'` or `'probit'` to estimate logit or probit models 28 | - Estimate marginal effects with the `marginaleffects` package 29 | - Use `etable` to compare logit and probit coefficients 30 | - Use `modelsummary` (from the `modelsummary` package) to compare LPM, logit/probit, and logit/probit with marginal effects 31 | - Handle the `modelsummary` function to get relevant goodness-of-fit measures 32 | - Use the `fitstat_register()` function for `etable` to calculate user-supplied goodness-of-fit statistics, such as the *Brier score* or *Log-loss* measures 33 | - Understand the usefulness of comparing the distribution of predicted probabilities for different models 34 | - Understand the usefulness of comparing descriptive statistics of the predicted probabilities for different models 35 | - Calculate the bias of the model along with the calibration curve 36 | 37 | ## Datasets used 38 | 39 | - [share-health](https://gabors-data-analysis.com/datasets/#share-health) 40 | 41 | ## Lecture Time 42 | 43 | Ideal overall time: **100 mins**. 44 | 45 | Going through [`binary_models.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/raw_codes/binary_models.R) takes around *80-90 minutes* as there is much discussion and interpretation of the models. Solving the tasks takes the remaining *10-20 minutes*. 46 | 47 | 48 | ## Homework 49 | 50 | *Type*: quick practice, approx 20 mins 51 | 52 | Use the same [share-health](https://gabors-data-analysis.com/datasets/#share-health) dataset, but now use `smoking` as your outcome variable, as this task asks you to predict whether a person is a smoker. Use similar variables (except `stayshealthy`) to explain `smoking`. Run an LPM, a logit, and a probit model. Compare the coefficients of these models along with the average marginal effects. Compute the goodness-of-fit statistics (R2, Pseudo-R2, Brier score, log-loss) for all of the models. Choose one, calculate the bias, and plot the calibration curve.
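For reference, a minimal sketch of how the Brier score and log-loss can be computed directly from predicted probabilities (hypothetical vectors: `y` is the observed 0/1 outcome, `pred` the model's predicted probability):

```r
# Brier score: mean squared distance between predicted probability and outcome
brier <- mean((pred - y)^2)

# Log-loss: average negative log-likelihood; clip predictions to avoid log(0)
eps <- 1e-15
p <- pmax(pmin(pred, 1 - eps), eps)
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))
```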
53 | 54 | 55 | 56 | ## Further material 57 | 58 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch11-smoking-health-risk](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch11-smoking-health-risk) 59 | - Coding and multiple linear regression: partially related material in Chapter 4, especially Ch. 4.2, of [James-Witten-Hastie-Tibshirani (2013) - An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/) 60 | - Some other useful R-related resources, using base-R methods: [Christoph Hanck, Martin Arnold, Alexander Gerber, and Martin Schmelzer: Introduction to Econometrics with R, Chapter 11](https://www.econometrics-with-r.org/11-rwabdv.html) or [Ramzi W. Nahhas: Introduction to Regression Methods for Public Health Using R 61 | ](https://bookdown.org/rwnahhas/RMPH/blr.html). 62 | 63 | 64 | ## Folder structure 65 | 66 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture16-binary-models/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 67 | - [`binary_models.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/raw_codes/binary_models.R) is the main material for this lecture. 68 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture16-binary-models/complete_codes) includes code with the solution for [`binary_models.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/raw_codes/binary_models.R) as [`binary_models_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture16-binary-models/complete_codes/binary_models_fin.R) 69 | 70 | -------------------------------------------------------------------------------- /lecture17-dates-n-times/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 17: Date and time manipulations 2 | 3 | ## Motivation 4 | 5 | Time series data is often used to analyze business, economic, and policy questions. Time series data presents additional opportunities as well as additional challenges for regression analysis. Unlike cross-sectional data, it enables examining how y changes when x changes, and it also allows us to examine what happens to y right away or with a delay. However, variables in time series data come with some special features that affect how we should estimate regressions, and how we can interpret their coefficients. 6 | 7 | One of these features is the frequency of the time series. It can vary from seconds to years. Time series with more frequent observations have higher frequency, e.g. monthly frequency is higher than yearly frequency, but it is lower than daily frequency. The frequency may also be irregular with gaps in-between. Gaps in time series data can be viewed as missing values of variables. But they tend to have specific causes. To run a regression of y on x in time series data, the two variables need to be at the same time series frequency. When the time series frequencies of y and x are different, we need to adjust one of them. Most often that means aggregating the variable at a higher frequency (e.g., from weekly to monthly). With flow variables, such as sales, aggregation means adding up; with stock variables and other kinds of variables, such as prices, it is often taking an average for the period or taking the last value, such as the closing price.
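To make the flow-versus-stock distinction concrete, a minimal sketch of monthly aggregation (assuming a hypothetical daily tibble `df` with `date`, `sales`, and `price` columns; `floor_date()` comes from `lubridate`):

```r
library(tidyverse)
library(lubridate)

monthly <- df %>%
  mutate(month = floor_date(date, 'month')) %>%   # assign each day to its month
  group_by(month) %>%
  summarise(sales = sum(sales),                   # flow variable: add up
            price = last(price))                  # price: take the closing value
```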
8 | 9 | Another fundamental feature of time series data is that variables evolve with time. They may hover around a stable average value, or they may drift upwards or downwards. A variable in time series data follows a trend if it tends to change in one direction; in other words, it has a tendency to increase or decrease. Another possible issue is seasonality. Seasonality means that the value of the variable is expected to follow a cyclical pattern, tracking the seasons of the year, days of the week, or hours of the day. Because of such systematic changes, later observations tend to be different from earlier observations. Understanding trends and seasonality is important because they make regression analysis challenging. They are examples of a broader concept, non-stationarity. Stationarity means stability; non-stationarity means the lack of stability. Stationary time series variables have the same expected 10 | value and the same distribution at all times. Trends and seasonality violate stationarity because the expected value is different at different times. 11 | 12 | ## This lecture 13 | 14 | This lecture introduces basic date and time-variable manipulations. The first part starts with the basics of the `lubridate` package, overviewing time-related functions and manipulations with time-related values and variables. The second part discusses aggregating time-series data across different frequencies, along with visualization for time-series data and unit root tests. 15 | 16 | This lecture utilizes the case study of [Chapter 12, A: Returns on a company stock and market returns](https://gabors-data-analysis.com/casestudies/#ch12a-returns-on-a-company-stock-and-market-returns) as homework and uses the [`stocks-sp500`](https://gabors-data-analysis.com/datasets/#stocks-sp500) dataset. 17 | 18 | ## Learning outcomes 19 | After successfully completing [`date_time_manipulations.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/raw_codes/date_time_manipulations.R), students should be able to: 20 | 21 | - Work with the `lubridate` package, especially 22 | - creating specific time variables and converting other types of variables into a date or datetime object 23 | - understanding the importance of time zones 24 | - Get specific parts of a date object such as `year, quarter, month, wday, yday, day, leap_year` 25 | - Round to the closest month, year, quarter, etc. 26 | - Understand the difference between durations and periods 27 | - Carry out time aggregation 28 | - aggregate different time series objects to lower frequencies, using the mean/median/max/end date, etc. 29 | - add `lag`-ged and differenced variables to the data 30 | - Visualize time series 31 | - handle the time variable on the x-axis with `scale_x_date()` 32 | - use `facet_wrap` to stack multiple graphs as an alternative to `ggpubr` 33 | - standardize variables and put multiple lines into one graph 34 | - Run unit root tests using the `aTSA` package's `pp.test` function 35 | - understand the result of the Phillips-Perron test and decide whether the variable needs to be differenced or not 36 | 37 | ## Datasets used 38 | 39 | - [`stocks-sp500`](https://gabors-data-analysis.com/datasets/#stocks-sp500) 40 | 41 | ## Lecture Time 42 | 43 | Ideal overall time: **35-40 mins**. 44 | 45 | Going through [`date_time_manipulations.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/raw_codes/date_time_manipulations.R) takes around *30 minutes*.
There are some discussions and interpretations of the time series (e.g. stationarity). Solving the tasks takes the remaining *5-10 minutes*. The lecture can be shortened by only showing the methods. It will be partially repeated in [lecture18-timeseries-regression](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression). 46 | 47 | 48 | ## Homework 49 | 50 | *Type*: quick practice, approx 10 mins 51 | 52 | Estimate the *beta* coefficient by regressing quarterly Microsoft stock log returns on quarterly SP500 log returns. Use the [`stocks-sp500`](https://gabors-data-analysis.com/datasets/#stocks-sp500) dataset. Take care when aggregating the data: use the last day in the quarter, then take logs, and then difference the variable to get log returns. When estimating the regression, use heteroskedasticity-robust standard errors (in the next lecture we learn how to use Newey-West SEs). 53 | 54 | 55 | ## Further material 56 | 57 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch12-stock-returns-risk](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch12-stock-returns-risk/ch12-stock-returns-risk.R) 58 | - Hadley Wickham and Garrett Grolemund, R for Data Science: [Chapter 16](https://r4ds.had.co.nz/dates-and-times.html) discusses date and time formats in more detail. 59 | - The [`timetk` package](https://business-science.github.io/timetk/index.html) is a well-documented, advanced time-series package with many possibilities and great solutions. A good starting point for further material on time series with R. 60 | - The [`lubridate` package](https://lubridate.tidyverse.org/index.html) has good documentation, worth checking. 61 | 62 | ## Folder structure 63 | 64 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture17-dates-n-times/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 65 | - [`date_time_manipulations.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/raw_codes/date_time_manipulations.R) is the main material for this lecture. 66 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/edit/main/lecture17-dates-n-times/complete_codes) includes the code with solutions for [`date_time_manipulations.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/raw_codes/date_time_manipulations.R) as [`date_time_manipulations_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture17-dates-n-times/complete_codes/date_time_manipulations_fin.R) 67 | 68 | -------------------------------------------------------------------------------- /lecture18-timeseries-regression/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 18: Introduction to time-series regression 2 | 3 | ## Motivation 4 | 5 | Heating and cooling are potentially important uses of electricity. To investigate how weather conditions affect electricity consumption, you have collected data on temperature and residential electricity consumption in a hot region. How should you estimate the association between temperature and electricity consumption? How should you define the variables of interest, and how should you prepare the data, which has daily observations on temperature and monthly observations on electricity consumption?
Should you worry about the fact that both electricity consumption and temperature vary a lot across months within years, and if yes, what should you do about it? 6 | 7 | Time series data is often used to analyze business, economic, and policy questions. Time series data presents additional opportunities as well as additional challenges for regression analysis. Unlike cross-sectional data, it enables examining how y changes when x changes, and it also allows us to examine what happens to y right away or with a delay. However, variables in time series data come with some special features that affect how we should estimate regressions, and how we can interpret their coefficients. 8 | 9 | ## This lecture 10 | 11 | This lecture introduces time-series regression via the [arizona-electricity](https://gabors-data-analysis.com/datasets/#arizona-electricity) dataset. During this lecture, students manipulate time-series data along time dimensions, create multiple time-series related graphs, and get familiar with (partial) autocorrelation. Differenced variables, lags of the outcome, lags of the explanatory variables, and (deterministic) seasonality are used in the regression models. These models are estimated via `feols` with Newey-West standard errors. Model comparisons and estimating cumulative effects with valid SEs are shown. 12 | 13 | This lecture is based on [Chapter 12, B: Electricity consumption and temperature](https://gabors-data-analysis.com/casestudies/#ch12b-electricity-consumption-and-temperature) 14 | 15 | ## Learning outcomes 16 | After successfully completing [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R), students should be able to: 17 | 18 | - Merge different time-series data 19 | - Create time-series related descriptives and graphs 20 | - handle dates on the axis with different formatting 21 | - import source code from a URL via `source_url` from `devtools` 22 | - create autocorrelation and partial autocorrelation graphs and interpret them 23 | - Run time-series regressions with `feols` from `fixest` 24 | - Understand why defining the period and id is important with the `fixest` package 25 | - Estimate Newey-West standard errors and understand the role of lags 26 | - Control for seasonality via dummies 27 | - Add lagged variables to the model (and possibly leads as well) 28 | - Understand how and why to use the same time interval when comparing competing time-series models 29 | - Estimate the standard error(s) for the cumulative effect 30 | 31 | ## Datasets used 32 | 33 | - [arizona-electricity](https://gabors-data-analysis.com/datasets/#arizona-electricity) 34 | 35 | ## Lecture Time 36 | 37 | Ideal overall time: **60-80 mins**. 38 | 39 | Going through [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R) takes around *50-70 minutes*, as there are some discussions and interpretations of the time series (e.g. stationarity, transformation of variables, etc.). Solving the tasks takes the remaining *5-10 minutes*. 40 | 41 | 42 | ## Homework 43 | 44 | *Type*: quick practice, approx 20 mins 45 | 46 | You will use the [case-shiller-la](https://gabors-data-analysis.com/datasets/#case-shiller-la) dataset to build a model for unemployment based on the Shiller price index. Load the data and consider only `pn` (Shiller price index) and `un` (unemployment) as the variables of interest. Both are seasonally adjusted.
Decide which transformation to use to make the variables stationary. Create models where you predict unemployment based on the Shiller price index. You should have at least one model that uses only contemporaneous effects and one that uses lagged variables of both variables as explanatory variables. 47 | 48 | 49 | ## Further material 50 | 51 | - More materials on the case study can be found in Gabor's *da_case_studies* repository: [ch12-electricity-temperature](https://github.com/gabors-data-analysis/da_case_studies/tree/master/ch12-electricity-temperature) 52 | - A handy, but somewhat different, approach to time-series analysis can be found in [James Long and Paul Teetor: R Cookbook (2019), Chapter 14](https://rc2e.com/timeseriesanalysis) 53 | - A good starting point for advanced methods in time-series analysis is [`modeltime`](https://business-science.github.io/modeltime/), which introduces automated, machine learning, and deep learning-based analysis; its supplementary package [`timetk`](https://business-science.github.io/timetk/index.html) has many great time-series related manipulations. 54 | 55 | ## Folder structure 56 | 57 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 58 | - [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R) is the main material for this lecture. 59 | - [`ggplot.acorr.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/ggplot.acorr.R) is an auxiliary function to plot (partial) autocorrelation graphs, by [Kevin Liu](https://rh8liuqy.github.io/ACF_PACF_by_ggplot2.html). This file is `source_url`-ed into [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R). 60 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/complete_codes) includes the code with solutions for [`intro_time_series.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/raw_codes/intro_time_series.R) as [`intro_time_series_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture18-timeseries-regression/complete_codes/intro_time_series_fin.R) 61 | 62 | -------------------------------------------------------------------------------- /lecture18-timeseries-regression/raw_codes/ggplotacorr.R: -------------------------------------------------------------------------------- 1 | ## Auto-correlation function and Partial-autocorrelation function 2 | # by: https://rh8liuqy.github.io/ACF_PACF_by_ggplot2.html 3 | 4 | 5 | ggplotacorr <- function(data, lag.max = 24, ci = 0.95, large.sample.size = TRUE, horizontal = TRUE,...)
{ 6 | 7 | require(ggplot2) 8 | require(dplyr) 9 | require(cowplot) 10 | 11 | if(horizontal == TRUE) {numofrow <- 1} else {numofrow <- 2} 12 | 13 | list.acf <- acf(data, lag.max = lag.max, type = "correlation", plot = FALSE) 14 | N <- as.numeric(list.acf$n.used) 15 | df1 <- data.frame(lag = list.acf$lag, acf = list.acf$acf) 16 | df1$lag.acf <- dplyr::lag(df1$acf, default = 0) 17 | df1$lag.acf[2] <- 0 18 | df1$lag.acf.cumsum <- cumsum((df1$lag.acf)^2) 19 | df1$acfstd <- sqrt(1/N * (1 + 2 * df1$lag.acf.cumsum)) 20 | df1$acfstd[1] <- 0 21 | df1 <- select(df1, lag, acf, acfstd) 22 | 23 | list.pacf <- acf(data, lag.max = lag.max, type = "partial", plot = FALSE) 24 | df2 <- data.frame(lag = list.pacf$lag,pacf = list.pacf$acf) 25 | df2$pacfstd <- sqrt(1/N) 26 | 27 | if(large.sample.size == TRUE) { 28 | plot.acf <- ggplot(data = df1, aes(x = lag, y = acf)) + 29 | geom_area(aes(x = lag, y = qnorm((1+ci)/2)*acfstd), fill = "#B9CFE7") + 30 | geom_area(aes(x = lag, y = -qnorm((1+ci)/2)*acfstd), fill = "#B9CFE7") + 31 | geom_col(fill = "#4373B6", width = 0.7) + 32 | scale_x_continuous(breaks = seq(0,max(df1$lag),6)) + 33 | scale_y_continuous(name = element_blank(), 34 | limits = c(min(df1$acf,df2$pacf),1)) + 35 | ggtitle("ACF") + 36 | theme_bw() 37 | 38 | plot.pacf <- ggplot(data = df2, aes(x = lag, y = pacf)) + 39 | geom_area(aes(x = lag, y = qnorm((1+ci)/2)*pacfstd), fill = "#B9CFE7") + 40 | geom_area(aes(x = lag, y = -qnorm((1+ci)/2)*pacfstd), fill = "#B9CFE7") + 41 | geom_col(fill = "#4373B6", width = 0.7) + 42 | scale_x_continuous(breaks = seq(0,max(df2$lag, na.rm = TRUE),6)) + 43 | scale_y_continuous(name = element_blank(), 44 | limits = c(min(df1$acf,df2$pacf),1)) + 45 | ggtitle("PACF") + 46 | theme_bw() 47 | } 48 | else { 49 | plot.acf <- ggplot(data = df1, aes(x = lag, y = acf)) + 50 | geom_col(fill = "#4373B6", width = 0.7) + 51 | geom_hline(yintercept = qnorm((1+ci)/2)/sqrt(N), 52 | colour = "sandybrown", 53 | linetype = "dashed") + 54 | geom_hline(yintercept = - qnorm((1+ci)/2)/sqrt(N), 55 | colour = "sandybrown", 56 | linetype = "dashed") + 57 | scale_x_continuous(breaks = seq(0,max(df1$lag),6)) + 58 | scale_y_continuous(name = element_blank(), 59 | limits = c(min(df1$acf,df2$pacf),1)) + 60 | ggtitle("ACF") + 61 | theme_bw() 62 | 63 | plot.pacf <- ggplot(data = df2, aes(x = lag, y = pacf)) + 64 | geom_col(fill = "#4373B6", width = 0.7) + 65 | geom_hline(yintercept = qnorm((1+ci)/2)/sqrt(N), 66 | colour = "sandybrown", 67 | linetype = "dashed") + 68 | geom_hline(yintercept = - qnorm((1+ci)/2)/sqrt(N), 69 | colour = "sandybrown", 70 | linetype = "dashed") + 71 | scale_x_continuous(breaks = seq(0,max(df2$lag, na.rm = TRUE),6)) + 72 | scale_y_continuous(name = element_blank(), 73 | limits = c(min(df1$acf,df2$pacf),1)) + 74 | ggtitle("PACF") + 75 | theme_bw() 76 | } 77 | cowplot::plot_grid(plot.acf, plot.pacf, nrow = numofrow) 78 | } 79 | -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/create figure wi label-1.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/create figure wi label-1.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/plot pred graph-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/plot pred graph-1.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/setup-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/setup-1.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/show two graphs-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/complete_codes/advanced_rmarkdown_fin_files/figure-latex/show two graphs-1.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/extra/maschools_report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/extra/maschools_report.pdf -------------------------------------------------------------------------------- /lecture19-advaced-rmarkdown/hotels_analysis.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture19-advaced-rmarkdown/hotels_analysis.pdf -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 20: Spatial data visualization 2 | 3 | ## Motivation 4 | 5 | Visualizing data spatially allows us to gain insights into what is going on beyond our own bubble. Aside from being great visuals that immediately engage audiences, map data visualizations provide critical context for the metrics. Combining geospatial information with data creates a greater scope of understanding. Some benefits of using maps in your data visualization include: 6 | 7 | 1. A greater ability to understand the distribution of your variable across the city, state, country, or world 8 | 2. The ability to compare the activity across several locations at a glance 9 | 3. More intuitive decision making for company leaders 10 | 4. 
Contextualizing your data in the real world 11 | 12 | 13 | There is lots of room for creativity when making map dashboards because there are numerous ways to convey information with this kind of visualization. In R, we map geographical regions colored, shaded, or graded according to some variable. They are visually striking, especially when the spatial units of the map are familiar entities. 14 | 15 | | Life expectancy map | Hotel prices in cities | 16 | |-------------------------|-------------------------| 17 | | ![alt text 1](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/output/lifeexpectancy_world.png) | ![alt text 2](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/output/heu_prices.png) | 18 | 19 | 20 | ## This lecture 21 | 22 | This lecture introduces spatial data visualization using maps. During the lecture, students learn how to use the `maps` package, which offers built-in maps, with the [worldbank-lifeexpectancy](https://gabors-data-analysis.com/datasets/#worldbank-lifeexpectancy) data. Plotting the raw life expectancy at birth on a world map is already a powerful tool, but students will also learn how to show deviations from the expected value given by the regression model. In the second part, students import raw `shp` files with auxiliary files, which contain the maps of London boroughs and Vienna districts. With the [hotels-europe](https://gabors-data-analysis.com/datasets/#hotels-europe) dataset, the average price for each unit on the map is shown. 23 | 24 | Case studies used during the lecture: 25 | - [Chapter 08, B: How is life expectancy related to the average income of a country?](https://gabors-data-analysis.com/casestudies/#ch08b-how-is-life-expectancy-related-to-the-average-income-of-a-country) 26 | - [Chapter 03, B: Comparing hotel prices in Europe: Vienna vs London](https://gabors-data-analysis.com/casestudies/#ch03b-comparing-hotel-prices-in-europe-vienna-vs-london) 27 | 28 | ## Learning outcomes 29 | After successfully completing [`visualize_spatial.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes/visualize_spatial.R), students should be able to: 30 | 31 | - Part I 32 | - Use the `maps` package to import a world map 33 | - Understand how `geom_polygon` works 34 | - Shape the look of the map with `coord_equal` or `coord_map` 35 | - Use `theme_map()` 36 | - Use different coloring with `scale_fill_gradient` 37 | - Match different data tables to be able to plot a map 38 | - Use custom values as a filler on the map, based on the life-expectancy case study 39 | - Part II 40 | - Use the `rgdal` package with the `readOGR` function to import `shp` files and other needed auxiliary files such as `shx` and `dbf` 41 | - Convert an `S4 object` to a tibble and format it such that it can be used with `ggplot2` 42 | - Use `geom_path` to color the edges of the map 43 | - Manipulate the map to show only inner-London boroughs 44 | - Add (borough or district) names to a map with `aggregate` and `geom_text` 45 | - Control the limits of legend colors with `scale_fill_gradientn()` 46 | - Use nice color maps with the `wesanderson` package 47 | - Task for Vienna: replicate the same as for London 48 | - Use `ggarrange` with a common legend and add a common title with `annotate_figure()` 49 | 50 | ## Lecture Time 51 | 52 | Ideal overall time: **40-60 mins**.
53 | 54 | Going through [`visualize_spatial.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes/visualize_spatial.R) takes around 20-40 minutes. Solving the tasks takes the remaining 20-40 minutes, as there are two long tasks. 55 | 56 | 57 | ## Homework 58 | 59 | *Type*: quick practice, approx 10 mins 60 | 61 | Get countries' GDP growth rates with the `WDI` package. Plot the values on a world map. 62 | 63 | 64 | ## Further material 65 | 66 | - This lecture is based on [Kieran Healy: Data Visualization, Chapter 7](https://socviz.co/maps.html#maps). Check it out for more content. 67 | - Great content on (advanced) spatial data analysis can be found in [Edzer Pebesma, Roger Bivand: Spatial Data Science with applications in R](https://keen-swartz-3146c4.netlify.app/); specifically, [a blog post](https://r-spatial.org/r/2018/10/25/ggplot2-sf.html) related to this book may be interesting. 68 | 69 | ## Folder structure 70 | 71 | - [raw_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes) includes codes that are ready to use during the course but require some live coding in class. 72 | - [`visualize_spatial.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes/visualize_spatial.R) is the main material for this lecture. 73 | - [complete_codes](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/complete_codes) includes the code with solutions for [`visualize_spatial.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/raw_codes/visualize_spatial.R) as [`visualize_spatial_fin.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture20-basic-spatial-vizz/complete_codes/visualize_spatial_fin.R) 74 | - [data_map](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture20-basic-spatial-vizz/data_map) includes raw map data 75 | - [London boroughs](https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london): `London_Borough_Excluding_MHW.dbf`, `London_Borough_Excluding_MHW.shp`, `London_Borough_Excluding_MHW.shx` 76 | - [Vienna boroughs](https://www.data.gv.at/katalog/dataset/stadt-wien_bezirksgrenzenwien): `BEZIRKSGRENZEOGDPolygon.dbf`, `BEZIRKSGRENZEOGDPolygon.shp`, `BEZIRKSGRENZEOGDPolygon.shx` 77 | -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.dbf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.dbf -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.shp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.shp -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.shx: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/BEZIRKSGRENZEOGDPolygon.shx -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.dbf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.dbf -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.shp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.shp -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.shx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/data_map/London_Borough_Excluding_MHW.shx -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/output/heu_prices.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/output/heu_prices.png -------------------------------------------------------------------------------- /lecture20-basic-spatial-vizz/output/lifeexpectancy_world.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture20-basic-spatial-vizz/output/lifeexpectancy_world.png -------------------------------------------------------------------------------- /lecture21-cross-validation/README.md: 
-------------------------------------------------------------------------------- 1 | # Lecture 21: Cross-validating linear models 2 | 3 | ## Motivation 4 | 5 | You have a car that you want to sell in the near future. You want to know what price you can expect if you were to sell it. You may also want to know what you could expect if you were to wait one more year and sell your car then. You have data on used cars with their age and other features, and you can predict price with several kinds of regression models with different right-hand-side variables in different functional forms. How should you select the regression model that would give the best prediction? 6 | 7 | We introduce point prediction versus interval prediction; we discuss the components of prediction error and how to find the best prediction model that will likely produce the best fit (smallest prediction error) in the live data, using observations in the original data. We introduce loss functions in general and mean squared error (MSE) and its square root (RMSE) in particular, to evaluate predictions. We discuss three ways of finding the best predictor model: using all data and the Bayesian Information Criterion (BIC) as the measure of fit, using training–test splitting of the data, and using k-fold cross-validation, which is an improvement on the training–test split. 8 | 9 | ## This lecture 10 | 11 | This lecture refreshes methods for data cleaning and refactoring as well as some basic feature engineering practices. Once the data is set, multiple competing regressions are run and compared via BIC and k-fold cross-validation. Cross-validation is carried out with the `caret` package as well. After the best-performing model is chosen (by RMSE), prediction performance and the associated risks are discussed. In the case when a log-transformed outcome is used in the model, transforming predictions back to levels and evaluating prediction performance are also covered. 12 | 13 | Case studies used: 14 | - [Chapter 13, A: Predicting used car value with linear regressions](https://gabors-data-analysis.com/casestudies/#ch13a-predicting-used-car-value-with-linear-regressions) 15 | - [Chapter 14, A: Predicting used car value: log prices](https://gabors-data-analysis.com/casestudies/#ch14a-predicting-used-car-value-log-prices) 16 | 17 | ## Learning outcomes 18 | After successfully completing [`crossvalidation_usedcars.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture21-cross-validation/crossvalidation_usedcars.R), students should be able to: 19 | 20 | - Clean and prepare data for modeling 21 | - Decide on functional forms and do meaningful variable transformations 22 | - Run multiple regressions and compare performance based on BIC 23 | - Carry out k-fold cross-validation with the `caret` package for different regression models 24 | - Compare the prediction performance of the models 25 | - Understand what happens if a log-transformed outcome is used 26 | - convert predictions back to levels 27 | - compare prediction performance with the other (non-log) models 28 | 29 | ## Dataset used 30 | 31 | - [used-cars](https://gabors-data-analysis.com/datasets/#used-cars) 32 | 33 | ## Lecture Time 34 | 35 | Ideal overall time: **100 mins**.
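As a quick preview of the k-fold workflow named above, here is a minimal hedged sketch with `caret` (the data frame `used_cars` and its columns `price`, `age`, and `odometer` are hypothetical placeholders):

```r
library(caret)

set.seed(42)
# 5-fold cross-validation of a simple linear specification
cv_fit <- train(
  price ~ age + I(age^2) + odometer,
  data      = used_cars,
  method    = "lm",
  trControl = trainControl(method = "cv", number = 5)
)

cv_fit$results$RMSE  # cross-validated RMSE, comparable across candidate models
```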
36 | 37 | 38 | ## Further material 39 | 40 | - This lecture is a modified and combined version of the [`ch13-used-cars.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch13-used-cars-reg/ch13-used-cars.R) and [`ch14-used-cars-log.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch14-used-cars-log/ch14-used-cars-log.R) codes from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 41 | 42 | -------------------------------------------------------------------------------- /lecture22-lasso/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 22: Prediction with LASSO 2 | 3 | ## Motivation 4 | 5 | You want to predict the rental prices of apartments in a big city using their location, size, amenities, and other features. You have access to data on many apartments with many variables. You know how to select the best regression model for prediction from several candidate models. But how should you specify those candidate models, to begin with? In particular, which of the many variables should they include, in what functional forms, and in what interactions? More generally, how can you make sure that the candidates include truly good predictive models? 6 | 7 | How should we specify the regression models? In particular, when we have many candidate predictor variables, how should we select from them, and how should we decide on their functional forms? 8 | 9 | ## This lecture 10 | 11 | This lecture discusses how to build regression models for prediction and how to evaluate the predictions they produce. We discuss how to select 12 | variables out of a large pool of candidate x variables, and how to decide on their functional forms. We introduce LASSO via `glmnet`, an algorithm that can help with variable selection. With respect to evaluating predictions, we discuss why we need a holdout sample for evaluation that is separate from all of the rest of the data we use for model building and selection. 13 | 14 | Case study: 15 | - [Chapter 14, B: Predicting AirBnB apartment prices: selecting a regression model](https://gabors-data-analysis.com/casestudies/#ch14b-predicting-airbnb-apartment-prices-selecting-a-regression-model) 16 | 17 | ## Learning outcomes 18 | After successfully completing [`lasso_aribnb.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture22-lasso/codes/lasso_aribnb.R), students should be able to: 19 | 20 | - Carry out data cleaning and refactoring to prepare for LASSO-type modelling 21 | - Do basic feature engineering for LASSO 22 | - Understand the three-sample approach: 23 | - train and test samples to select the model (cross-validation for tuning parameters) 24 | - a hold-out sample to evaluate model prediction performance 25 | - Carry out model selection with 26 | - (linear) regression models 27 | - LASSO, RIDGE, and Elastic Net via the `glmnet` package 28 | - Run model diagnostics 29 | - performance measure(s) on the hold-out set to evaluate competing models 30 | - stability of the prediction 31 | - specific diagnostic figures for LASSO 32 | 33 | ## Dataset used 34 | 35 | - [airbnb](https://gabors-data-analysis.com/datasets/#airbnb) 36 | 37 | ## Lecture Time 38 | 39 | Ideal overall time: **100 mins**.
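For a concrete sense of the `glmnet` interface used in this lecture, here is a minimal hedged sketch (the design matrix `X`, outcome `y`, and hold-out matrix `X_holdout` are hypothetical placeholders):

```r
library(glmnet)

set.seed(42)
# alpha = 1 is LASSO, alpha = 0 is RIDGE, values in between give Elastic Net
lasso_cv <- cv.glmnet(x = X, y = y, alpha = 1, nfolds = 10)

coef(lasso_cv, s = "lambda.min")                     # coefficients at the best lambda
pred_holdout <- predict(lasso_cv, newx = X_holdout,  # hold-out predictions for
                        s = "lambda.min")            # final model evaluation
```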
40 | 41 | 42 | ## Further material 43 | 44 | - This lecture is a modified version of [`Ch16-airbnb-random-forest.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch16-airbnb-random-forest/Ch16-airbnb-random-forest.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 45 | 46 | -------------------------------------------------------------------------------- /lecture22-lasso/codes/ch14_aux_fncs.R: -------------------------------------------------------------------------------- 1 | price_diff_by_variables2 <- function(df, factor_var, dummy_var, factor_lab, dummy_lab){ 2 | # Looking for interactions. 3 | # It is a function that takes 5 arguments: 1) your data frame, 4 | # 2) the factor variable (like room_type), 5 | # 3) the dummy variable you are interested in (like TV), 4) a label for the factor variable, and 5) a label for the dummy variable 6 | 7 | # Process your data frame and make a new dataframe which contains the stats 8 | factor_var <- as.name(factor_var) 9 | dummy_var <- as.name(dummy_var) 10 | 11 | stats <- df %>% 12 | group_by(!!factor_var, !!dummy_var) %>% 13 | dplyr::summarize(Mean = mean(price, na.rm=TRUE), 14 | se = sd(price)/sqrt(n())) 15 | 16 | stats[,2] <- lapply(stats[,2], factor) 17 | 18 | ggplot(stats, aes_string(colnames(stats)[1], colnames(stats)[3], fill = colnames(stats)[2]))+ 19 | geom_bar(stat='identity', position = position_dodge(width=0.9), alpha=0.8)+ 20 | geom_errorbar(aes(ymin=Mean-(1.96*se),ymax=Mean+(1.96*se)), 21 | position=position_dodge(width = 0.9), width = 0.25)+ 22 | scale_color_manual(name=dummy_lab, 23 | values=c('red','blue')) + 24 | scale_fill_manual(name=dummy_lab, 25 | values=c('red','blue')) + 26 | ylab('Mean Price')+ 27 | xlab(factor_lab) + 28 | theme_bw()+ 29 | theme(panel.grid.major=element_blank(), 30 | panel.grid.minor=element_blank(), 31 | panel.border=element_blank(), 32 | axis.line=element_line(), 33 | legend.position = "top", 34 | #legend.position = c(0.7, 0.9), 35 | legend.box = "vertical", 36 | legend.text = element_text(size = 5), 37 | legend.title = element_text(size = 5, face = "bold"), 38 | legend.key.size = unit(x = 0.4, units = "cm") 39 | ) 40 | } -------------------------------------------------------------------------------- /lecture23-regression-tree/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 23: Prediction with regression trees (CART) 2 | 3 | ## Motivation 4 | 5 | You want to predict the price of used cars as a function of their age and other features. You want to specify a model that includes the most important interactions and nonlinearities of those features, but you don’t know how to start. In particular, you are worried that you can’t start with a very complex regression model and use LASSO or some other method to simplify it because there are way too many potential interactions. Is there an alternative approach to regression that includes the most important interactions without you having to specify them? 6 | 7 | To carry out the prediction of used car prices, we show how to use the regression tree, an alternative to linear regressions that is designed to build a model with the most important interactions and nonlinearities for a prediction. However, the regression tree you build appears to overfit your original data. How can you build a regression tree model that is less prone to overfitting the original data and can thus give a better prediction in the live data?
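Before the lecture details below, here is a minimal preview sketch of growing and pruning a tree with `rpart` (the data frame `used_cars` and its columns `price`, `age`, and `odometer` are hypothetical placeholders):

```r
library(rpart)
library(rpart.plot)

# Grow a regression tree with explicit stopping criteria
tree <- rpart(
  price ~ age + odometer,
  data    = used_cars,
  control = rpart.control(cp = 0.001, minbucket = 20)
)

printcp(tree)  # complexity-parameter table with cross-validated error

# Prune back to the complexity parameter with the smallest CV error
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)

rpart.plot(pruned)  # visualize the pruned tree
```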
8 | 9 | 10 | ## This lecture 11 | 12 | This lecture introduces the regression tree via `rpart`, an alternative to linear regression for prediction purposes that can find the most important predictor variables and their interactions and can approximate any functional form automatically. Regression trees split the data into small bins (subsamples) by the value of the x variables. For a quantitative y, they use the average y value in those small sets to predict y. We introduce the regression tree model and the most widely used algorithm to build a regression tree model. Somewhat confusingly, both the model and the algorithm are called CART (for classification and regression trees), but we reserve this name for the algorithm. We show that a regression tree is an intuitively appealing method to model nonlinearities and interactions among the x variables, but it is rarely used for prediction in itself because it is prone to overfit the original data. Instead, the regression tree forms the basic element of very powerful prediction methods that we’ll cover in the next seminar. 13 | 14 | Case study: 15 | - [Chapter 15, A: Predicting used car value with regression trees](https://gabors-data-analysis.com/casestudies/#ch15a-predicting-used-car-value-with-regression-trees) 16 | 17 | ## Learning outcomes 18 | After successfully completing [`cart_usedcars.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture23-regression-tree/cart_usedcars.R), students should be able to: 19 | 20 | - Understand how the regression tree works 21 | - Estimate a regression tree via the `rpart` package through `caret` 22 | - Visualize regression tree(s) in multiple ways (with `rpart.plot` and with `ggplot2`) 23 | - Set stopping criteria for CART 24 | - depth or level of the tree 25 | - number of leaves 26 | - minimum fit-measure increase required by a split 27 | - Prune a large tree 28 | - find the optimal complexity parameter (also known as the pruning parameter) 29 | - Create a variable importance plot 30 | - Evaluate predictions 31 | - comparing trees 32 | - comparing trees vs linear regressions 33 | 34 | ## Dataset used 35 | 36 | - [used-cars](https://gabors-data-analysis.com/datasets/#used-cars) 37 | 38 | ## Lecture Time 39 | 40 | Ideal overall time: **100 mins**. 41 | 42 | 43 | ## Further material 44 | 45 | - This lecture is a modified version of [`ch15-used-cars-cart.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch15-used-cars-cart/ch15-used-cars-cart.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 46 | 47 | -------------------------------------------------------------------------------- /lecture24-random-forest/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 24: Predicting with Random Forest and Boosting 2 | 3 | ## Motivation 4 | 5 | You need to predict rental prices of apartments using various features. You don’t know how the various features may interact with each other in determining price, so you would like to use a regression tree. But you want to build a model that gives the best possible prediction, better than a single tree. What methods are available that keep the advantage of regression trees but give a better prediction? How should you choose from among those methods? 6 | 7 | How can you grow a random forest, the most widely used tree-based method, to carry out the prediction of apartment rental prices? 
What details do you have to decide on, how should you decide on them, and how can you evaluate the results? 8 | 9 | A regression tree can capture complicated interactions and nonlinearities for predicting a quantitative y variable, but it is prone to overfit the original data, even after appropriate pruning. It turns out, however, that combining multiple regression trees grown on the same data can yield a much better prediction. Such methods are called ensemble methods. There are many ensemble methods based on regression trees, and some are known to produce very good predictions. But these methods are rather complex, and some of them are not straightforward to use. 10 | 11 | ## This lecture 12 | 13 | This lecture introduces two ensemble methods based on regression trees: random forest and boosting. We start by introducing the main idea of ensemble methods: combining results from many imperfect models can lead to a much better prediction than a single model that we try to build to perfection. Of the two methods, we discuss the random forest (RF), via the `ranger` package, in more detail. The random forest is perhaps the most frequently used method to predict a quantitative y variable, both because of its excellent predictive performance and because it is relatively simple to use. Even more than with a single tree, it is hard to understand the underlying patterns of association between y and x that drive the predictions of ensemble methods. We discuss some diagnostic tools that can help with that: variable importance plots, partial dependence plots, and examining the quality of predictions in subgroups. Finally, we show another method: boosting, an alternative approach to making predictions based on an ensemble of regression trees, via `gbm`. 14 | 15 | Note that some of the methods used take a considerable amount of time to run on a simple PC, thus pre-run model results are also uploaded to the repository to speed up the seminar. 16 | 17 | Case study: 18 | - [Chapter 16, A: Predicting apartment prices with random forest](https://gabors-data-analysis.com/casestudies/#ch16a-predicting-apartment-prices-with-random-forest) 19 | 20 | ## Learning outcomes 21 | 22 | The lecturer/students should be aware that there is a separate file for this seminar: [`airbnb_prepare.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture24-random-forest/codes/airbnb_prepare.R), overviewing only the data cleaning and feature engineering process. Understanding how to prepare the data for these methods is extremely important and powerful, as without it data analysts do garbage-in garbage-out analysis... Usually, due to time constraints, this part is not covered in the seminar; students are asked to cover it beforehand.
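For a feel of the tuning workflow this lecture covers, here is a minimal hedged sketch of a random forest via `ranger` through `caret` (the data frame `airbnb_clean` and the specific tuning values are hypothetical placeholders):

```r
library(caret)

set.seed(42)
rf_fit <- train(
  price ~ .,
  data       = airbnb_clean,
  method     = "ranger",
  trControl  = trainControl(method = "cv", number = 5),
  tuneGrid   = expand.grid(
    mtry          = c(5, 7, 9),   # variables tried at each split
    splitrule     = "variance",   # standard rule for regression
    min.node.size = c(5, 10)      # controls leaf size / tree depth
  ),
  importance = "impurity"         # passed to ranger for variable importance
)

rf_fit$bestTune  # tuning combination with the lowest CV RMSE
```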
23 | 24 | After successfully completing [`randomforest_airbnb.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture24-random-forest/codes/randomforest_airbnb.R), students should be able to: 25 | 26 | - Estimate a random forest via `ranger` 27 | - understand the `mtry` parameter and other setup 28 | - use the autotune option 29 | - Understand the random forest's output 30 | - variable importance plots: all, top 10, and grouped variables (typically factors) 31 | - partial dependence plots 32 | - sub-sample analysis for understanding prediction performance across groups 33 | - Run a 'Horse-Race' prediction competition with: 34 | - Linear regression (OLS) 35 | - LASSO 36 | - Regression tree with CART 37 | - Random forest 38 | - GBM model 39 | 40 | ## Dataset used 41 | 42 | - [airbnb](https://gabors-data-analysis.com/datasets/#airbnb) 43 | 44 | ## Lecture Time 45 | 46 | Ideal overall time: **100 mins**. 47 | 48 | 49 | ## Further material 50 | 51 | - This lecture is a modified version of [Ch16-airbnb-random-forest.R](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch16-airbnb-random-forest/Ch16-airbnb-random-forest.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 52 | 53 | -------------------------------------------------------------------------------- /lecture24-random-forest/data/gbm_model.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture24-random-forest/data/gbm_model.RData -------------------------------------------------------------------------------- /lecture24-random-forest/data/rf_model_1.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture24-random-forest/data/rf_model_1.RData -------------------------------------------------------------------------------- /lecture24-random-forest/data/rf_model_2.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture24-random-forest/data/rf_model_2.RData -------------------------------------------------------------------------------- /lecture25-classification-wML/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 25: Prediction and classification of binary outcome with ML tools 2 | 3 | ## Motivation 4 | 5 | Predicting whether people will repay their loans or default on them is important to a bank that sells such loans. Should the bank predict the default probability for applicants? Or, rather, should it classify applicants into prospective defaulters and prospective repayers? And how are the two kinds of predictions related? In particular, can the bank use probability predictions to classify applicants into defaulters and repayers, in a way that takes into account the bank’s costs when a default happens and its costs when it forgoes a good applicant? 6 | 7 | Many companies have relationships with other companies, as suppliers or clients. Whether those other companies stay in business in the future is an important question for them. 
You have rich data on many companies across the years that allows you to see which companies stayed in business and which companies exited, and relate that to various features of the companies. How should you use that data to predict the probability of exit for each company? How should you predict which companies will exit and which will stay in business in the future? 8 | 9 | In the previous seminars we covered the logic of predictive analytics and its most important steps, and we introduced specific methods to predict a quantitative y variable. But sometimes our y variable is not quantitative. The most important case is when y is binary: y = 1 or y = 0. How can we predict such a variable? 10 | 11 | ## This lecture 12 | 13 | This lecture introduces the framework and methods of probability prediction and classification analysis for binary y variables. Probability prediction means predicting the probability that y = 1, with the help of the predictor variables. Classification means predicting the binary y variable itself, with the help of the predictor variables: putting each observation in one of the y categories, also called classes. We build on what we know about probability models and the basics of probability prediction from [lecture16-binary-models](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture16-binary-models). In this seminar, we put that into the framework of predictive analytics to arrive at the best probability model for prediction purposes and to evaluate its performance. We then discuss how we can turn probability predictions into classification with the help of a classification threshold and how we should use a loss function to find the optimal threshold. We discuss how to evaluate a classification by making use of a confusion table and expected loss. We introduce the ROC curve, which illustrates the trade-off of selecting different classification threshold values. We discuss how we can use random forests based on classification trees. 14 | 15 | Case study: 16 | - [Chapter 17, A: Predicting firm exit: probability and classification](https://gabors-data-analysis.com/casestudies/#ch17a-predicting-firm-exit-probability-and-classification) 17 | 18 | ## Learning outcomes 19 | 20 | The lecturer/students should be aware that there is a separate file at the official case studies repository for this seminar: [`ch17-firm-exit-data-prep.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch17-predicting-firm-exit/ch17-firm-exit-data-prep.R), overviewing only the data cleaning and feature engineering process for binary outcomes. Understanding how to prepare the data for these methods is extremely important and powerful, as without it data analysts do garbage-in garbage-out analysis... Usually, due to time constraints, this part is not covered in the seminar; students are asked to cover it beforehand.
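As a compact illustration of thresholds, expected loss, and the ROC curve discussed above, here is a minimal hedged sketch (the observed 0/1 outcome `y`, the predicted probabilities `p`, and the assumed 10:1 cost ratio of false negatives to false positives are hypothetical):

```r
library(pROC)

roc_obj <- roc(response = y, predictor = p, quiet = TRUE)
auc(roc_obj)  # area under the ROC curve

# Search for the classification threshold minimizing expected loss,
# assuming a false negative costs 10 units and a false positive 1 unit
thresholds <- seq(0.05, 0.95, by = 0.01)
exp_loss <- sapply(thresholds, function(t) {
  fp <- sum(p >  t & y == 0)  # false positives at threshold t
  fn <- sum(p <= t & y == 1)  # false negatives at threshold t
  (1 * fp + 10 * fn) / length(y)
})
thresholds[which.min(exp_loss)]  # loss-minimizing threshold
```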
21 | 22 | After successfully completing [`classification_wML.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture25-classification-wML/codes/classification_wML.R), students should be able to: 23 | 24 | - Understand what winsorizing is and how it helps 25 | - Estimate basic linear models for predicting probabilities 26 | - simple linear probability model (review) 27 | - simple logistic model (logit, review) 28 | - Carry out cross-validation with a logit model (via `caret`) 29 | - Estimate LASSO with a logit model (via `glmnet` and `caret`) 30 | - Evaluate model predictions 31 | - calibration curve (review) 32 | - confusion matrix 33 | - ROC curve and AUC (Area Under the Curve) 34 | - model comparison based on RMSE and AUC 35 | - Use a user-defined loss function 36 | - find the optimal threshold based on a self-defined loss function 37 | - show the ROC curve and the optimal point 38 | - show loss-function values for different points on the ROC 39 | - Use CART and Random Forest 40 | - modelling probabilities 41 | - Random Forest with majority voting as an often misunderstood method, especially with a user-defined loss function 42 | 43 | ## Dataset used 44 | 45 | - [bisnode-firms](https://gabors-data-analysis.com/datasets/#bisnode-firms) 46 | 47 | ## Lecture Time 48 | 49 | Ideal overall time: **100 mins**. 50 | 51 | 52 | ## Further material 53 | 54 | - This lecture is a modified version of [`ch17-predicting-firm-exit.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch17-predicting-firm-exit/ch17-predicting-firm-exit.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies). 55 | 56 | -------------------------------------------------------------------------------- /lecture25-classification-wML/codes/auxfuncs_binarywML.R: -------------------------------------------------------------------------------- 1 | ############ 2 | # Helper functions for Bisnode analysis 3 | 4 | twoClassSummaryExtended <- function(data, lev = NULL, model = NULL) 5 | { 6 | lvls <- levels(data$obs) 7 | rmse <- sqrt(mean((data[, lvls[1]] - ifelse(data$obs == lev[2], 0, 1))^2)) 8 | c(defaultSummary(data, lev, model), "RMSE" = rmse) 9 | } 10 | 11 | 12 | createRocPlot <- function(r, file_name, myheight_small = 5.625, mywidth_small = 7.5) { 13 | all_coords <- coords(r, x="all", ret="all", transpose = FALSE) 14 | 15 | roc_plot <- ggplot(data = all_coords, aes(x = fpr, y = tpr)) + 16 | geom_line(color='red', size = 0.7) + 17 | geom_area(aes(fill = 'green', alpha=0.4), alpha = 0.3, position = 'identity', color = 'red') + 18 | scale_fill_viridis(discrete = TRUE, begin=0.6, alpha=0.5, guide = "none") + 19 | xlab("False Positive Rate (1-Specificity)") + 20 | ylab("True Positive Rate (Sensitivity)") + 21 | geom_abline(intercept = 0, slope = 1, linetype = "dotted", col = "black") + 22 | scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, .1), expand = c(0, 0.01)) + 23 | scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1), expand = c(0.01, 0)) + 24 | theme_bw() 25 | 26 | roc_plot 27 | } 28 | 29 | 30 | create_calibration_plot <- function(data, prob_var, actual_var, y_lab = "Actual event probability", n_bins = 10, breaks = NULL) { 31 | 32 | if (is.null(breaks)) { 33 | breaks <- seq(0,1,length.out = n_bins + 1) 34 | } 35 | 36 | binned_data <- data %>% 37 | mutate( 38 | prob_bin = cut(!!as.name(prob_var), 39 | breaks = breaks, 40 | include.lowest = TRUE) 41 | ) %>% 42 | group_by(prob_bin, .drop=FALSE) %>% 43 | summarise(mean_prob = mean(!!as.name(prob_var)), mean_actual = mean(!!as.name(actual_var)), n = n()) 44 | 45 | p <- 
# Plots the expected loss across classification thresholds; note that it
# relies on FP and FN (the unit costs of a false positive and a false
# negative) being defined in the calling environment
createLossPlot <- function(r, best_coords, myheight_small = 5.625, mywidth_small = 7.5) {
  t <- best_coords$threshold[1]
  sp <- best_coords$specificity[1]
  se <- best_coords$sensitivity[1]
  n <- rowSums(best_coords[c("tn", "tp", "fn", "fp")])[1]

  all_coords <- coords(r, x = "all", ret = "all", transpose = FALSE)
  all_coords <- all_coords %>%
    mutate(loss = (fp * FP + fn * FN) / n)
  l <- all_coords[all_coords$threshold == t, "loss"]

  loss_plot <- ggplot(data = all_coords, aes(x = threshold, y = loss)) +
    geom_line(color = 'red', size = 0.7) +
    scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
    geom_vline(xintercept = t, color = 'blue') +
    annotate(geom = "text", x = t, y = min(all_coords$loss),
             label = paste0("best threshold: ", round(t, 2)),
             colour = 'blue', angle = 90, vjust = -1, hjust = -0.5, size = 7) +
    annotate(geom = "text", x = t, y = l,
             label = round(l, 2), hjust = -0.3, size = 7) +
    theme_bw()

  loss_plot
}


# Draws the ROC curve (sensitivity against reversed specificity) and marks
# the optimal point implied by the loss function; `file_name` and the size
# arguments are kept for interface compatibility but are not used here
createRocPlotWithOptimal <- function(r, best_coords, file_name, myheight_small = 5.625, mywidth_small = 7.5) {

  all_coords <- coords(r, x = "all", ret = "all", transpose = FALSE)
  t <- best_coords$threshold[1]
  sp <- best_coords$specificity[1]
  se <- best_coords$sensitivity[1]

  roc_plot <- ggplot(data = all_coords, aes(x = specificity, y = sensitivity)) +
    geom_line(color = 'red', size = 0.7) +
    scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
    scale_x_reverse(breaks = seq(0, 1, by = 0.1)) +
    geom_point(aes(x = sp, y = se)) +
    annotate(geom = "text", x = sp, y = se,
             label = paste(round(sp, 2), round(se, 2), sep = ", "),
             hjust = 1, vjust = -1, size = 7) +
    xlab("False Positive Rate (1-Specificity)") +
    ylab("True Positive Rate (Sensitivity)") +
    theme_bw()

  roc_plot
}

--------------------------------------------------------------------------------
/lecture25-classification-wML/data/bisnode_firms_clean.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabors-data-analysis/da-coding-rstats/963580e2015d842cc0c427602207ff1b8d7934bc/lecture25-classification-wML/data/bisnode_firms_clean.RData

--------------------------------------------------------------------------------
/lecture26-long-term-time-series-wML/README.md:
--------------------------------------------------------------------------------
# Lecture 26: Forecasting from Time Series Data I - ML methods for simple models

## Motivation

Your task is to predict the number of daily tickets sold for next year in a swimming pool in a large city. The swimming pool sells tickets through its sales terminal, which records all transactions. You aggregate that data to daily frequency. How should you use the information on daily sales to produce your forecast? In particular, how should you model trends, and how should you model seasonality by months of the year and days of the week, to produce the best prediction?


## This lecture

This lecture discusses forecasting: prediction from time series data for one or more time periods in the future. The focus of this chapter is forecasting future values of one variable by making use of past values of the same variable, and possibly other variables, too. We build on what we learned about time series regressions in [lecture18-timeseries-regression](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture18-timeseries-regression). We start with forecasts with a long horizon, which means many time periods into the future. Such forecasts use the information on trends, seasonality, and other long-term features of the time series.

Case study:
- [Chapter 18, A: Forecasting daily ticket sales for a swimming pool](https://gabors-data-analysis.com/casestudies/#ch18a-forecasting-daily-ticket-sales-for-a-swimming-pool)

## Learning outcomes
After successfully completing [`long_term_swimming.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture26-long-term-time-series-wML/long_term_swimming.R), students should be able to:

- Carry out data munging with time series (review)
- Add deterministic variables such as trends and yearly/monthly/weekly seasonality (a minimal sketch follows this list)
- Add deterministic variables with the `timeDate` package, such as holidays, weekdays, etc.
- Split samples with time series data
- Estimate simple linear models:
  - with a deterministic trend/seasonality and/or other deterministic variables (holidays, etc.)
- Cross-validate with time series data
- Use the `prophet` package
- Produce forecasts
- Compare models based on forecasting performance (RMSE)
- Represent model fit and forecasts graphically
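
As a taste of the deterministic-features approach, here is a minimal, self-contained sketch. It is an illustration, not the lecture code: the `daily` tibble, its column names, and the Poisson placeholder outcome are all made up.

```r
library(dplyr)
library(lubridate)

# Toy daily data; in the lecture the outcome is daily ticket sales
set.seed(2022)
daily <- tibble(
  date    = seq(as.Date("2010-01-01"), as.Date("2014-12-31"), by = "day"),
  tickets = rpois(length(date), lambda = 100)
)

daily <- daily %>%
  mutate(
    trend   = row_number(),                       # linear time trend
    month   = factor(month(date)),                # yearly seasonality
    weekday = factor(wday(date, week_start = 1))  # weekly seasonality
  )

# OLS with a trend and seasonal dummies -- the simplest long-horizon model
reg <- lm(tickets ~ trend + month + weekday, data = daily)
summary(reg)
```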
## Dataset used

- [swim-transactions](https://gabors-data-analysis.com/datasets/#swim-transactions)

## Lecture Time

Ideal overall time: **50-60 mins**.


## Further material

- This lecture is a modified version of [ch18-swimmingpool-predict.R](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch18-swimmingpool/ch18-swimmingpool-predict.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies).

--------------------------------------------------------------------------------
/lecture27-short-term-time-series-ARIMA-VAR/README.md:
--------------------------------------------------------------------------------
# Lecture 27: Forecasting from Time Series Data II - ARIMA and VAR models

## Motivation

Your task is to predict how house prices will move in a particular city over the next few months. You have monthly data on the house price index of the city, and you can collect monthly data on other variables that may be correlated with how house prices move. How should you use that data to forecast changes in house prices for the next few months? In particular, how should you use those other variables to help that forecast even though you don’t know their future values?

## This lecture

This lecture discusses forecasting: prediction from time series data for one or more time periods in the future. The focus of this chapter is forecasting future values of one variable by making use of past values of the same variable, and possibly other variables, too. We build on what we learned about time series regressions in [lecture18-timeseries-regression](https://github.com/gabors-data-analysis/da-coding-rstats/tree/main/lecture18-timeseries-regression). We now turn to short-horizon forecasts that predict y for a few time periods ahead. These forecasts exploit the serial correlation of the time series of y, in addition to its long-term features. We introduce autoregression (AR) and ARIMA models via the `fpp3` package; these capture the patterns of serial correlation and can be used for short-horizon forecasting. We then turn to using other variables in forecasting and introduce vector autoregression (VAR) models, which help forecast the future values of the x variables that we use to forecast y. Finally, we discuss how to carry out cross-validation in forecasting, and the specific challenges and opportunities that the time series nature of our data creates for assessing external validity.

Case study:
- [Chapter 18, B: Forecasting a house price index](https://gabors-data-analysis.com/casestudies/#ch18b-forecasting-a-house-price-index)

## Learning outcomes
After successfully completing [`short_term_priceindex.R`](https://github.com/gabors-data-analysis/da-coding-rstats/blob/main/lecture27-short-term-time-series-ARIMA-VAR/short_term_priceindex.R), students should be able to:

- Decide whether the data needs to be transformed to stationarity
- Estimate ARIMA models with the `fpp3` package (a minimal sketch follows this list):
  - self-specified lags for the AR, I, and MA components
  - automatic lag selection
  - handling trend and seasonality within ARIMA
  - understanding the 'S' in SARIMA and why we do not use it in this course
- Cross-validate ARIMA models
- Work with vector autoregressive (VAR) models:
  - estimation and cross-validation
- Produce forecasts:
  - compare models based on forecast performance
  - check external validity on a longer horizon
- Use fan charts for assessing risks
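
As a taste of the `fpp3` workflow referenced above, here is a minimal sketch on a simulated monthly series; the toy data and model labels are illustrative, not the case-study code. VAR models follow the same `model()`/`forecast()` pattern (via `fable`'s `VAR()`).

```r
library(fpp3)  # meta-package: loads tsibble, fable, feasts, dplyr, ggplot2, ...

# Simulated monthly random walk standing in for a price index
set.seed(2022)
y <- tsibble(
  month = yearmonth("2000 Jan") + 0:119,
  value = cumsum(rnorm(120)),
  index = month
)

fit <- y %>%
  model(
    arima_auto = ARIMA(value),                # automatic (p, d, q) selection
    arima_011  = ARIMA(value ~ pdq(0, 1, 1))  # self-specified lags
  )

# 12-month-ahead forecasts with interval fans, plotted over the history
fit %>%
  forecast(h = 12) %>%
  autoplot(y)
```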
## Lecture Time

Ideal overall time: **50-80 mins**.


## Further material

- This lecture is a modified version of [`ch18-ts-pred-homeprices.R`](https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch18-case-shiller-la/ch18-ts-pred-homeprices.R) from [Gabor's case study repository](https://github.com/gabors-data-analysis/da_case_studies).

--------------------------------------------------------------------------------