├── .gitignore ├── README.md ├── apps ├── README.md ├── ch03-histograms │ ├── README.md │ └── app.R ├── ch08-corr-coeff-diagrams │ ├── README.md │ └── app.R ├── ch10-heights-data │ ├── README.md │ ├── app.R │ └── helpers.R ├── ch11-regression-residuals │ ├── README.md │ └── app.R ├── ch11-regression-strips │ ├── README.md │ ├── app.R │ └── helpers.R ├── ch16-chance-error │ ├── README.md │ └── app.R ├── ch17-demere-games │ ├── README.md │ └── app.R ├── ch17-expected-value-std-error │ ├── README.md │ ├── app.R │ └── helpers.R ├── ch18-coin-tossing │ ├── README.md │ └── app.R ├── ch18-roll-dice-product │ ├── README.md │ ├── app.R │ └── helpers.R ├── ch18-roll-dice-sum │ ├── README.md │ ├── app.R │ └── helpers.R ├── ch20-sampling-men │ ├── README.md │ └── app.R ├── ch21-accuracy-percentages │ ├── README.md │ ├── app.R │ └── helpers.R └── ch23-accuracy-averages │ ├── README.md │ ├── app.R │ └── helpers.R ├── data ├── abalone.csv ├── distributions.csv ├── galton.csv ├── nba_players.csv ├── pearson.csv ├── stock-earnings-prices.csv └── vegetables-smoking.csv ├── hw ├── README.md ├── hw01-questions.pdf ├── hw02-questions.pdf ├── hw03-questions.pdf ├── hw04-questions.pdf ├── hw05-questions.pdf ├── hw06-questions.pdf ├── hw07-questions.pdf ├── hw08-questions.pdf ├── hw09-questions.pdf ├── hw10-questions.pdf ├── hw11-questions.pdf └── hw12-questions.pdf ├── labs └── README.md ├── lectures └── README.md ├── other ├── Karl-Pearson-and-the-origins-of-modern-statistics.pdf ├── Quetelet-and-the-emergence-of-the-behavioral-sciences.pdf ├── The-strange-science-of-Francis-Galton.pdf ├── formula-sheet-final.pdf ├── formula-sheet-midterm1.pdf ├── formula-sheet-midterm2.pdf ├── standard-normal-table.pdf ├── t-table.pdf └── z-table.pdf ├── scripts ├── 01-R-introduction.Rmd ├── 01-R-introduction.pdf ├── 02-data-variables.Rmd ├── 02-data-variables.pdf ├── 03-histograms.Rmd ├── 03-histograms.pdf ├── 04-measures-center.Rmd ├── 04-measures-center.pdf ├── 05-measures-spread.Rmd ├── 
05-measures-spread.pdf ├── 06-normal-curve.Rmd ├── 06-normal-curve.pdf ├── 07-scatter-diagrams.Rmd ├── 07-scatter-diagrams.pdf ├── 08-correlation.Rmd ├── 08-correlation.pdf ├── 09-regression-line.Rmd ├── 09-regression-line.pdf ├── 10-prediction-and-errors-in-regression.Rmd ├── 10-prediction-and-errors-in-regression.pdf ├── 11-binomial-formula.Rmd ├── 11-binomial-formula.pdf ├── 12-chance-process.Rmd ├── 12-chance-process.pdf ├── Makefile ├── README.md └── images │ ├── karl-pearson.jpg │ └── western-conference-standings-2016.png └── syllabus ├── README.md ├── mrs-mutner-rules.jpg ├── syllabus-stat131A.md └── syllabus-stat20.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Mac specific 2 | *.DS_Store 3 | 4 | # latex specific 5 | *.aux 6 | *.log 7 | 8 | # files in labs/ 9 | labs/.DS_Store 10 | labs/*.html 11 | 12 | # files in data/ 13 | data/.DS_Store 14 | data/.Rhistory 15 | 16 | # files in units/ 17 | scripts/.DS_Store 18 | scripts/.Rhistory 19 | 20 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## About 2 | 3 | This repository holds the course materials for the Spring 2017 edition of: 4 | 5 | - Stat 20 __Introduction to Probability and Statistics__ at UC Berkeley. 6 | - Stat 131A __Introduction to Probability and Statistics for Life Scientists__ at UC Berkeley. 7 | 8 | 9 | ## Contents 10 | 11 | - [Syllabus](syllabus): Course logistics and policies. 12 | - [Lectures](lectures): Calendar of weekly topics, and lectures material. 13 | - [HW Assignments](hw): Weekly assignments. 14 | - [Labs](labs): Topics from textbook for lab. 15 | - [Scripts](scripts): Tutorial R scripts. 16 | - [Apps](apps): Shiny apps used in lecture's demos. 17 | - [Data](data): Data sets. 18 | - [Other](other): Other resources (e.g. tables, articles). 
19 | 20 | 21 | ## R and RStudio 22 | 23 | We will use the statistical software __[R](https://www.r-project.org/)__ and the 24 | [IDE](https://en.wikipedia.org/wiki/Integrated_development_environment) 25 | __[RStudio](https://www.rstudio.com/)__ as a computational tool to 26 | practice and apply the key concepts of the course. 27 | 28 | Both R and RStudio are free, and are available for Mac OS X, Windows, and Linux. 29 | 30 | To install R (Binary version): 31 | 32 | - Mac: [https://cran.cnr.berkeley.edu/bin/macosx/](https://cran.cnr.berkeley.edu/bin/macosx/) 33 | - Windows: [https://cran.cnr.berkeley.edu/bin/windows/](https://cran.cnr.berkeley.edu/bin/windows/) 34 | 35 | To install RStudio (free desktop version): 36 | 37 | - RStudio Desktop version [https://www.rstudio.com/products/rstudio/download/](https://www.rstudio.com/products/rstudio/download/) 38 | 39 | 40 | ----- 41 | 42 | ### License 43 | 44 | Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. 45 | 46 | Author: [Gaston Sanchez](http://gastonsanchez.com) 47 | -------------------------------------------------------------------------------- /apps/README.md: -------------------------------------------------------------------------------- 1 | # Shiny Apps 2 | 3 | This is a collection of Shiny apps to be used mainly during lecture to illustrate some of the concepts in the textbook _Statistics_ (FPP) 4th edition. 4 | 5 | 6 | ## Running the apps 7 | 8 | The easiest way to run an app is with the `runGitHub()` function from the `"shiny"` package. Please make sure you have installed the package `"shiny"`. In case of doubt, run: 9 | 10 | ```R 11 | install.packages("shiny") 12 | ``` 13 | 14 | 15 | For instance, to run the app contained in the [ch03-histograms](/ch03-histograms) folder, run the following code in R: 16 | 17 | ```R 18 | library(shiny) 19 | 20 | # Run an app from a subdirectory in the repo 21 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch03-histograms") 22 | ``` 23 | -------------------------------------------------------------------------------- /apps/ch03-histograms/README.md: -------------------------------------------------------------------------------- 1 | # Histograms for NBA players data 2 | 3 | This is a Shiny app that generates histograms using data on NBA players from the 2015-2016 season. 4 | 5 | 6 | ## Motivation 7 | 8 | The goal is to provide examples of histograms and distributions of quantitative variables, following __Statistics, Chapter 3: The Histogram__ (pages 32-56): 9 | 10 | - A _variable_ is a characteristic of the subjects in a study. It can be either qualitative or quantitative. 11 | - A _histogram_ is a visual display used to look at the distribution of a quantitative variable. 12 | - A _histogram_ represents percents by area. It consists of a set of blocks.
The area of each block represents the percentage of cases in the corresponding class interval. 13 | - With the _density scale_, the height of each block equals the percentage of cases in the corresponding class interval, divided by the length of that interval. 14 | 15 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 16 | 17 | 18 | ## Data 19 | 20 | The data set is in the `nba_players.csv` file (see `data/` folder), which contains 528 rows and 39 columns, although this app only uses quantitative variables. 21 | 22 | 23 | ## How to run it? 24 | 25 | 26 | ```R 27 | library(shiny) 28 | 29 | # Easiest way is to use runGitHub 30 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch03-histograms") 31 | ``` 32 | 33 | -------------------------------------------------------------------------------- /apps/ch03-histograms/app.R: -------------------------------------------------------------------------------- 1 | # Title: Histograms with NBA Players Data 2 | # Description: this app uses data of NBA players to show various histograms 3 | # Author: Gaston Sanchez 4 | 5 | library(shiny) 6 | 7 | # data set 8 | nba <- read.csv('../../data/nba_players.csv', header = TRUE) 9 | 10 | # quantitative variables 11 | quantitative <- c( 12 | "height","weight","salary","experience","age","games","games_started", 13 | "minutes_played","field_goals","field_goal_attempts","field_goal_percent", 14 | "points3","points3_attempts","points3_percent","points2","points2_attempts", 15 | "points2_percent","effective_field_goal_percent","free_throws", 16 | "free_throw_attempts","free_throw_percent","offensive_rebounds", 17 | "defensive_rebounds","total_rebounds","assists","steals","blocks", 18 | "turnovers","fouls","points") 19 | 20 | # select just quantitative variables 21 | dat <- nba[ ,quantitative] 22 | 23 | 24 | # Define UI for application that draws a histogram 25 | ui <- fluidPage( 26 | 27 | # Application title
28 | titlePanel("NBA Players"), 29 | 30 | # Sidebar with a slider input for number of bins 31 | sidebarLayout( 32 | sidebarPanel( 33 | selectInput("variable", "Select a Variable", 34 | choices = colnames(dat), selected = 'height'), 35 | 36 | sliderInput("bins", 37 | "Number of bins:", 38 | min = 1, 39 | max = 50, 40 | value = 10), 41 | 42 | checkboxInput('density', label = strong('Use density scale')) 43 | ), 44 | 45 | # Show a plot of the generated distribution 46 | mainPanel( 47 | plotOutput("histogram") 48 | ) 49 | ) 50 | ) 51 | 52 | 53 | # Define server logic required to draw a histogram 54 | server <- function(input, output) { 55 | 56 | output$histogram <- renderPlot({ 57 | # generate bins based on input$bins from ui.R 58 | x <- na.omit(dat[ ,input$variable]) 59 | bins <- seq(min(x), max(x), length.out = input$bins + 1) 60 | 61 | histogram <- hist(x, breaks = bins, 62 | probability = input$density, 63 | col = 'gray80', border = 'white', las = 1, 64 | axes = FALSE, xlab = "", 65 | main = paste("Histogram of", input$variable)) 66 | axis(side = 2, las = 1) 67 | axis(side = 1, at = bins, labels = round(bins, 2)) 68 | 69 | }) 70 | } 71 | 72 | # Run the application 73 | shinyApp(ui = ui, server = server) 74 | 75 | -------------------------------------------------------------------------------- /apps/ch08-corr-coeff-diagrams/README.md: -------------------------------------------------------------------------------- 1 | # Correlation Coefficient Diagrams 2 | 3 | This is a Shiny app that generates scatter diagrams based on the specified correlation coefficient. 4 | 5 | 6 | ## Motivation 7 | 8 | The goal is to provide examples of scatter diagrams as those displayed on the FPP book, page 127. See Chapter 8: The Correlation Coefficient (page 127). 9 | 10 | The scatter diagrams are based on random generated data following a multivariate normal distribution. 11 | 12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. 
Norton & Company. 13 | 14 | 15 | ## How to run it? 16 | 17 | 18 | ```R 19 | library(shiny) 20 | 21 | # Easiest way is to use runGitHub 22 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch08-corr-coeff-diagrams") 23 | ``` 24 | 25 | -------------------------------------------------------------------------------- /apps/ch08-corr-coeff-diagrams/app.R: -------------------------------------------------------------------------------- 1 | # 2 | # This is a Shiny web application. You can run the application by clicking 3 | # the 'Run App' button above. 4 | # 5 | # Find out more about building applications with Shiny here: 6 | # 7 | # http://shiny.rstudio.com/ 8 | # 9 | 10 | library(shiny) 11 | library(MASS) 12 | 13 | 14 | # Define UI for application that draws a histogram 15 | ui <- fluidPage( 16 | 17 | # Application title 18 | titlePanel("Scatter Diagrams and Correlation"), 19 | 20 | # Sidebar with a slider input for number of bins 21 | sidebarLayout( 22 | sidebarPanel( 23 | numericInput("seed", 24 | "Random seed", 25 | min = 100, 26 | max = 99999, 27 | value = 1234), 28 | sliderInput("corr", 29 | "Correlation Coefficient", 30 | min = -1, 31 | max = 1, 32 | step = 0.05, 33 | value = 0.7), 34 | sliderInput("size", 35 | "Number of points", 36 | min = 10, 37 | max = 5000, 38 | step = 5, 39 | value = 500), 40 | sliderInput("cex", 41 | "Size of points", 42 | min = 0, 43 | max = 5, 44 | step = 0.1, 45 | value = 1), 46 | sliderInput("alpha", 47 | "Transparency of points", 48 | min = 0, 49 | max = 1, 50 | step = 0.01, 51 | value = 0.8) 52 | ), 53 | 54 | # Show a plot of the generated distribution 55 | mainPanel( 56 | plotOutput("scatterplot") 57 | ) 58 | ) 59 | ) 60 | 61 | # Define server logic required to draw a histogram 62 | server <- function(input, output) { 63 | 64 | output$scatterplot <- renderPlot({ 65 | # generate bins based on input$bins from ui.R 66 | set.seed(input$seed) 67 | cor_matrix = matrix(c(1, input$corr, input$corr, 1), 2) 68 | xy = 
mvrnorm(input$size, c(0, 0), cor_matrix) 69 | plot(xy, type = "n", axes=FALSE, xlab="", ylab="", 70 | xlim=c(-3, 3), ylim=c(-3, 3)) 71 | abline(h=0, v=0, col="gray80", lwd = 2) 72 | points(xy[,1], xy[,2], pch=20, cex=input$cex, 73 | col=rgb(0.45, 0.59, 0.84, alpha = input$alpha)) 74 | }) 75 | } 76 | 77 | # Run the application 78 | shinyApp(ui = ui, server = server) 79 | 80 | -------------------------------------------------------------------------------- /apps/ch10-heights-data/README.md: -------------------------------------------------------------------------------- 1 | # Regression Scatterplot for Pearson's Height data 2 | 3 | This is a Shiny app that generates a scatter diagram to illustrate the regression method using Pearson's heights data set. 4 | 5 | 6 | ## Motivation 7 | 8 | The goal is to provide a visual display of some of the concepts from __Statistics, Chapter 10: Regression__ (pages 158-165): 9 | 10 | - Point of averages 11 | - SD line 12 | - Graph of averages 13 | - Regression line 14 | - Correlation coefficient 15 | 16 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 17 | 18 | 19 | ## Data 20 | 21 | This app uses Pearson's height data, stored in the `pearson.csv` file in the `data/` folder. The data set contains 1078 rows and 2 columns: 22 | 23 | - `Father`: The father's height, in inches 24 | - `Son`: The height of the son, in inches 25 | 26 | The app uses both variables: `Father` and `Son` 27 | 28 | Original source: [http://www.math.uah.edu/stat/data/Pearson.csv](http://www.math.uah.edu/stat/data/Pearson.csv) 29 | 30 | 31 | ## How to run it?
32 | 33 | There are many ways to download the app and run it: 34 | 35 | ```R 36 | library(shiny) 37 | 38 | # Easiest way is to use runGitHub 39 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch10-heights-data") 40 | ``` 41 | -------------------------------------------------------------------------------- /apps/ch10-heights-data/app.R: -------------------------------------------------------------------------------- 1 | # 2 | # This is a Shiny web application. You can run the application by clicking 3 | # the 'Run App' button above. 4 | # 5 | # Find out more about building applications with Shiny here: 6 | # 7 | # http://shiny.rstudio.com/ 8 | # 9 | 10 | library(shiny) 11 | source('helpers.R') 12 | 13 | # reading a couple of lines just to get the names of variables 14 | dat <- read.csv('../../data/pearson.csv') 15 | 16 | # Define UI for application that draws a histogram 17 | ui <- fluidPage( 18 | 19 | # Application title 20 | titlePanel("Pearson's Height Data Set"), 21 | 22 | # Define the sidebar with one input 23 | sidebarPanel( 24 | selectInput("xvar", "X-axis variable", 25 | choices = colnames(dat), selected = 'Father'), 26 | selectInput("yvar", "Y-axis variable", 27 | choices = colnames(dat), selected = 'Son'), 28 | sliderInput("cex", 29 | label = "Size of points", 30 | min = 0, max = 3, value = 2, step = 0.1), 31 | checkboxInput('reg_line', label = strong('Regression line')), 32 | checkboxInput('point_avgs', label = strong('Point of Averages')), 33 | checkboxInput('sd_line', label = strong('SD line')), 34 | checkboxInput('sd_guides', label = strong('SD guides')), 35 | sliderInput("breaks", 36 | label = "Graph of Averages", 37 | min = 0, max = 10, value = 0, step = 1), 38 | hr(), 39 | helpText('Correlation:'), 40 | verbatimTextOutput("correlation") 41 | ), 42 | 43 | # Show a plot of the generated distribution 44 | mainPanel( 45 | plotOutput("datPlot") 46 | ) 47 | ) 48 | 49 | 50 | # Define server logic required to draw a histogram 51 | 
server <- function(input, output) { 52 | 53 | # Correlation 54 | output$correlation <- renderPrint({ 55 | cor(dat[,input$xvar], dat[,input$yvar]) 56 | }) 57 | 58 | # Fill in the spot we created for a plot 59 | output$datPlot <- renderPlot({ 60 | # standard deviations 61 | sdx <- sd(dat[,input$xvar]) 62 | sdy <- sd(dat[,input$yvar]) 63 | avgx <- mean(dat[,input$xvar]) 64 | avgy <- mean(dat[,input$yvar]) 65 | 66 | # Render scatterplot 67 | plot(dat[,input$xvar], dat[,input$yvar], 68 | main = 'scatter diagram', type = 'n', axes = FALSE, 69 | xlab = paste(input$xvar, " height (in)"), 70 | ylab = paste(input$yvar, "height (in)")) 71 | box() 72 | axis(side = 1) 73 | axis(side = 2, las = 1) 74 | points(dat[,input$xvar], dat[,input$yvar], 75 | pch = 21, col = 'white', bg = '#777777aa', 76 | lwd = 2, cex = input$cex) 77 | # Point of Averages 78 | if (input$point_avgs) { 79 | points(avgx, avgy, 80 | pch = 21, col = 'white', bg = 'tomato', 81 | lwd = 3, cex = 3) 82 | } 83 | # SD line 84 | if (input$sd_line) { 85 | cor_xy <- cor(dat[,input$xvar], dat[,input$yvar]) 86 | if (cor_xy >= 0) { 87 | sd_line <- line_equation(avgx - sdx, avgy - sdy, avgx + sdx, avgy + sdy) 88 | abline(a = sd_line$intercept, b = sd_line$slope, 89 | lwd = 4, lty = 2, col = 'orange') 90 | } else { 91 | sd_line <- line_equation(avgx + sdx, avgy - sdy, avgx - sdx, avgy + sdy) 92 | abline(a = sd_line$intercept, b = sd_line$slope, 93 | lwd = 4, lty = 2, col = 'orange') 94 | } 95 | } 96 | # SD guides 97 | if (input$sd_guides) { 98 | abline(v = c(avgx - sdx, avgx + sdx), 99 | h = c(avgy - sdy, avgy + sdy), 100 | lty = 1, lwd = 3, col = '#FFA600aa') 101 | } 102 | # Graph of averages 103 | if (input$breaks > 1) { 104 | graph_avgs <- averages(dat[,input$xvar], dat[,input$yvar], 105 | breaks = input$breaks) 106 | points(graph_avgs$x, graph_avgs$y, pch = "+", 107 | col = '#ff6700', cex = 3) 108 | } 109 | # Regression line 110 | if (input$reg_line) { 111 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar]) 112 | 
abline(reg = reg, lwd = 4, col = '#4878DF') 113 | } 114 | 115 | }, height = 650, width = 650) 116 | } 117 | 118 | 119 | # Run the application 120 | shinyApp(ui = ui, server = server) 121 | 122 | -------------------------------------------------------------------------------- /apps/ch10-heights-data/helpers.R: -------------------------------------------------------------------------------- 1 | # computes the slope and intercept terms of a line between two points 2 | line_equation <- function(x1, y1, x2, y2) { 3 | slope <- (y2 - y1) / (x2 - x1) 4 | intercept <- y1 - slope*x1 5 | list(intercept = intercept, slope = slope) 6 | } 7 | 8 | # computes x,y averages depending on a given number of intervals (x-axis) 9 | # (to be used for showing graph of averages) 10 | averages <- function(x, y, breaks = 5) { 11 | x_cut <- cut(x, breaks = breaks) 12 | y_averages <- as.vector(tapply(y, x_cut, mean)) 13 | x_boundaries <- gsub('\\(', '', levels(x_cut)) 14 | x_boundaries <- gsub('\\]', '', x_boundaries) 15 | x_boundaries <- strsplit(x_boundaries, ',') 16 | x1 <- as.numeric(sapply(x_boundaries, function(u) u[1])) 17 | x2 <- as.numeric(sapply(x_boundaries, function(u) u[2])) 18 | x_midpoints <- x1 + (x2 - x1) / 2 19 | list(x = x_midpoints, y = y_averages) 20 | } 21 | -------------------------------------------------------------------------------- /apps/ch11-regression-residuals/README.md: -------------------------------------------------------------------------------- 1 | # Regression Residuals 2 | 3 | This is a Shiny app that generates two graphs: 1) a scatterplot with a 4 | regression line, and 2) a residual plot from the fitted regression line. 5 | 6 | 7 | ## Motivation 8 | 9 | The goal is to illustrate the concepts of homoscedastic and heteroscedastic 10 | residuals described in 11 | __Statistics, Chapter 11: The R.M.S. Error for Regression__ (pages 180-201). 12 | 13 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger 14 | Purves (2007). Fourth Edition.
Norton & Company. 15 | 16 | 17 | ## Data 18 | 19 | This app uses the data from NBA basketball players in the 2015-2016 season. 20 | The csv file `nba_players.csv` is in the `data/` folder of the GitHub repository. 21 | 22 | 23 | ## How to run it? 24 | 25 | 26 | ```R 27 | library(shiny) 28 | 29 | # Easiest way is to use runGitHub 30 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch11-regression-residuals") 31 | ``` 32 | -------------------------------------------------------------------------------- /apps/ch11-regression-residuals/app.R: -------------------------------------------------------------------------------- 1 | # Title: Regression residuals 2 | # Description: this app uses NBA players data to show a regression line and its residual plot 3 | # Author: Gaston Sanchez 4 | 5 | library(shiny) 6 | 7 | # data set 8 | nba <- read.csv('../../data/nba_players.csv') 9 | 10 | # quantitative variables 11 | quantitative <- c( 12 | "height","weight","experience","age","games","games_started", 13 | "minutes_played","field_goals","field_goal_attempts","field_goal_percent", 14 | "points3","points3_attempts","points3_percent","points2","points2_attempts", 15 | "points2_percent","effective_field_goal_percent","free_throws", 16 | "free_throw_attempts","free_throw_percent","offensive_rebounds", 17 | "defensive_rebounds","total_rebounds","assists","steals","blocks", 18 | "turnovers","fouls","points") 19 | 20 | # select just quantitative variables 21 | dat <- nba[ ,quantitative] 22 | 23 | # Define UI for application that draws a scatter diagram and a residual plot 24 | ui <- fluidPage( 25 | # Give the page a title 26 | titlePanel("NBA Players"), 27 | 28 | # Generate a row with a sidebar 29 | sidebarLayout( 30 | 31 | # Define the sidebar with the variable selectors 32 | sidebarPanel( 33 | selectInput("xvar", "X-axis variable", 34 | choices = colnames(dat), selected = 'height'), 35 | selectInput("yvar", "Y-axis variable", 36 | choices = colnames(dat), selected =
'weight'), 37 | hr(), 38 | helpText('Correlation:'), 39 | verbatimTextOutput("correlation"), 40 | helpText('r.m.s. error:'), 41 | verbatimTextOutput("rms_error") 42 | ), 43 | 44 | # Create spots for the two plots 45 | mainPanel( 46 | plotOutput("datPlot"), 47 | plotOutput("residualPlot") 48 | ) 49 | 50 | ) 51 | ) 52 | 53 | 54 | # Define server logic required to draw the two plots 55 | server <- function(input, output) { 56 | 57 | # Correlation 58 | output$correlation <- renderPrint({ 59 | cor(dat[,input$xvar], dat[,input$yvar]) 60 | }) 61 | 62 | # r.m.s. error 63 | output$rms_error <- renderPrint({ 64 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar]) 65 | sqrt(mean((reg$residuals)^2)) 66 | }) 67 | 68 | # Fill in the spot we created for a plot 69 | output$datPlot <- renderPlot({ 70 | 71 | # Render a scatter diagram 72 | plot(dat[,input$xvar], dat[,input$yvar], 73 | main = 'scatter diagram', type = 'n', axes = FALSE, 74 | xlab = input$xvar, ylab = input$yvar) 75 | box() 76 | axis(side = 1) 77 | axis(side = 2, las = 1) 78 | points(dat[,input$xvar], dat[,input$yvar], 79 | pch = 21, col = 'white', bg = '#4878DFaa', 80 | lwd = 2, cex = 2) 81 | # regression line 82 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar]) 83 | abline(reg = reg, lwd = 3, col = '#e35a6d') 84 | 85 | }) 86 | 87 | # residual plot 88 | output$residualPlot <- renderPlot({ 89 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar]) 90 | # Render residuals against the x variable 91 | plot(dat[,input$xvar], reg$residuals, las = 1, 92 | main = 'Residual plot', xlab = input$xvar, 93 | ylab = 'residuals', col = '#ACB6F1', type = 'n') 94 | abline(h = 0, col = 'gray70', lwd = 2) 95 | points(dat[,input$xvar], reg$residuals, 96 | pch = 20, col = '#888888aa', cex = 2) 97 | }) 98 | } 99 | 100 | 101 | 102 | # Run the application 103 | shinyApp(ui = ui, server = server) 104 | 105 | -------------------------------------------------------------------------------- /apps/ch11-regression-strips/README.md:
-------------------------------------------------------------------------------- 1 | # Vertical Strips for Pearson's Height data 2 | 3 | This is a Shiny app that generates a scatter diagram to illustrate the 4 | distribution of values on the y-axis within a vertical strip. 5 | 6 | 7 | ## Motivation 8 | 9 | The goal is to provide a visual display of some of the concepts from 10 | __Statistics, Chapter 11: The R.M.S. Error for Regression__ (pages 180-201): 11 | 12 | - Looking at vertical strips 13 | - Using the normal curve inside a vertical strip 14 | 15 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger 16 | Purves (2007). Fourth Edition. Norton & Company. 17 | 18 | 19 | ## Data 20 | 21 | This app uses the Pearson and Lee's Height Data as described in 22 | [Pearson's Height Data](http://www.math.uah.edu/stat/data/Pearson.csv). 23 | The data is in the `pearson.csv` file, available in the `data/` folder of 24 | the github repository. 25 | 26 | 27 | ## How to run it? 28 | 29 | 30 | ```R 31 | library(shiny) 32 | 33 | # Easiest way is to use runGitHub 34 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch11-regression-strips") 35 | ``` 36 | -------------------------------------------------------------------------------- /apps/ch11-regression-strips/app.R: -------------------------------------------------------------------------------- 1 | # Title: Spread within vertical strips in regression 2 | # Description: this app uses Pearson's height data 3 | # Author: Gaston Sanchez 4 | 5 | library(shiny) 6 | source('helpers.R') 7 | 8 | # reading a couple of lines just to get the names of variables 9 | dat <- read.csv('../../data/pearson.csv') 10 | 11 | # Define UI for application that draws a histogram 12 | ui <- fluidPage( 13 | 14 | # Application title 15 | titlePanel("Pearson's Height Data Set"), 16 | 17 | # Define the sidebar with one input 18 | sidebarPanel( 19 | selectInput("xvar", "X-axis variable", 20 | choices = colnames(dat), 
selected = 'Father'), 21 | selectInput("yvar", "Y-axis variable", 22 | choices = colnames(dat), selected = 'Son'), 23 | checkboxInput('reg_line', label = strong('Regression line')), 24 | sliderInput("cex", 25 | label = "Size of points", 26 | min = 0, max = 3, value = 1.5, step = 0.1), 27 | #checkboxInput('point_avgs', label = strong('Point of Averages')), 28 | #checkboxInput('sd_line', label = strong('SD line')), 29 | #checkboxInput('sd_guides', label = strong('SD guides')), 30 | sliderInput("center", 31 | label = "x location", 32 | min = 60, 33 | max = 76, 34 | value = 70, step = 0.25), 35 | sliderInput("width", 36 | label = "width", 37 | min = 0, 38 | max = 4, 39 | value = 0, step = 0.1), 40 | hr(), 41 | helpText('Correlation:'), 42 | verbatimTextOutput("correlation") 43 | ), 44 | 45 | # Show a plot of the generated distribution 46 | mainPanel( 47 | plotOutput("datPlot"), 48 | plotOutput("histogram") 49 | ) 50 | ) 51 | 52 | 53 | # Define server logic required to draw a histogram 54 | server <- function(input, output) { 55 | 56 | # Correlation 57 | output$correlation <- renderPrint({ 58 | cor(dat[,input$xvar], dat[,input$yvar]) 59 | }) 60 | 61 | # Fill in the spot we created for a plot 62 | output$datPlot <- renderPlot({ 63 | # Render scatterplot 64 | plot(dat[,input$xvar], dat[,input$yvar], 65 | main = 'scatter diagram', type = 'n', axes = FALSE, 66 | xlab = input$xvar, ylab = input$yvar) 67 | box() 68 | axis(side = 1) 69 | axis(side = 2, las = 1) 70 | points(dat[,input$xvar], dat[,input$yvar], 71 | pch = 21, col = 'white', bg = '#777777aa', 72 | lwd = 2, cex = input$cex) 73 | # vertical strips 74 | abline(v = c(input$center - input$width, input$center + input$width), 75 | lty = 1, lwd = 3, col = '#5A6DE3') 76 | # Regression line 77 | if (input$reg_line) { 78 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar]) 79 | abline(reg = reg, lwd = 3, col = '#e35a6d') 80 | } 81 | }) 82 | 83 | # histogram 84 | output$histogram <- renderPlot({ 85 | xmin <- input$center - 
input$width 86 | xmax <- input$center + input$width 87 | child <- dat$Son[dat$Father >= xmin & dat$Father <= xmax] 88 | hist(child, main = '', col = '#ACB6FF', las = 1) 89 | }) 90 | 91 | } 92 | 93 | # Run the application 94 | shinyApp(ui = ui, server = server) 95 | 96 | -------------------------------------------------------------------------------- /apps/ch11-regression-strips/helpers.R: -------------------------------------------------------------------------------- 1 | line_equation <- function(x1, y1, x2, y2) { 2 | slope <- (y2 - y1) / (x2 - x1) 3 | intercept <- y1 - slope*x1 4 | list(intercept = intercept, slope = slope) 5 | } 6 | 7 | 8 | averages <- function(x, y, breaks = 5) { 9 | x_cut<- cut(x, breaks = breaks) 10 | y_averages <- as.vector(tapply(y, x_cut, mean)) 11 | x_boundaries <- gsub('\\(', '', levels(x_cut)) 12 | x_boundaries <- gsub('\\]', '', x_boundaries) 13 | x_boundaries <- strsplit(x_boundaries, ',') 14 | x1 <- as.numeric(sapply(x_boundaries, function(u) u[1])) 15 | x2 <- as.numeric(sapply(x_boundaries, function(u) u[2])) 16 | x_midpoints <- x1 + (x2 - x1) / 2 17 | list(x = x_midpoints, y = y_averages) 18 | } 19 | 20 | -------------------------------------------------------------------------------- /apps/ch16-chance-error/README.md: -------------------------------------------------------------------------------- 1 | # Chance Error 2 | 3 | This is a Shiny app that illustrates the concept of chance error when simulating tossing a coin a given number of times. 4 | 5 | 6 | ## Motivation 7 | 8 | The goal is to provide a visual display motivated by John Kerrich's coin-tossing experiment __Statistics, Chapter 16: The Law of Averages__ 9 | 10 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 11 | 12 | 13 | ## Data 14 | 15 | The data simulates tossing a coin using the random binomial generator function `rbinom()`. 
The input parameters are the number of tosses, and optionally, the probability of heads. 16 | 17 | 18 | ## Plot 19 | 20 | There are two options for the displayed plot: 21 | 22 | 1. shows the chance error (i.e. number of heads minus half the number of tosses) on the y-axis, and the number of tosses on the x-axis. 23 | 2. shows the percent error (i.e. proportion of heads) on the y-axis, and the number of tosses on the x-axis. 24 | 25 | 26 | ## How to run it? 27 | 28 | ```R 29 | library(shiny) 30 | 31 | # Easiest way is to use runGitHub 32 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch16-chance-error") 33 | ``` 34 | -------------------------------------------------------------------------------- /apps/ch16-chance-error/app.R: -------------------------------------------------------------------------------- 1 | # Title: Chance Error and Percent Error 2 | # Description: Chance error when tossing a coin (based on John Kerrich's) 3 | # Chapter 16: The Law of Averages, p 275-278 4 | # Author: Gaston Sanchez 5 | 6 | library(shiny) 7 | 8 | # Define UI for application that draws a histogram 9 | ui <- fluidPage( 10 | 11 | # Give the page a title 12 | titlePanel("Coin Tossing Experiment"), 13 | 14 | # Generate a row with a sidebar 15 | sidebarLayout( 16 | 17 | # Define the sidebar with one input 18 | sidebarPanel( 19 | numericInput("seed", label = "Random Seed:", 12345, 20 | min = 10000, max = 50000, step = 1), 21 | sliderInput("chance", label = "Chance of heads:", 22 | min = 0, max = 1, value = 0.5, step = 0.01), 23 | sliderInput("tosses", label = "Number of tosses:", 24 | min = 100, max = 10000, value = 3000, step = 50), 25 | radioButtons("error", label = "Display", 26 | choices = list("Chance error" = 1, 27 | "Percent error" = 2), 28 | selected = 2), 29 | hr(), 30 | helpText('Total number of heads:'), 31 | verbatimTextOutput("num_heads"), 32 | helpText('Proportion of heads:'), 33 | verbatimTextOutput("prop_heads") 34 | ), 35 | 36 | # Create a spot for 
the barplot 37 | mainPanel( 38 | plotOutput("chancePlot") 39 | ) 40 | ) 41 | ) 42 | 43 | 44 | # Define server logic required to draw the line plot 45 | server <- function(input, output) { 46 | 47 | seed <- reactive({ 48 | input$seed 49 | }) 50 | tosses <- reactive({ 51 | input$tosses 52 | }) 53 | chance <- reactive({ 54 | input$chance 55 | }) 56 | 57 | # Number of heads 58 | output$num_heads <- renderPrint({ 59 | set.seed(seed()) 60 | flips <- rbinom(n = tosses(), 1, prob = chance()) 61 | sum(flips) 62 | }) 63 | 64 | # Proportion of heads (reported as a percentage) 65 | output$prop_heads <- renderPrint({ 66 | set.seed(seed()) 67 | flips <- rbinom(n = tosses(), 1, prob = chance()) 68 | round(100 * sum(flips) / tosses(), 2) 69 | }) 70 | 71 | # Fill in the spot we created for a plot 72 | output$chancePlot <- renderPlot({ 73 | set.seed(input$seed) 74 | tosses <- input$tosses 75 | flips <- rbinom(n = tosses, 1, prob = chance()) 76 | num_heads <- cumsum(flips) 77 | prop_heads <- (num_heads / 1:tosses) 78 | num_tosses <- 1:tosses 79 | 80 | # Render a line plot of the chance error or the proportion of heads 81 | difference <- num_heads[num_tosses] - (chance() * num_tosses) 82 | proportion <- prop_heads[num_tosses] 83 | if (input$error == 1) { 84 | plot(num_tosses, difference, 85 | col = '#627fe2', type = 'l', lwd = 2, 86 | xlab = "Number of tosses", 87 | ylab = '# of heads - expected # of heads', 88 | axes = FALSE, main = 'Chance Error: # successes - # expected') 89 | abline(h = 0, col = '#88888855', lwd = 2, lty = 2) 90 | axis(side = 2, las = 1) 91 | } else { 92 | plot(num_tosses, proportion, ylim = c(0, 1), 93 | col = '#627fe2', type = 'l', lwd = 2, 94 | xlab = 'Number of tosses', 95 | ylab = 'Proportion of heads', 96 | axes = FALSE, main = 'Percent Error: % successes - % expected') 97 | abline(h = chance(), col = '#88888855', lwd = 2, lty = 2) 98 | axis(side = 2, las = 1, at = seq(0, 1, 0.1)) 99 | } 100 | axis(side = 1) 101 | }) 102 | 103 | } 104 | 105 | # Run the application 106 | shinyApp(ui = ui, server = server) 107 | 108 | 
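The two displays in the app above can be reproduced outside Shiny. The following is a minimal standalone sketch, not part of `app.R`: it assumes 3000 tosses, a 0.5 chance of heads and the seed 12345 (the app's defaults), and it computes the percent error as the chance error relative to the number of tosses, whereas the app itself plots the running proportion of heads. The names `chance_error`, `percent_error` and `checkpoints` are illustrative only.

```R
# Standalone sketch (assumed values, not taken from app.R's reactive inputs)
set.seed(12345)   # same default seed as the app
tosses <- 3000    # assumed number of tosses
chance <- 0.5     # assumed chance of heads

# simulate the tosses: 1 = heads, 0 = tails
flips <- rbinom(n = tosses, size = 1, prob = chance)
num_tosses <- seq_len(tosses)
num_heads <- cumsum(flips)

# chance error: observed number of heads minus expected number of heads
chance_error <- num_heads - chance * num_tosses

# percent error: the same difference relative to the number of tosses
percent_error <- 100 * chance_error / num_tosses

# compare both errors early and late in the sequence
checkpoints <- c(100, 1000, 3000)
data.frame(
  tosses = checkpoints,
  chance_error = chance_error[checkpoints],
  percent_error = round(percent_error[checkpoints], 2)
)
```

Changing the seed or the number of tosses shows the pattern discussed in Chapter 16: the chance error tends to grow in absolute size, while the percent error tends to shrink.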
-------------------------------------------------------------------------------- /apps/ch17-demere-games/README.md: -------------------------------------------------------------------------------- 1 | # Expected value and Standard Error with De Mere's Games 2 | 3 | This is a Shiny app that illustrates the concept of Expected Value and Standard Error 4 | when simulating De Mere's games (100 times by default). 5 | 6 | 7 | ## Motivation 8 | 9 | The goal is to provide a visual display for the concepts in __Statistics, Chapter 17: The Expected Value and Standard Error.__ 10 | 11 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 12 | 13 | 14 | ## Data 15 | 16 | The app allows you to simulate two main scenarios: 17 | 18 | 1. __Rolling a fair die 4 times.__ This is actually done by drawing 4 tickets out of a box with six tickets. 19 | The structure of the box consists of one ticket `1`, and five tickets `0`. 20 | 2. __Rolling a pair of dice 24 times.__ This is actually done by drawing 24 tickets out of a box with 36 tickets. 21 | The structure of the box consists of 1 ticket `1`, and 35 tickets `0`. 22 | 23 | 24 | ## Plots 25 | 26 | There are three displayed graphs. 27 | 28 | 1. A probability distribution (theoretical probabilities for the number of tickets `1`). 29 | 2. An empirical Pareto chart (cumulative distribution) with the proportion of tickets `1` when 30 | playing the game a given number of times. 31 | 3. A line chart with the empirical (net) gain when playing the game a given number of times. 32 | 33 | 34 | ## How to run it? 
35 | 36 | ```R 37 | library(shiny) 38 | 39 | # Easiest way is to use runGitHub 40 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch17-demere-games") 41 | ``` 42 | -------------------------------------------------------------------------------- /apps/ch17-demere-games/app.R: -------------------------------------------------------------------------------- 1 | # Title: More Expected Value and Standard Error 2 | # Description: Simulation of De Mere's rolling dice games using 3 | # a box model for the number of "aces" (ticket 1). 4 | # Author: Gaston Sanchez 5 | 6 | library(shiny) 7 | 8 | # Define UI for application that draws a histogram 9 | ui <- fluidPage( 10 | 11 | # Give the page a title 12 | titlePanel("De Mere's games"), 13 | 14 | # Generate a row with a sidebar 15 | sidebarLayout( 16 | 17 | # Define the sidebar with one input 18 | sidebarPanel( 19 | fluidRow( 20 | column(5, 21 | numericInput("tickets1", "# Tickets 1", 1, 22 | min = 1, max = 35, step = 1)), 23 | column(5, 24 | numericInput("tickets0", "# Tickets 0", 5, 25 | min = 1, max = 35, step = 1)) 26 | ), 27 | helpText('Avg of box, and SD of box'), 28 | verbatimTextOutput("avg_sd_box"), 29 | numericInput("draws", label = "Number of Draws:", value = 4, 30 | min = 1, max = 100, step = 1), 31 | helpText('Expected Value and SE'), 32 | verbatimTextOutput("ev_se"), 33 | hr(), 34 | sliderInput("reps", label = "Number of games:", 35 | min = 1, max = 5000, value = 100, step = 1), 36 | helpText('Actual gain'), 37 | verbatimTextOutput("gain"), 38 | numericInput("seed", label = "Random Seed:", 12345, 39 | min = 10000, max = 50000, step = 1) 40 | ), 41 | 42 | # Create a spot for the barplot 43 | mainPanel( 44 | tabsetPanel(type = "tabs", 45 | tabPanel("Sum", plotOutput("sumPlot")), 46 | tabPanel("Pareto", plotOutput("paretoPlot")), 47 | tabPanel("Games", plotOutput("gamesPlot")) 48 | ) 49 | ) 50 | ) 51 | ) 52 | 53 | 54 | # Define server logic required to draw a histogram 55 | server <- 
function(input, output) { 56 | 57 | tickets <- reactive({ 58 | tickets <- c(rep(1, input$tickets1), rep(0, input$tickets0)) 59 | }) 60 | 61 | avg_box <- reactive({ 62 | mean(tickets()) 63 | }) 64 | 65 | sd_box <- reactive({ 66 | total <- input$tickets1 + input$tickets0 67 | sqrt((input$tickets1 / total) * (input$tickets0 / total)) 68 | }) 69 | 70 | sum_draws <- reactive({ 71 | set.seed(input$seed) 72 | samples <- 1:input$reps 73 | for (i in 1:input$reps) { 74 | samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE)) 75 | } 76 | samples 77 | }) 78 | 79 | # Average and SD of box 80 | output$avg_sd_box <- renderPrint({ 81 | cat(avg_box(), ", ", sd_box(), sep = '') 82 | }) 83 | 84 | # Expected Value, and Standard Error 85 | output$ev_se <- renderPrint({ 86 | ev = input$draws * avg_box() 87 | se = sqrt(input$draws) * sd_box() 88 | cat(ev, ", ", se, sep = '') 89 | }) 90 | 91 | # Probability Histogram 92 | output$sumPlot <- renderPlot({ 93 | # Render a barplot 94 | total_tickets <- input$tickets1 + input$tickets0 95 | prob_ticket1 <- input$tickets1 / total_tickets 96 | probabilities <- dbinom(0:input$draws, size = input$draws, prob_ticket1) 97 | barplot(round(probabilities, 4), border = NA, las = 1, 98 | names.arg = 0:input$draws, 99 | xlab = paste("Number of tickets 1"), 100 | ylab = 'Probability', 101 | main = paste("Probability Distribution\n", 102 | "(# tickets 1)")) 103 | abline(h = 0.5, col = '#EC5B5B99', lty = 2, lwd = 1.4) 104 | }) 105 | 106 | # Pareto chart: cumulative percentage of draws 107 | output$paretoPlot <- renderPlot({ 108 | # Render a barplot 109 | freqs_draws <- table(sum_draws()) / input$reps 110 | freq_aux <- barplot(freqs_draws, plot = FALSE) 111 | barplot(freqs_draws, 112 | ylim = c(0, 1.1), 113 | border = NA, las = 1, 114 | xlab = paste('Number of tickets 1 in', input$reps, 'games'), 115 | ylab = 'Percentage', 116 | main = paste("Empirical Cumulative Relative Frequency\n", 117 | "(at least one ticket 1)")) 118 | abline(h = 0.5, 
col = '#EC5B5B99', lty = 2, lwd = 1.4) 119 | lines(freq_aux[-1], cumsum(freqs_draws[-1]), lwd = 3, col = "gray60") 120 | points(freq_aux[-1], cumsum(freqs_draws[-1]), pch=19, col="gray30") 121 | text(freq_aux[-1], cumsum(freqs_draws[-1]), 122 | round(cumsum(freqs_draws[-1]), 3), pos = 3) 123 | }) 124 | 125 | # Plot with gains 126 | output$gamesPlot <- renderPlot({ 127 | results <- rep(-1, input$reps) 128 | results[sum_draws() > 0] <- 1 129 | plot(1:input$reps, cumsum(results), type = "n", axes = FALSE, 130 | xlab = paste('Number of tickets 1 in', input$reps, 'games'), 131 | ylab = "Gained amount", 132 | main = "Empirical Gain") 133 | abline(h = 0, col = '#EC5B5B99', lty = 2, lwd = 1.4) 134 | axis(side = 1) 135 | axis(side = 2, las = 1, pos = 0) 136 | lines(1:input$reps, cumsum(results), lwd = 1.5) 137 | }) 138 | 139 | # actual gain 140 | output$gain <- renderPrint({ 141 | results <- rep(-1, input$reps) 142 | results[sum_draws() > 0] <- 1 143 | sum(results) 144 | }) 145 | } 146 | 147 | # Run the application 148 | shinyApp(ui = ui, server = server) 149 | 150 | -------------------------------------------------------------------------------- /apps/ch17-expected-value-std-error/README.md: -------------------------------------------------------------------------------- 1 | # Expected value and Standard Error 2 | 3 | This is a Shiny app that illustrates the concept of Expected Value and Standard Error when simulating rolling a die (5 times by default). 4 | 5 | 6 | ## Motivation 7 | 8 | The goal is to provide a visual display for the concepts in __Statistics, Chapter 17: The Expected Value and Standard Error.__ 9 | 10 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 11 | 12 | 13 | ## Data 14 | 15 | The data simulates rolling a die 5 times by default. 16 | 17 | 18 | ## Plot 19 | 20 | A bar-chart of frequencies for the sum of draws is displayed. 21 | 22 | 23 | ## How to run it? 
24 | 25 | ```R 26 | library(shiny) 27 | 28 | # Easiest way is to use runGitHub 29 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch17-expected-value-std-error") 30 | ``` 31 | -------------------------------------------------------------------------------- /apps/ch17-expected-value-std-error/app.R: -------------------------------------------------------------------------------- 1 | # Title: Expected Value and Standard Error 2 | # Description: EV and SE when rolling a die 5 times 3 | # Chapter 17: The EV and SE, p 288-296 4 | # Author: Gaston Sanchez 5 | 6 | library(shiny) 7 | source("helpers.R") 8 | 9 | # Define UI for application that draws a histogram 10 | ui <- fluidPage( 11 | 12 | # Give the page a title 13 | titlePanel("Rolling a Die"), 14 | 15 | # Generate a row with a sidebar 16 | sidebarLayout( 17 | 18 | # Define the sidebar with one input 19 | sidebarPanel( 20 | numericInput("dice", label = "Number of dice:", 5, 21 | min = 1, max = 10, step = 1), 22 | numericInput("seed", label = "Random Seed:", 12330, 23 | min = 10000, max = 50000, step = 1), 24 | sliderInput("reps", label = "Number of repetitions:", 25 | min = 100, max = 10000, value = 100, step= 10), 26 | hr(), 27 | helpText('Average of sums:'), 28 | verbatimTextOutput("num_heads"), 29 | helpText('SD of sums:'), 30 | verbatimTextOutput("prop_heads") 31 | ), 32 | 33 | # Create a spot for the barplot 34 | mainPanel( 35 | plotOutput("chancePlot") 36 | ) 37 | ) 38 | ) 39 | 40 | 41 | # Define server logic required to draw a histogram 42 | server <- function(input, output) { 43 | 44 | # Empirical average of sum of draws 45 | output$num_heads <- renderPrint({ 46 | set.seed(input$seed) 47 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls) 48 | # avg of sums 49 | mean(total_points) 50 | }) 51 | 52 | # Empirical SD of sum of draws 53 | output$prop_heads <- renderPrint({ 54 | set.seed(input$seed) 55 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls) 56 | # avg of 
sums 57 | sd(total_points) * sqrt((input$reps - 1)/input$reps) 58 | }) 59 | 60 | # Fill in the spot we created for a plot 61 | output$chancePlot <- renderPlot({ 62 | set.seed(input$seed) 63 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls) 64 | # put in relative terms 65 | prop_points <- 100 * table(total_points) / input$reps 66 | ymax <- find_ymax(max(prop_points), 2) 67 | # Render a barplot 68 | barplot(prop_points, las = 1, border = "gray40", 69 | space = 0, ylim = c(0, ymax), 70 | main = sprintf("%s Repetitions", input$reps)) 71 | }) 72 | } 73 | 74 | # Run the application 75 | shinyApp(ui = ui, server = server) 76 | 77 | -------------------------------------------------------------------------------- /apps/ch17-expected-value-std-error/helpers.R: -------------------------------------------------------------------------------- 1 | # helper functions to simulate rolling a die 2 | # and adding the number of spots 3 | 4 | # roll one die 5 | roll_die <- function(times = 1) { 6 | die <- 1:6 7 | sample(die, times, replace = TRUE) 8 | } 9 | 10 | 11 | # roll a pair of dice 12 | roll_pair <- function() { 13 | roll_die(2) 14 | } 15 | 16 | # sum of spots 17 | sum_rolls <- function(times = 1) { 18 | sum(roll_die(times)) 19 | } 20 | 21 | # product of numbers 22 | prod_rolls <- function(times = 1) { 23 | prod(roll_die(times)) 24 | } 25 | 26 | # check whether 'x' is multiple of 'num' 27 | is_multiple <- function(x, num) { 28 | x %% num == 0 29 | } 30 | 31 | # find the y-max value for ylim in barplot() 32 | find_ymax <- function(x, num) { 33 | if (is_multiple(x, num)) { 34 | return(max(x)) 35 | } else { 36 | return(num * ((x %/% num) + 1)) 37 | } 38 | } 39 | -------------------------------------------------------------------------------- /apps/ch18-coin-tossing/README.md: -------------------------------------------------------------------------------- 1 | # Tossing Coins 2 | 3 | This is a Shiny app that generates a probability histogram when tossing 4 | a coin a 
specified number of times. 5 | 6 | 7 | ## Motivation 8 | 9 | The goal is to provide a visual display similar to the probability histograms 10 | in chapter 18 of "Statistics". 11 | 12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 13 | 14 | 15 | ## Data 16 | 17 | The app computes the probabilities when tossing a coin a specified number of times. 18 | The input parameters are the number of tosses, and the chance of heads. 19 | 20 | 21 | ## Plot 22 | 23 | The produced plot is a probability histogram. 24 | 25 | 26 | ## How to run it? 27 | 28 | ```R 29 | library(shiny) 30 | 31 | # Easiest way is to use runGitHub 32 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-coin-tossing") 33 | ``` 34 | -------------------------------------------------------------------------------- /apps/ch18-coin-tossing/app.R: -------------------------------------------------------------------------------- 1 | # Title: Probability histograms 2 | # Description: Probability histograms for the number of heads 3 | # in "n" tosses of a coin 4 | # Chapter 18: Normal Approx, p 316 5 | # Author: Gaston Sanchez 6 | 7 | library(shiny) 8 | 9 | # Define UI for application that draws a histogram 10 | ui <- fluidPage( 11 | 12 | # Give the page a title 13 | titlePanel("Tossing Coins"), 14 | 15 | # Generate a row with a sidebar 16 | sidebarLayout( 17 | 18 | # Define the sidebar with one input 19 | sidebarPanel( 20 | sliderInput("tosses", label = "Number of tosses:", 21 | min = 1, max = 500, value = 100, step = 1), 22 | sliderInput("chance", label = "Chance of heads", 23 | min = 0, max = 1, value = 0.5, step= 0.05), 24 | hr(), 25 | helpText('Expected Value:'), 26 | verbatimTextOutput("exp_value"), 27 | helpText('Standard Error'), 28 | verbatimTextOutput("std_error") 29 | ), 30 | 31 | # Create a spot for the barplot 32 | mainPanel( 33 | plotOutput("chancePlot") 34 | ) 35 | ) 36 | ) 37 | 38 | 39 | # Define server 
logic required to draw a histogram 40 | server <- function(input, output) { 41 | 42 | # Expected Value 43 | output$exp_value <- renderPrint({ 44 | input$tosses * input$chance 45 | }) 46 | 47 | # Standard Error 48 | output$std_error <- renderPrint({ 49 | sqrt(input$tosses * input$chance * (1 - input$chance)) 50 | }) 51 | 52 | # Fill in the spot we created for a plot 53 | output$chancePlot <- renderPlot({ 54 | probs <- 100 * dbinom(0:input$tosses, 55 | size = input$tosses, 56 | prob = input$chance) 57 | 58 | exp_value <- input$tosses * input$chance 59 | std_error <- sqrt(input$tosses * input$chance * (1 - input$chance)) 60 | 61 | below3se <- (exp_value - 3 * std_error) 62 | above3se <- (exp_value + 3 * std_error) 63 | 64 | from <- floor(below3se) + 1 65 | to <- ceiling(above3se) + 1 66 | 67 | if (input$tosses >= 10 & from > 0) { 68 | xpos <- barplot(probs[from:to], plot = FALSE) 69 | # Render probability histogram as a barplot 70 | op = par(mar = c(6.5, 4.5, 4, 2)) 71 | barplot(probs[from:to], axes = FALSE, col = "gray70", 72 | names.arg = (from-1):(to-1), border = NA, 73 | ylim = c(0, ceiling(max(probs))), 74 | ylab = "probability (%)", 75 | main = sprintf("Probability Histogram\n %s Tosses", 76 | input$tosses)) 77 | axis(side = 2, las = 1) 78 | axis(side = 1, line = 3, 79 | at = seq(xpos[1], xpos[length(xpos)], length.out = 7), 80 | labels = seq(-3, 3, 1)) 81 | mtext("Standard Units", side = 1, line = 5.5) 82 | par(op) 83 | } else { 84 | barplot(probs, axes = FALSE, col = "gray70", 85 | names.arg = 0:input$tosses, border = NA, 86 | ylim = c(0, ceiling(max(probs))), 87 | ylab = "probability (%)", 88 | main = sprintf("Probability Histogram\n %s Tosses", 89 | input$tosses)) 90 | axis(side = 2, las = 1) 91 | } 92 | }) 93 | } 94 | 95 | # Run the application 96 | shinyApp(ui = ui, server = server) 97 | 98 | -------------------------------------------------------------------------------- /apps/ch18-roll-dice-product/README.md: 
-------------------------------------------------------------------------------- 1 | # Rolling Dice: Product of Points 2 | 3 | This is a Shiny app that generates empirical histograms when simulating 4 | rolling dice and finding the product of spots. 5 | 6 | 7 | ## Motivation 8 | 9 | The goal is to provide a visual display similar to the empirical and 10 | probability histograms shown on page 313 of "Statistics", chapter 18. 11 | 12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 13 | 14 | 15 | ## Data 16 | 17 | The app simulates rolling a pair of dice by default (the user can choose between 18 | one and 10 dice). The input parameters are the number of dice, the random seed, and 19 | the number of repetitions. 20 | 21 | 22 | ## Plots 23 | 24 | There are two tabs: 25 | 26 | 1. An empirical histogram. 27 | 2. A probability histogram (probability distribution). 28 | 29 | 30 | ## How to run it? 31 | 32 | ```R 33 | library(shiny) 34 | 35 | # Easiest way is to use runGitHub 36 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-roll-dice-product") 37 | ``` 38 | -------------------------------------------------------------------------------- /apps/ch18-roll-dice-product/app.R: -------------------------------------------------------------------------------- 1 | # Title: Roll dice and multiply spots 2 | # Description: Empirical vs Probability Histograms 3 | # Chapter 18: Probability Histograms, page 313 4 | # Author: Gaston Sanchez 5 | 6 | library(shiny) 7 | source("helpers.R") 8 | 9 | # Define UI for application that draws a histogram 10 | ui <- fluidPage( 11 | 12 | # Give the page a title 13 | titlePanel("Rolling Dice: Product"), 14 | 15 | # Generate a row with a sidebar 16 | sidebarLayout( 17 | 18 | # Define the sidebar with one input 19 | sidebarPanel( 20 | numericInput("dice", label = "Number of dice:", 2, 21 | min = 1, max = 10, step = 1), 22 | numericInput("seed", 
label = "Random Seed:", 12330, 23 | min = 10000, max = 50000, step = 1), 24 | sliderInput("reps", label = "Number of repetitions:", 25 | min = 100, max = 10000, value = 100, step= 10) 26 | ), 27 | 28 | # Create tabs for plots 29 | mainPanel( 30 | tabsetPanel(type = "tabs", 31 | tabPanel("Empirical", plotOutput("empiricalPlot")), 32 | tabPanel("Probability", plotOutput("probabilityPlot")) 33 | ) 34 | ) 35 | ) 36 | ) 37 | 38 | 39 | # Define server logic required to draw a histogram 40 | server <- function(input, output) { 41 | 42 | # Empirical average of sum of draws 43 | output$num_heads <- renderPrint({ 44 | set.seed(input$seed) 45 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls) 46 | # avg of sums 47 | mean(total_points) 48 | }) 49 | 50 | # Empirical SD of product of draws 51 | output$prop_heads <- renderPrint({ 52 | set.seed(input$seed) 53 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls) 54 | # avg of sums 55 | sd(total_points) * sqrt((input$reps - 1)/input$reps) 56 | }) 57 | 58 | # Empirical Histogram 59 | output$empiricalPlot <- renderPlot({ 60 | set.seed(input$seed) 61 | total_points <- sapply(rep(input$dice, input$reps), prod_rolls) 62 | # put in relative terms 63 | prop_points <- 100 * table(total_points) / input$reps 64 | ymax <- find_ymax(max(prop_points), 2) 65 | # Render a barplot 66 | # Frequencies of products 67 | freq <- numeric((6^input$dice)) 68 | freq[1:(6^input$dice) %in% names(prop_points)] <- prop_points 69 | names(freq) <- 1:(6^input$dice) 70 | 71 | # Render a barplot 72 | barplot(freq, space = 0, las = 1, border = "gray40", 73 | cex.names = 0.8, ylim = c(0, ymax), 74 | main = sprintf("%s Repetitions", input$reps)) 75 | }) 76 | 77 | # Probability Histogram 78 | output$probabilityPlot <- renderPlot({ 79 | outcomes <- multiply_spots(input$dice) 80 | freq <- rep(0, 6^input$dice) 81 | freq[as.numeric(names(outcomes))] <- outcomes 82 | # Render a barplot 83 | barplot(100 * freq, names.arg = 1:6^input$dice, 84 | las = 
1, border = "gray40", 85 | space = 0, 86 | xlab = "Product of spots", 87 | ylab = "Chance (%)", 88 | main = "Probability Histogram") 89 | }) 90 | 91 | } 92 | 93 | # Run the application 94 | shinyApp(ui = ui, server = server) 95 | 96 | -------------------------------------------------------------------------------- /apps/ch18-roll-dice-product/helpers.R: -------------------------------------------------------------------------------- 1 | # Helper functions to simulate rolling a die 2 | # and adding-or-multiplying the number of spots 3 | 4 | # roll one die 5 | roll_die <- function(times = 1) { 6 | die <- 1:6 7 | sample(die, times, replace = TRUE) 8 | } 9 | 10 | 11 | # roll a pair of dice 12 | roll_pair <- function() { 13 | roll_die(2) 14 | } 15 | 16 | # sum of spots 17 | sum_rolls <- function(times = 1) { 18 | sum(roll_die(times)) 19 | } 20 | 21 | # product of numbers 22 | prod_rolls <- function(times = 1) { 23 | prod(roll_die(times)) 24 | } 25 | 26 | # check whether 'x' is multiple of 'num' 27 | is_multiple <- function(x, num) { 28 | x %% num == 0 29 | } 30 | 31 | # find the y-max value for ylim in barplot() 32 | find_ymax <- function(x, num) { 33 | if (is_multiple(x, num)) { 34 | return(max(x)) 35 | } else { 36 | return(num * ((x %/% num) + 1)) 37 | } 38 | } 39 | 40 | # reps <- 100 41 | # total_points <- sapply(rep(2, reps), sum_rolls) 42 | # prop_points <- 100 * table(total_points) / reps 43 | # barplot(prop_points, las = 1, 44 | # space = 0, ylim = c(0, 30), 45 | # main = sprintf("%s Repetitions", reps)) 46 | 47 | 48 | 49 | 50 | # function that multiplies each given result by the spots of one more die 51 | multiply_rolls <- function(given) { 52 | results <- rep(0, length(given) * 6) 53 | aux <- 1 54 | for (i in 1:length(given)) { 55 | for (j in 1:6) { 56 | results[aux] <- given[i] * j 57 | aux <- aux + 1 58 | } 59 | } 60 | results 61 | } 62 | 63 | # function that computes theoretical probabilities 64 | # for the product of spots when rolling "k" dice 65 | multiply_spots <- 
function(num_dice) { 66 | # just one die 67 | if (num_dice == 1) { 68 | outcomes <- table(1:6) / (6^num_dice) 69 | } else { 70 | # two or more dice 71 | current <- 1:6 72 | for (k in 2:num_dice) { 73 | current <- multiply_rolls(current) 74 | } 75 | outcomes <- table(current) / (6^num_dice) 76 | } 77 | outcomes 78 | } 79 | 80 | # multiply_spots(1) 81 | # multiply_spots(2) 82 | # multiply_spots(3) 83 | 84 | 85 | 86 | -------------------------------------------------------------------------------- /apps/ch18-roll-dice-sum/README.md: -------------------------------------------------------------------------------- 1 | # Rolling Dice: Sum of Points 2 | 3 | This is a Shiny app that generates empirical histograms when simulating 4 | rolling dice and finding the total number of spots. 5 | 6 | 7 | ## Motivation 8 | 9 | The goal is to provide a visual display similar to the empirical and 10 | probability histograms shown on page 311 of "Statistics", chapter 18. 11 | 12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 13 | 14 | 15 | ## Data 16 | 17 | The app simulates rolling a pair of dice by default (the user can choose between 18 | one and 10 dice). The input parameters are the number of dice, the random seed, and 19 | the number of repetitions. 20 | 21 | 22 | ## Plots 23 | 24 | There are two tabs: 25 | 26 | 1. An empirical histogram. 27 | 2. A probability histogram (probability distribution). 28 | 29 | 30 | ## How to run it? 
31 | 32 | ```R 33 | library(shiny) 34 | 35 | # Easiest way is to use runGitHub 36 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-roll-dice-sum") 37 | ``` 38 | -------------------------------------------------------------------------------- /apps/ch18-roll-dice-sum/app.R: -------------------------------------------------------------------------------- 1 | # Title: Roll dice and add spots 2 | # Description: Empirical vs Probability Histograms 3 | # Chapter 18: Probability Histograms, page 311 4 | # Author: Gaston Sanchez 5 | 6 | library(shiny) 7 | source("helpers.R") 8 | 9 | # Define UI for application that draws a histogram 10 | ui <- fluidPage( 11 | 12 | # Give the page a title 13 | titlePanel("Rolling Dice: Sum"), 14 | 15 | # Generate a row with a sidebar 16 | sidebarLayout( 17 | 18 | # Define the sidebar with one input 19 | sidebarPanel( 20 | numericInput("dice", label = "Number of dice:", 2, 21 | min = 1, max = 10, step = 1), 22 | numericInput("seed", label = "Random Seed:", 12330, 23 | min = 10000, max = 50000, step = 1), 24 | sliderInput("reps", label = "Number of repetitions:", 25 | min = 100, max = 10000, value = 100, step= 10) 26 | #hr(), 27 | #helpText('Average of sums:'), 28 | #verbatimTextOutput("num_heads"), 29 | #helpText('SD of sums:'), 30 | #verbatimTextOutput("prop_heads") 31 | ), 32 | 33 | # Create tabs for plots 34 | mainPanel( 35 | tabsetPanel(type = "tabs", 36 | tabPanel("Empirical", plotOutput("empiricalPlot")), 37 | tabPanel("Probability", plotOutput("probabilityPlot")) 38 | ) 39 | ) 40 | ) 41 | ) 42 | 43 | 44 | # Define server logic required to draw a histogram 45 | server <- function(input, output) { 46 | 47 | # Empirical average of sum of draws 48 | output$num_heads <- renderPrint({ 49 | set.seed(input$seed) 50 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls) 51 | # avg of sums 52 | mean(total_points) 53 | }) 54 | 55 | # Empirical SD of sum of draws 56 | output$prop_heads <- renderPrint({ 57 | 
set.seed(input$seed) 58 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls) 59 | # avg of sums 60 | sd(total_points) * sqrt((input$reps - 1)/input$reps) 61 | }) 62 | 63 | # Empirical Histogram 64 | output$empiricalPlot <- renderPlot({ 65 | set.seed(input$seed) 66 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls) 67 | # put in relative terms 68 | prop_points <- 100 * table(total_points) / input$reps 69 | ymax <- find_ymax(max(prop_points), 2) 70 | # Render a barplot 71 | barplot(prop_points, las = 1, border = "gray40", 72 | space = 0, ylim = c(0, ymax), 73 | xlab = "Number of spots", 74 | ylab = "Relative Frequency", 75 | main = sprintf("%s Repetitions", input$reps)) 76 | }) 77 | 78 | # Probability Histogram 79 | output$probabilityPlot <- renderPlot({ 80 | outcomes <- sum_spots(input$dice) 81 | # Render a barplot 82 | barplot(100 * outcomes, 83 | las = 1, border = "gray40", 84 | space = 0, 85 | xlab = "Number of spots", 86 | ylab = "Chance (%)", 87 | main = "Probability Histogram") 88 | }) 89 | 90 | } 91 | 92 | # Run the application 93 | shinyApp(ui = ui, server = server) 94 | 95 | -------------------------------------------------------------------------------- /apps/ch18-roll-dice-sum/helpers.R: -------------------------------------------------------------------------------- 1 | # Helper functions to simulate rolling a die 2 | # and adding-or-multiplying the number of spots 3 | 4 | # roll one die 5 | roll_die <- function(times = 1) { 6 | die <- 1:6 7 | sample(die, times, replace = TRUE) 8 | } 9 | 10 | 11 | # roll a pair of dice 12 | roll_pair <- function() { 13 | roll_die(2) 14 | } 15 | 16 | # sum of spots 17 | sum_rolls <- function(times = 1) { 18 | sum(roll_die(times)) 19 | } 20 | 21 | # product of numbers 22 | prod_rolls <- function(times = 1) { 23 | prod(roll_die(times)) 24 | } 25 | 26 | # check whether 'x' is multiple of 'num' 27 | is_multiple <- function(x, num) { 28 | x %% num == 0 29 | } 30 | 31 | # find the y-max value for ylim 
in barplot() 32 | find_ymax <- function(x, num) { 33 | if (is_multiple(x, num)) { 34 | return(max(x)) 35 | } else { 36 | return(num * ((x %/% num) + 1)) 37 | } 38 | } 39 | 40 | # reps <- 100 41 | # total_points <- sapply(rep(2, reps), sum_rolls) 42 | # prop_points <- 100 * table(total_points) / reps 43 | # barplot(prop_points, las = 1, 44 | # space = 0, ylim = c(0, 30), 45 | # main = sprintf("%s Repetitions", reps)) 46 | 47 | 48 | 49 | 50 | # function that adds spots to a given result 51 | add_rolls <- function(given) { 52 | results <- rep(0, length(given) * 6) 53 | aux <- 1 54 | for (i in 1:length(given)) { 55 | for (j in 1:6) { 56 | results[aux] <- given[i] + j 57 | aux <- aux + 1 58 | } 59 | } 60 | results 61 | } 62 | 63 | # function that computes theoretical probabilities 64 | # for the addition of spots when rolling "k" dice 65 | sum_spots <- function(num_dice) { 66 | # just one die 67 | if (num_dice == 1) { 68 | outcomes <- table(1:6) / (6^num_dice) 69 | } else { 70 | # two or more dice 71 | current <- 1:6 72 | for (k in 2:num_dice) { 73 | current <- add_rolls(current) 74 | } 75 | outcomes <- table(current) / (6^num_dice) 76 | } 77 | outcomes 78 | } 79 | 80 | # sum_spots(1) 81 | # sum_spots(2) 82 | # sum_spots(3) 83 | 84 | 85 | 86 | -------------------------------------------------------------------------------- /apps/ch20-sampling-men/README.md: -------------------------------------------------------------------------------- 1 | # Sampling Men 2 | 3 | This is a Shiny app that illustrates the concept of chance errors in sampling. 4 | 5 | 6 | ## Motivation 7 | 8 | The goal is to provide a visual display for the Introduction example in 9 | __Statistics, Chapter 20: Chance Errors in Sampling__ 10 | 11 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company. 12 | 13 | 14 | ## Data 15 | 16 | The data consists of a box model with 6672 tickets: 3091 __1's__, and 3581 __0's__. 
17 | The tickets marked 1 represent men, while the tickets marked 0 represent women. 18 | The app simulates taking samples from the box. There are two parameters: the sample size and the number of samples (i.e. # of repetitions). 19 | 20 | 21 | ## Plots 22 | 23 | There are two plots: 24 | 25 | 1. The first tab shows a histogram with the number of men in the samples. 26 | 2. The second tab shows a histogram with the percentage of men in the samples. 27 | 28 | 29 | ## How to run it? 30 | 31 | There are many ways to download the app and run it: 32 | 33 | ```R 34 | library(shiny) 35 | 36 | # Easiest way is to use runGitHub 37 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch20-sampling-men") 38 | ``` 39 | -------------------------------------------------------------------------------- /apps/ch20-sampling-men/app.R: -------------------------------------------------------------------------------- 1 | # Title: Sampling Men 2 | # Description: Chance errors when sampling from a box of men and women 3 | # Chapter 20: Chance Errors in Sampling, page 359 4 | # Author: Gaston Sanchez 5 | 6 | library(shiny) 7 | 8 | # Define UI for application that draws a histogram 9 | ui <- fluidPage( 10 | 11 | # Give the page a title 12 | titlePanel("Sampling Men (p 359)"), 13 | 14 | # Generate a row with a sidebar 15 | sidebarLayout( 16 | 17 | # Define the sidebar with one input 18 | sidebarPanel( 19 | fluidRow( 20 | column(5, 21 | numericInput("tickets1", "men [1]", 3091, 22 | min = 1, max = 4000, step = 1)), 23 | column(5, 24 | numericInput("tickets0", "women [0]", 3581, 25 | min = 1, max = 4000, step = 1)) 26 | ), 27 | # helpText('Avg box, SD box'), 28 | verbatimTextOutput("avg_sd_box"), 29 | numericInput("size", label = "Sample Size (# draws):", value = 100, 30 | min = 10, max = 1500, step = 1), 31 | sliderInput("reps", label = "Number of repetitions:", 32 | min = 50, max = 2000, value = 100, step = 50), 33 | numericInput("seed", label = "Random Seed:", 12345, 34 | min = 10000, 
max = 50000, step = 1), 35 | hr(), 36 | helpText('Number average'), 37 | verbatimTextOutput("num_avg"), 38 | helpText('Percent average'), 39 | verbatimTextOutput("perc_avg") 40 | ), 41 | 42 | # Create tabs for plots 43 | mainPanel( 44 | tabsetPanel(type = "tabs", 45 | tabPanel("Number", plotOutput("numberPlot")), 46 | tabPanel("Percentage", plotOutput("percentPlot")) 47 | ) 48 | ) 49 | ) 50 | ) 51 | 52 | 53 | # Define server logic required to draw a histogram 54 | server <- function(input, output) { 55 | 56 | # Number of men 57 | output$avg_sd_box <- renderPrint({ 58 | total <- input$tickets1 + input$tickets0 59 | avg_box <- input$tickets1 / total 60 | sd_box <- sqrt((input$tickets1/total) * (input$tickets0/total)) 61 | cat(sprintf('Avg = %0.3f, SD = %0.3f', avg_box, sd_box)) 62 | }) 63 | 64 | num_men <- reactive({ 65 | tickets <- rep(c(1, 0), c(input$tickets1, input$tickets0)) 66 | 67 | set.seed(input$seed) 68 | size <- input$size 69 | samples <- 1:input$reps 70 | for (i in 1:input$reps) { 71 | samples[i] <- sum(sample(tickets, size = size)) 72 | } 73 | samples 74 | }) 75 | 76 | # Number of men 77 | output$num_avg <- renderPrint({ 78 | round(mean(num_men()), 2) 79 | }) 80 | 81 | # Percentage of men 82 | output$perc_avg <- renderPrint({ 83 | round(100 * mean(num_men() / input$size), 2) 84 | }) 85 | 86 | # Plot with number of men in samples 87 | output$numberPlot <- renderPlot({ 88 | # Render a barplot 89 | barplot(table(num_men()), 90 | space = 0, las = 1, 91 | xlab = 'Number of men', 92 | ylab = '', 93 | main = 'Sample Men') 94 | }) 95 | 96 | # Plot with percentage of men in samples 97 | output$percentPlot <- renderPlot({ 98 | # Render a barplot 99 | percentage_men <- round(100 * num_men() / input$size) 100 | barplot(table(percentage_men) / length(num_men()), 101 | space = 0, las = 1, 102 | xlab = 'Percentage of men', 103 | ylab = 'Proportion', 104 | main = 'Sample Men') 105 | }) 106 | 107 | } 108 | 109 | # Run the application 110 | shinyApp(ui = ui, server = 
server) 111 | 112 | -------------------------------------------------------------------------------- /apps/ch21-accuracy-percentages/README.md: -------------------------------------------------------------------------------- 1 | # Ch21 - Percent Estimation 2 | 3 | This is a Shiny app that illustrates the concept of accuracy of percentages. 4 | In other words, confidence intervals when esitmating a percentage. 5 | 6 | 7 | ## Motivation 8 | 9 | The goal is to provide a visual display for the various examples in 10 | __Statistics, Chapter 21: Accuracy of Percentages__ 11 | 12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). 13 | Fourth Edition. Norton & Company. 14 | 15 | 16 | ## Data 17 | 18 | - The data consists of a box model with two types of tickets: 0's and 1's. 19 | - The user can specify the nummber of both types of tickets (# of 1's, # of 0's). 20 | - The app simulates drawing tickets with replacement from the box. 21 | - There are three arguments: 22 | + the number of draws (i.e. sample size) 23 | + the number samples (i.e. # of repetitions) 24 | + the confidence level 25 | 26 | 27 | ## Plots 28 | 29 | There are three plots: 30 | 31 | 1. The first tab shows a histogram for the sum of draws. 32 | 2. The second tab shows a histogram for the percentage of tickets 1's. 33 | 3. The third tab shows a chart with the percentage of the box (i.e. population percentage), 34 | and the confidence intervals of the drawn samples. 35 | 36 | 37 | ## How to run it? 
```R
library(shiny)

# Easiest way is to use runGitHub
runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch21-accuracy-percentages")
```

--------------------------------------------------------------------------------
/apps/ch21-accuracy-percentages/app.R:
--------------------------------------------------------------------------------
# Box with two types of tickets [# 1's, # 0's]
# Drawing tickets from the box
# Chapter 21: Accuracy of Percentages

library(shiny)

source('helpers.R')

# Define the overall UI
ui <- fluidPage(

  # Give the page a title
  titlePanel("Accuracy of Percentages"),

  # Generate a row with a sidebar
  sidebarLayout(

    # Define the sidebar with the inputs
    sidebarPanel(
      fluidRow(
        column(5,
               numericInput("tickets1", "# Tickets 1", 5,
                            min = 1, max = 100, step = 1)),
        column(5,
               numericInput("tickets0", "# Tickets 0", 5,
                            min = 1, max = 200, step = 1))
      ),
      helpText('Avg of box, and SD of box'),
      verbatimTextOutput("avg_sd_box"),
      hr(),
      sliderInput("draws", label = "Sample size (# draws):", value = 25,
                  min = 5, max = 500, step = 1),
      numericInput("reps", label = "Number of samples (# reps):",
                   min = 10, max = 1000, value = 50, step = 10),
      checkboxInput('param', value = TRUE, label = strong('Show parameter')),
      sliderInput("confidence", label = "Confidence level (%):", value = 68,
                  min = 1, max = 99, step = 1),
      numericInput("seed", label = "Random Seed:", 12345,
                   min = 10000, max = 50000, step = 1)
    ),

    # Create tabs for the plots
    mainPanel(
      tabsetPanel(type = "tabs",
                  tabPanel("Sum", plotOutput("sumPlot")),
                  tabPanel("Percentage", plotOutput("percentPlot")),
                  tabPanel("Estimates", plotOutput("intervalPlot"))
      )
    )
  )
)



# Define server logic required to draw the plots
server <- function(input, output) {
  # Tickets in the box: 1's followed by 0's
  tickets <- reactive({
    c(rep(1, input$tickets1), rep(0, input$tickets0))
  })

  # Sum of draws (with replacement) for each simulated sample
  sum_draws <- reactive({
    set.seed(input$seed)
    samples <- numeric(input$reps)
    for (i in seq_len(input$reps)) {
      samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE))
    }
    samples
  })

  avg_box <- reactive({
    mean(tickets())
  })

  sd_box <- reactive({
    total <- input$tickets1 + input$tickets0
    sqrt((input$tickets1 / total) * (input$tickets0 / total))
  })

  # Average and SD of box
  output$avg_sd_box <- renderPrint({
    cat(avg_box(), ", ", sd_box(), sep = '')
  })

  # Plot with sum of draws
  output$sumPlot <- renderPlot({
    # Render a barplot
    barplot(table(sum_draws()),
            space = 0, las = 1,
            xlab = 'Sum',
            ylab = '',
            main = sprintf('Sum of Box for %s Draws', input$draws))
  })

  # Plot with percentage of draws
  output$percentPlot <- renderPlot({
    # Render a barplot
    avg_draws <- round(sum_draws() / input$draws, 2)
    barplot(table(avg_draws),
            space = 0, las = 1,
            xlab = 'Percentage',
            ylab = '',
            main = "Percentage of 1's")
  })

  # Plot with confidence intervals
  output$intervalPlot <- renderPlot({
    avg_box <- mean(tickets())
    n <- length(tickets())
    sd_box <- sqrt((n-1)/n) * sd(tickets())
    se_sum <- sqrt(input$draws) * sd_box
    se_perc <- se_sum / input$draws

    # Render plot
    samples <- sum_draws() / input$draws

    #a <- samples - se_perc
    #b <- samples + se_perc

    # Interval endpoints; red intervals miss the box average, blue ones cover it
    a <- samples - ci_factor(input$confidence) * se_perc
    b <- samples + ci_factor(input$confidence) * se_perc
    covers <- (a <= avg_box & avg_box <= b)
    ci_cols <- rep('#ff000088', input$reps)
    ci_cols[covers] <- '#0000ff88'

    #xlim <- c(min(samples) - ci_factor(input$confidence) * se_perc,
    #          max(samples) + ci_factor(input$confidence) * se_perc)
    xlim <- c(min(samples) - 3 * se_perc,
              max(samples) + 3 * se_perc)
    plot(samples, seq_along(samples), axes = FALSE,
         col = '#444444', pch = 21, cex = 0.5,
         xlim = xlim,
         ylab = 'Number of samples',
         xlab = "Confidence Intervals",
         main = "Percentage of 1's")
    axis(side = 1, at = seq(0, 1, 0.1))
    axis(side = 2, las = 1)
    if (input$param) {
      # display line for parameter
      abline(v = avg_box, col = '#0000FFdd', lwd = 2.5)
    }
    segments(x0 = a,
             x1 = b,
             y0 = seq_along(samples),
             y1 = seq_along(samples),
             col = ci_cols)
  })

}


# Run the application
shinyApp(ui = ui, server = server)


--------------------------------------------------------------------------------
/apps/ch21-accuracy-percentages/helpers.R:
--------------------------------------------------------------------------------
# function to compute the SE factor (normal-curve multiplier)
# for a two-sided confidence level given in percent
ci_factor <- function(level = 95) {
  area <- level + ((100 - level) / 2)
  qnorm(area/100)
}

# tests

# 90% confidence level
# ci_factor(90)

# 95% confidence level
# ci_factor(95)

# 99% confidence level
# ci_factor(99)

--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/README.md:
--------------------------------------------------------------------------------
# Ch23 - Accuracy of Averages

This is a Shiny app that illustrates the concept of accuracy of averages.


## Motivation

The goal is to provide a visual display for the Introduction example in
__Statistics, Chapter 23: Accuracy of Averages__

Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007).
Fourth Edition. Norton & Company.
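
As a rough illustration of the idea, the sketch below draws repeated samples with replacement from a box holding the tickets 1 to 7 and checks how often an interval of the form "sample average ± factor × SE" covers the average of the box. It is not taken from the app: the names `box`, `n_draws`, `n_reps` and `level`, and the values given to them, are assumptions chosen for this example. The SD of the box (divisor n) and the SE for the average are computed the way the box-model calculations in this repository compute them.

```R
# Illustrative sketch, not part of the app.
# All names and values below are assumptions made for this example.
box     <- 1:7     # tickets in the box
n_draws <- 25      # sample size (number of draws)
n_reps  <- 50      # number of samples
level   <- 95      # confidence level, in percent

set.seed(12345)

# SD of the box: population SD (divisor n), hence the correction of sd()
n_box  <- length(box)
sd_box <- sqrt((n_box - 1) / n_box) * sd(box)

# SE for the average of the draws
se_avg <- sd_box / sqrt(n_draws)

# Normal-curve multiplier for a two-sided interval
z <- qnorm((level + (100 - level) / 2) / 100)

# Average of each simulated sample (draws made with replacement)
avgs <- replicate(n_reps, mean(sample(box, size = n_draws, replace = TRUE)))

# Does each interval cover the average of the box?
covers <- (avgs - z * se_avg <= mean(box)) & (mean(box) <= avgs + z * se_avg)

cat(sprintf("SD of box = %.3f, SE of average = %.3f, coverage = %.0f%%\n",
            sd_box, se_avg, 100 * mean(covers)))
```

For this box the SD is 2, so the SE for the average of 25 draws is 0.4. The observed coverage depends on the seed and should only be expected to approach the nominal level when the number of samples is large.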


## Data

The data consists of a box model with default tickets: 1, 2, 3, 4, 5, 6, 7.
However, the numbers in the box can be changed by the user.
The app simulates taking random samples from the box.
There are two parameters: the number of draws (i.e. sample size),
and the number of samples (i.e. # of repetitions).


## Plots

There are three plots:

1. The first tab shows a histogram for the sum of draws.
2. The second tab shows a histogram for the average of draws.
3. The third tab shows a chart with the average of the box (i.e. population avg),
and the confidence intervals of the drawn samples (i.e. sample averages).


## How to run it?

```R
library(shiny)

# Easiest way is to use runGitHub
runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch23-accuracy-averages")
```

--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/app.R:
--------------------------------------------------------------------------------
# Box with numbered tickets (entered by the user)
# Drawing tickets from the box
# Chapter 23: Accuracy of Averages

library(shiny)

source('helpers.R')

# Define the overall UI
ui <- fluidPage(

  # Give the page a title
  titlePanel("Accuracy of Averages"),

  # Generate a row with a sidebar
  sidebarLayout(

    # Define the sidebar with the inputs
    sidebarPanel(
      textInput("tickets", label = "Numbers in box:",
                value = '1, 2, 3, 4, 5, 6, 7'),
      helpText('Avg of box, and SD of box'),
      verbatimTextOutput("avg_sd_box"),
      hr(),
      sliderInput("draws", label = "Sample size (# draws):", value = 25,
                  min = 5, max = 500, step = 1),
      sliderInput("reps", label = "Number of samples (# reps):",
                  min = 10, max = 1000, value = 50, step = 10),
      checkboxInput('param', value = TRUE, label = strong('Show parameter')),
      sliderInput("confidence", label = "Confidence level (%):", value = 68,
                  min = 1, max = 99, step = 1),
      numericInput("seed", label = "Random Seed:", 12345,
                   min = 10000, max = 50000, step = 1)
    ),

    # Create tabs for the plots
    mainPanel(
      tabsetPanel(type = "tabs",
                  tabPanel("Sum", plotOutput("sumPlot")),
                  tabPanel("Average", plotOutput("averagePlot")),
                  tabPanel("Estimates", plotOutput("intervalPlot"))
      )
    )
  )
)



# Define server logic required to draw the plots
server <- function(input, output) {
  # Parse the comma-separated text input into a numeric vector of tickets
  tickets <- reactive({
    tickets <- gsub(' ', '', input$tickets)
    tickets <- unlist(strsplit(tickets, ','))
    as.numeric(tickets)
  })

  # Sum of draws (with replacement) for each simulated sample
  sum_draws <- reactive({
    set.seed(input$seed)
    samples <- numeric(input$reps)
    for (i in seq_len(input$reps)) {
      samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE))
    }
    samples
  })

  avg_box <- reactive({
    mean(tickets())
  })

  # SD of the box (divisor n, not n - 1)
  sd_box <- reactive({
    n <- length(tickets())
    sqrt((n-1)/n) * sd(tickets())
  })

  # Average and SD of box
  output$avg_sd_box <- renderPrint({
    cat(avg_box(), ", ", sd_box(), sep = '')
  })

  # Plot with sum of draws
  output$sumPlot <- renderPlot({
    # Render a barplot
    barplot(table(sum_draws()),
            space = 0, las = 1,
            xlab = 'Sum',
            ylab = '',
            main = sprintf('Sum of Box for %s Draws', input$draws))
  })

  # Plot with average of draws
  output$averagePlot <- renderPlot({
    # Render a barplot
    avg_draws <- round(sum_draws() / input$draws, 2)
    barplot(table(avg_draws),
            space = 0, las = 1,
            xlab = 'Average',
            ylab = '',
            main = "Average")
  })

  # Plot with confidence intervals
  output$intervalPlot <- renderPlot({
    avg_box <- mean(tickets())
    n <- length(tickets())
    sd_box <- sqrt((n-1)/n) * sd(tickets())
    se_sum <- sqrt(input$draws) * sd_box
    se_avg <- se_sum / input$draws

    # Render plot
    samples <- sum_draws() / input$draws

    #a <- samples - se_avg
    #b <- samples + se_avg

    # Interval endpoints; red intervals miss the box average, blue ones cover it
    a <- samples - ci_factor(input$confidence) * se_avg
    b <- samples + ci_factor(input$confidence) * se_avg
    covers <- (a <= avg_box & avg_box <= b)
    ci_cols <- rep('#ff000088', input$reps)
    ci_cols[covers] <- '#0000ff88'

    #xlim <- c(min(samples) - ci_factor(input$confidence) * se_avg,
    #          max(samples) + ci_factor(input$confidence) * se_avg)
    xlim <- c(min(samples) - 3 * se_avg,
              max(samples) + 3 * se_avg)
    plot(samples, seq_along(samples), axes = FALSE,
         col = '#444444', pch = 21, cex = 0.5,
         xlim = xlim,
         ylab = 'Number of samples',
         xlab = "Confidence Intervals",
         main = "Average")
    axis(side = 1)
    axis(side = 2, las = 1)
    if (input$param) {
      # display line for parameter
      abline(v = avg_box, col = '#0000FFdd', lwd = 2.5)
    }
    segments(x0 = a,
             x1 = b,
             y0 = seq_along(samples),
             y1 = seq_along(samples),
             col = ci_cols)
  })

}


# Run the application
shinyApp(ui = ui, server = server)


--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/helpers.R:
--------------------------------------------------------------------------------
# function to compute the SE factor (normal-curve multiplier)
# for a two-sided confidence level given in percent
ci_factor <- function(level = 95) {
  area <- level + ((100 - level) / 2)
  qnorm(area/100)
}

# tests

# 90% confidence level
# ci_factor(90)

# 95% confidence level
# ci_factor(95)

# 99% confidence level
# ci_factor(99)

--------------------------------------------------------------------------------
/data/stock-earnings-prices.csv:
--------------------------------------------------------------------------------
"industry","earnings","price"
"auto",3.3,2.9
"banks",8.6,6.5
"chemicals",6.6,3.1
"computers",10.2,5.3
"drugs",11.3,10.0
"electrical equipment",8.5,8.2
"food",7.6,6.5
"household products",9.7,10.1
"machinery",5.1,4.7
"oil domestic",7.4,7.3
"oil international",7.7,7.7
"oil equipment",10.1,10.8
"railroad",6.6,6.6
"retail food",6.9,6.9
"department stores",10.1,9.5
"soft drinks",12.7,12.0
"steel",-1.0,-1.6
"tobacco",12.3,11.7
"utilities electric",2.8,1.4
"utilities gas",5.2,6.2

--------------------------------------------------------------------------------
/data/vegetables-smoking.csv:
--------------------------------------------------------------------------------
state,vegetables,smoking
Alabama,20.1,18.8
Alaska,24.8,18.8
Arizona,23.7,13.7
Arkansas,21,18.1
California,28.9,9.8
Colorado,24.5,13.5
Connecticut,27.4,12.4
Delaware,21.3,15.5
Florida,26.2,15.2
Georgia,23.2,16.4
Hawaii,24.5,12.1
Idaho,23.2,13.3
Illinois,24,14.2
Indiana,22,20.8
Iowa,19.5,16.1
Kansas,19.9,13.6
Kentucky,16.8,23.5
Louisiana,20.2,16.4
Maine,28.7,15.9
Maryland,28.7,13.4
Massachusetts,28.6,13.5
Michigan,22.8,16.7
Minnesota,24.5,14.9
Mississippi,16.5,18.6
Missouri,22.6,18.5
Montana,24.7,14.5
Nebraska,20.2,16.1
Nevada,22.5,16.6
NewHampshire,29.1,15.4
NewJersey,25.9,12.8
NewMexico,21.5,14.6
NewYork,26,14.6
NorthCarolina,22.5,17.1
NorthDakota,21.8,15
Ohio,22.6,17.6
Oklahoma,15.7,19
Oregon,25.9,13.4
Pennsylvania,23.9,17.9
RhodeIsland,26.8,15.3
SouthCarolina,21.2,17
SouthDakota,20.5,13.8
Tennessee,26.5,20.4
Texas,22.6,13.2
Utah,22.1,8.5
Vermont,30.8,14.4
Virginia,26.2,15.3
Washington,25.2,12.5
WestVirginia,20,21.3
Wisconsin,22.2,15.9
Wyoming,21.8,16.3

--------------------------------------------------------------------------------
/hw/README.md:
--------------------------------------------------------------------------------
## Homework Assignments

- HW assignments are due on Thursdays (before midnight).
- Further instructions will be posted on bCourses (see "Assignments" section).
- Submit your homework electronically via bCourses as a word, text, pdf, or html file.
- Please do NOT submit any other file format (e.g. `.pages`, `.Rmd`, `.R`) since it won't be rendered on bCourses.
- Please become familiar with the HW policy described in the syllabus.


Tentative Calendar, Spring 2017

| HW | Due    | Topic |
|----|--------|-------|
| 1  | Jan 26 | Ch-3: A3,7, C2, R4,8, extra questions |
| 2  | Feb 02 | Ch-4: B5, D8, E4, R6,9, extra questions |
| 3  | Feb 09 | Ch-5: C1, D1, E3, Rev7,10<br>Ch-8: B6,8,9, R9, extra questions |
| 4  | Feb 16 | Ch-9: A10, B2, E3, R4,8<br>Ch-10: A2,4, C4, extra questions |
| 5  | Feb 23 | Ch-10: C2, R3,4<br>Ch-11: B1,2, D2, E1, R4,7<br>Ch-12: R3,5 |
| 6  | Mar 02 | Ch-13: R2,4,7,8,9 |
| 7  | Mar 09 | Ch-14: R1,3,5,6,9<br>and Binomial Probability |
| 8  | Mar 16 | Ch-16: B2,6, R1,4,9<br>Ch-17: A1, B2, C1, E1, R2 |
| 9  | Mar 23 | Ch-18: B3, B5, C5, R2<br>Ch-19: R5,7<br>Ch-20: A4, B3, Rev3,4,6 |
| 10 | Apr 13 | Ch-21: A7,8, B4, C6,7, R2,7<br>Ch-23: A2,5, C2, R4,10,12 |
| 11 | Apr 20 | Ch-26: B5, C1, F1, F7, R2,5,7,8,9 |
| 12 | Apr 27 | Ch-26: F4, R1; Ch-27: R1, R10<br>Ch-29: R4,7,9,11 |
100 | -------------------------------------------------------------------------------- /hw/hw01-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw01-questions.pdf -------------------------------------------------------------------------------- /hw/hw02-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw02-questions.pdf -------------------------------------------------------------------------------- /hw/hw03-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw03-questions.pdf -------------------------------------------------------------------------------- /hw/hw04-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw04-questions.pdf -------------------------------------------------------------------------------- /hw/hw05-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw05-questions.pdf -------------------------------------------------------------------------------- /hw/hw06-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw06-questions.pdf 
-------------------------------------------------------------------------------- /hw/hw07-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw07-questions.pdf -------------------------------------------------------------------------------- /hw/hw08-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw08-questions.pdf -------------------------------------------------------------------------------- /hw/hw09-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw09-questions.pdf -------------------------------------------------------------------------------- /hw/hw10-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw10-questions.pdf -------------------------------------------------------------------------------- /hw/hw11-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw11-questions.pdf -------------------------------------------------------------------------------- /hw/hw12-questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw12-questions.pdf -------------------------------------------------------------------------------- 
/labs/README.md:
--------------------------------------------------------------------------------
## Lab Discussions

Tentative Calendar, Spring 2017

| Week | Lab | Topic |
|------|-----|-------|
| 1  | Jan 23-24<br>Jan 25-26 | Ch-3: A4,5,6, B2, C4<br>Ch-4: A4,5,6,9, B3,4 |
| 2  | Jan 30-31<br>Feb 01-02 | Ch-4: C4,5, D1,2,6, E1,5,6,7<br>Ch-5: A1,2, B1,2,5, F1, R11 |
| 3  | Feb 06-07<br>Feb 08-09 | Ch-8: A5,6, B4,7, D2,3<br>Ch-9: A2-5,9, B1, C4, D1, E4 |
| 4  | Feb 13-14<br>Feb 15-16 | Ch-10: A1, B1,4, C1,5, D1, E1,2<br>Ch-11: A4,6,8 |
| 5  | Feb 21<br>Feb 22-23<br>Feb 24 | Ch-11: B1,2,3, D4,5,6,7, E2,3<br>Ch-12: A2, B3,4,5<br>Test 1 |
| 6  | Feb 28<br>Mar 02 | Ch-13: A1, B1,2, C1, D2,4,5,6<br>Ch-14: B1,3, C3, D4 |
| 7  | Mar 07<br>Mar 09 | Ch-15: A3-6, R1,5,6<br>Ch-16: A3,4, B1,3, C1, R6 |
| 8  | Mar 14<br>Mar 16 | Ch-17: B1,3,5,6<br>Ch-17: D4, R6,11 |
| 9  | Mar 21<br>Mar 23 | Ch-18: B1; C2,7,8, R3,12,13,14<br>Ch-19: A5,6,8,12, R6,12 |
| 10 | Mar 28<br>Mar 30 | Spring Break<br>Spring Break |
| 11 | Apr 04<br>Apr 06<br>Apr 07 | Ch-20: A1,2, B1,2<br>Ch-20: C1,3,5<br>Test 2 |
| 12 | Apr 11<br>Apr 13 | Ch-21: A4,5,6, B3, C4,5, D2, E1,2<br>Ch-23: A1, B1, C3, D1,2,3,4, R3,8 |
| 13 | Apr 18<br>Apr 20 | Ch-26: A4,5, B2, C4,5, D3<br>Extra exercises |
| 14 | Apr 25<br>Apr 27 | Ch-27: A4,5, B2,3,5, D5,6<br>Ch-29: A2, B2,3,4,5,6,7, D2, E2 |

--------------------------------------------------------------------------------
/lectures/README.md:
--------------------------------------------------------------------------------
## Lectures

Tentative Calendar, Spring 2017.

Material based on __Statistics__ (4th edition) by Freedman, Pisani and Purves.


| Week | Date   | Monday                      | Wednesday                | Friday                   |
|------|--------|-----------------------------|--------------------------|--------------------------|
| 0    | Jan-16 |                             | Data and variables       | Intro to R & RStudio     |
| 1    | Jan-23 | Ch 3: Histograms            | Ch 4: Average            | Ch 4: Spread             |
| 2    | Jan-30 | Ch 5: Normal Curve          | Ch 5: Normal Curve       | Ch 8: Correlation        |
| 3    | Feb-06 | Ch 9: More Correlation      | History of Regression    | Ch 10: Regression        |
| 4    | Feb-13 | Ch 11: RMS Error            | Ch 12: Regression line   | Regression in R          |
| 5    | Feb-20 | _Holiday_                   | Review                   | __MIDTERM 1__            |
| 6    | Feb-27 | Ch 13: Probability          | Ch 14: More Probability  | Ch 15: Binomial prob.    |
| 7    | Mar-06 | Ch 16: Law of Averages      | Ch 16: Box Models        | Ch 17: Expected Value    |
| 8    | Mar-13 | Ch 17: Standard Error       | Ch 18: Normal Approx     | Ch 18: Normal Approx     |
| 9    | Mar-20 | Ch 19: Sampling             | Ch 19: Sampling          | Ch 20: Chance Errors     |
| 10   | Mar-27 | _Spring Break_              | _Spring Break_           | _Spring Break_           |
| 11   | Apr-03 | Review                      | Review                   | __MIDTERM 2__            |
| 12   | Apr-10 | Ch 21: Accuracy Percentages | Ch 21: Conf. Intervals   | Ch 23: Accuracy Averages |
| 13   | Apr-17 | Ch 26: Significance Tests   | Ch 26: z-test            | Ch 26: t-test            |
| 14   | Apr-24 | Ch 27: Two-sample z-test    | Ch 27: Two-sample z-test | Ch 29: More about tests  |
| 15   | May-01 | _RRR_                       | _RRR_                    | _RRR_                    |


- May-09: __Final Stat 131A__, 11:30-2:30pm in Birge 50
- May-10: __Final Stat 20__, 3:00-6:00pm in VLSB 2050

-----

## Slides and Scripts

- Jan 18-20:
    + [Data and Variables](https://docs.google.com/presentation/d/1k0Ti3489qKExV-X9VzgOq0rCRk0EcjsEB800TDyvfG0/edit?usp=sharing)
    + In-class: [Getting started with R and RStudio](../scripts/01-R-introduction.pdf)
    + [Intro to R and RStudio](https://docs.google.com/presentation/d/1jtPoAMnT2-56REz-pFZQWSSSzFVHXOI069vrQCA0r6k/edit?usp=sharing) auxiliary slides
    + Practice: script about [data and variables](../scripts/02-data-variables.pdf)
- Jan 23-27:
    + In-class: [Histograms](https://docs.google.com/presentation/d/1D_QNv8HPBRQGqy3ofiJDuLgOpB-awMwwpMchX9n0My4/edit?usp=sharing) slides
    + App: [ch03-histograms](../apps/ch03-histograms)
    + Practice: script about [Histograms in R](../scripts/03-histograms.pdf)
    + In-class: [Measures of Center (Average and Median)](https://docs.google.com/presentation/d/15jjBpSkQmYs99S8A2yvGGR4lwusUcJgBXZYU88158pE/edit?usp=sharing)
    + Practice: script about [Average and median in R](../scripts/04-measures-center.pdf)
    + In-class: [Measures of Spread (RMS, Standard Deviation)](https://docs.google.com/presentation/d/1olNOkShLZTBwEywn1AsuX92PvimntXoKMn7eRDh5MRE/edit?usp=sharing)
    + Practice: script about [measures of spread in R](../scripts/05-measures-spread.pdf)
- Jan 30-Feb 03:
    + In-class: [Normal Curve](https://docs.google.com/presentation/d/1_6ZEhuTCDvxesw6H99nJxnJz7shMIU9Hzq4GzWzw0dE/edit?usp=sharing) slides
    + Practice: script about [normal curve in R](../scripts/06-normal-curve.pdf)
    + In-class: [Scatter Diagrams and Correlation](https://docs.google.com/presentation/d/1qLtoiX8CrpHL70lZ8LBQN0F-xHuwEnhpVNZalaBnSM8/edit?usp=sharing) slides
    + App: [ch08-corr-coeff-diagrams](../apps/ch08-corr-coeff-diagrams)
    + Practice: script about [scatter diagrams in R](../scripts/07-scatter-diagrams.pdf)
- Feb 06-10:
    + In-class: [More about Correlation](https://docs.google.com/presentation/d/1TNmvkcGnhIpZ3N-XLEJwuOcG9tDd6KbdIDzU4K6wivE/edit?usp=sharing) slides
    + In-class: [A bit of history about origins of regression](https://docs.google.com/presentation/d/1VBdCiJn_QmfeTsCzP29RlL4ldjripPdrSXkUSYfq0Rc/edit?usp=sharing) auxiliary slides
    + In-class: [Intro to Regression Method](https://docs.google.com/presentation/d/10eQJ3DxVVuC00mQ5aEBNb0nWZh8oX-vJ5mCJRQH39VA/edit?usp=sharing) slides
    + App: [ch10-heights-data](../apps/ch10-heights-data)
    + Practice: script about [Regression Line with R](../scripts/09-regression-line.pdf)
- Feb 13-17
    + In-class: [R.M.S. Error for Regression](https://docs.google.com/presentation/d/1KSws7X-9jr1YWtJwPUmdnooodMqBMzRLjDWhsgq04Iw/edit?usp=sharing) slides
    + App: [ch11-regression-residuals](../apps/ch11-regression-residuals)
    + Practice: script about [Predictions and Errors in Regression with R](../scripts/10-prediction-and-errors-in-regression.pdf)
    + In-class: [Regression Line](https://docs.google.com/presentation/d/1bEV8MWCZ6xE2zm5egZXq5wcXOGOnHDJiJvj2tTGMhyI/edit?usp=sharing) slides
    + App: [ch11-regression-strips](../apps/ch11-regression-strips)
- Feb 20-24
    + Regression Line
    + __Midterm 1__ Friday Feb-24
- Feb 27-Mar 03
    + In-class: [Probability Rules (part 1)](https://docs.google.com/presentation/d/1cgU096Vr5Ep30rXoQ68940YbbCM7wvpznsC623Zx5N0/edit?usp=sharing)
    + In-class: [Probability Rules (part 2)](https://docs.google.com/presentation/d/1C-bEAHd3naLPxk_WDSrMuWHd9kMdVVo7vh2x9lWaFvc/edit?usp=sharing)
    + In-class: [Binomial Formula](https://docs.google.com/presentation/d/1M6Xk1xwAmdewO1K5lVIAOXz45LcIfvrZOgzQs9EXc1c/edit?usp=sharing)
    + Practice: script about [binomial probability in R](../scripts/11-binomial-formula.pdf)
- Mar 06-10
    + In-class: [Law of Averages](https://docs.google.com/presentation/d/1WDS0RyPXBjo0kgYSC5AIR33Vr78lKbOURXqJ2TMXvtI/edit?usp=sharing)
    + Practice: script about simulating basic [chance process with R](../scripts/12-chance-process.pdf)
    + App: [ch16-chance-error](../apps/ch16-chance-error)
    + In-class: [Expected Value and Standard Error](https://docs.google.com/presentation/d/1QCSwf7zN80253dLYUAkZ3C4h01M6rLTFJ33h1tFD9To/edit?usp=sharing)
    + App: [ch17-demere-games](../apps/ch17-demere-games)
    + App: [ch17-expected-value-std-error](../apps/ch17-expected-value-std-error)
- Mar 13-17
    + In-class: [Probability Histograms and Normal Approximation](https://docs.google.com/presentation/d/1AZ61AYdl1mmT3Uy1XebT8qpTbbR7uqiP0y_n740Vp8E/edit?usp=sharing)
    + App: [ch18-roll-dice-sum](../apps/ch18-roll-dice-sum)
    + App: [ch18-roll-dice-product](../apps/ch18-roll-dice-product)
    + App: [ch18-coin-tossing](../apps/ch18-coin-tossing)
- Mar 20-24
    + In-class: [Sample Surveys](https://docs.google.com/presentation/d/1n-zZKPrpCoNqhf1hnDlUNVx-XL_qxdZgwngeKWMULiM/edit?usp=sharing)
    + In-class: [Sample Designs](https://docs.google.com/presentation/d/1KWmjAxrSNM7hRjWPh9veTLl8_FLneJozhA-6OUSLqK8/edit?usp=sharing)
    + In-class: [Chance Errors in Sampling](https://docs.google.com/presentation/d/1jRFpoepvu7RWwl6fsxPD7wFkdhZk83dlLwzb9SNMXSE/edit?usp=sharing)
    + App: [ch20-sampling-men](../apps/ch20-sampling-men)
- Apr 03-07
    + Review
    + __Midterm 2__ Friday Apr-07
- Apr 10-14
    + In-class: [Accuracy of Percentages](https://docs.google.com/presentation/d/1Ia5dA9BuEHUTX0dxLRJ9RervShAHmtqk8Si8hXPak-0/edit?usp=sharing)
    + App: [ch21-accuracy-percentages](../apps/ch21-accuracy-percentages)
    + In-class: [Accuracy of Averages](https://docs.google.com/presentation/d/1FnUXMu_5qYST5Stou895O_vUjdAULxeyxhvrGImAVEA/edit?usp=sharing)
    + App: [ch23-accuracy-averages](../apps/ch23-accuracy-averages)
- Apr 17-21
    + In-class: [Hypothesis Tests](https://docs.google.com/presentation/d/1FQN-qh-plq87aB1d2vOUoi3YVLl6LE28uYUhXS5RFcI/edit?usp=sharing)
    + In-class: [One sample z-test](https://docs.google.com/presentation/d/1HhVMfQ0n8iebx527qscSFtj3wHQAqFk2xSSEnitu91g/edit?usp=sharing)
    + In-class: [One sample t-test](https://docs.google.com/presentation/d/1GTWOiwk4Gkeh_nXnKKK47hCcE1sWMGT-s9Q313VTUFM/edit?usp=sharing)
- Apr 24-28
    + In-class: [Two sample z-test](https://docs.google.com/presentation/d/19PpdMovtJSbydDAc1Mv1wh3Mu5YWT0dFCh5aPcdE6dU/edit?usp=sharing)
- May 01-05
    + RRR Week
- May 07-12
    + __Final: Stat 131A__ Tue May-09
    + __Final: Stat 20__ Wed May-10

--------------------------------------------------------------------------------
/other/Karl-Pearson-and-the-origins-of-modern-statistics.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/The-strange-science-of-Francis-Galton.pdf -------------------------------------------------------------------------------- /other/formula-sheet-final.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-final.pdf -------------------------------------------------------------------------------- /other/formula-sheet-midterm1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-midterm1.pdf -------------------------------------------------------------------------------- /other/formula-sheet-midterm2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-midterm2.pdf -------------------------------------------------------------------------------- /other/standard-normal-table.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/standard-normal-table.pdf -------------------------------------------------------------------------------- /other/t-table.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/t-table.pdf -------------------------------------------------------------------------------- /other/z-table.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/z-table.pdf -------------------------------------------------------------------------------- /scripts/01-R-introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Getting started with R" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof. Gaston Sanchez" 5 | header-includes: \usepackage{float} 6 | output: html_document 7 | urlcolor: blue 8 | --- 9 | 10 | > ### Learning Objectives 11 | > 12 | > - Complete installation of R and RStudio 13 | > - Get started with R as a scientific calculator 14 | > - First steps using RStudio 15 | > - Getting help in R 16 | > - Installing packages 17 | > - Using R script files 18 | > - Using Rmd files 19 | > - Get to know markdown syntax 20 | 21 | ------ 22 | 23 | ## R and RStudio 24 | 25 | - Install __R__ 26 | - R for Mac: [https://cran.r-project.org/bin/macosx/](https://cran.r-project.org/bin/macosx/) 27 | - R for windows: [https://cran.r-project.org/bin/windows/base/](https://cran.r-project.org/bin/windows/base/) 28 | - Install __RStudio__ 29 | - RStudio download (desktop version): [https://www.rstudio.com/products/rstudio/download/](https://www.rstudio.com/products/rstudio/download/) 30 | 31 | 32 | ### Difference between R-GUI and RStudio 33 | 34 | The default installation of R comes with R-GUI which is a simple graphical 35 | user interface. In contrast, RStudio is an _Integrated Development Environment_ 36 | (IDE). This means that RStudio is much more than a simple GUI, providing a nice 37 | working environment and development framework. In this course, you will use R 38 | mainly for doing computations and plots, not really for programming purposes. 39 | And you are going to interact with R via RStudio, using the so-called __Rmd__ 40 | files. 
41 | 42 | ----- 43 | 44 | ## R as a scientific calculator 45 | 46 | Open RStudio and locate the _console_ (or prompt) pane. 47 | Let's start typing basic things in the console, using R as a scientific calculator: 48 | 49 | ```r 50 | # addition 51 | 1 + 1 52 | 2 + 3 53 | 54 | # subtraction 55 | 4 - 2 56 | 5 - 7 57 | 58 | # multiplication 59 | 10 * 0 60 | 7 * 7 61 | 62 | # division 63 | 9 / 3 64 | 1 / 2 65 | 66 | # power 67 | 2 ^ 2 68 | 3 ^ 3 69 | ``` 70 | 71 | 72 | ### Functions 73 | 74 | R has many functions. To use a function, type its name followed by parentheses. 75 | Inside the parentheses you pass an input. Most functions will produce some 76 | type of output: 77 | 78 | ```r 79 | # absolute value 80 | abs(10) 81 | abs(-4) 82 | 83 | # square root 84 | sqrt(9) 85 | 86 | # natural logarithm 87 | log(2) 88 | ``` 89 | 90 | 91 | ### Comments in R 92 | 93 | All programming languages use a set of characters to indicate that a 94 | specific part or lines of code are __comments__, that is, things that are 95 | not to be executed. R uses the hash or pound symbol `#` to specify comments. 96 | Any code to the right of `#` will not be executed by R. 97 | 98 | ```r 99 | # this is a comment 100 | # this is another comment 101 | 2 * 9 102 | 103 | 4 + 5 # you can place comments like this 104 | ``` 105 | 106 | 107 | ### Variables and Assignment 108 | 109 | R is more powerful than a calculator, and you can do many more things with it than 110 | with most scientific calculators. One of the things you will be 111 | doing a lot in R is creating variables or objects to store values. 112 | 113 | For instance, you can create a variable `x` and give it the value of 1. 114 | This is done using what is known as the __assignment operator__ `<-`, 115 | also known in R as the _arrow_ operator: 116 | 117 | ```r 118 | x <- 1 119 | x 120 | ``` 121 | 122 | This is a way to tell R: "create an object `x` and store in it the number 1".
123 | Alternatively, you can use the equals sign `=` as an assignment operator: 124 | 125 | ```r 126 | y = 2 127 | y 128 | ``` 129 | 130 | With variables, you can operate the way you do algebraic operations (addition, subtraction, multiplication, division, power, etc): 131 | 132 | ```r 133 | x + y 134 | x - y 135 | x * y 136 | x / y 137 | x ^ y 138 | ``` 139 | 140 | 141 | ### Case Sensitive 142 | 143 | R is case sensitive. This means that `abs()` is not the same 144 | as `Abs()` or `ABS()`. Only the function `abs()` is the valid one. 145 | 146 | ```r 147 | # case sensitive 148 | x = 1 149 | X = 2 150 | x + x 151 | x + X 152 | X + X 153 | ``` 154 | 155 | 156 | ### Some Examples 157 | 158 | Here are some examples that illustrate how to use R to define 159 | variables and perform basic calculations: 160 | 161 | ```r 162 | # convert Fahrenheit degrees to Celsius degrees 163 | fahrenheit = 50 164 | celsius = (fahrenheit - 32) * (5/9) 165 | celsius 166 | 167 | 168 | # compute the area of a rectangle 169 | rec_length = 10 170 | rec_height = 5 171 | rec_area = rec_length * rec_height 172 | rec_area 173 | 174 | 175 | # degrees to radians 176 | deg = 90 177 | rad = (deg * pi) / 180 178 | rad 179 | ``` 180 | 181 | ----- 182 | 183 | ## More about RStudio 184 | 185 | You will be working with RStudio a lot, and you will have time to learn 186 | many of the bells and whistles RStudio provides. Think about RStudio as 187 | your "workbench". Keep in mind that RStudio is NOT R. RStudio is an environment 188 | that makes it easier to work with R, while taking care of the little tasks that 189 | can be a hassle. 190 | 191 | 192 | ### A quick tour of RStudio 193 | 194 | - Understand the __pane layout__ (i.e. 
windows) of RStudio 195 | - Source 196 | - Console 197 | - Environment, History, etc 198 | - Files, Plots, Packages, Help, Viewer 199 | - Customize RStudio Appearance of source pane 200 | - font 201 | - size 202 | - background 203 | 204 | 205 | ### Using an R script file 206 | 207 | Most of the time you won't be working directly on the console. 208 | Instead, you will be typing your commands in some _source_ file. 209 | The most basic type of source file is known as an _R script file_. 210 | Open a new script file in the _source_ pane, and rewrite the 211 | previous commands. 212 | 213 | You can copy the commands in your source file and paste them in the 214 | console. But that's not very efficient. Find out how to run (execute) 215 | the commands (in your source file) and pass them to the console pane. 216 | 217 | 218 | ### Getting help 219 | 220 | Because we work with functions all the time, it's important to know certain 221 | details about how to use them, what input(s) are required, and what 222 | output is returned. 223 | 224 | There are several ways to get help. 225 | 226 | If you know the name of the function you want to learn more about, 227 | you can use the function `help()` and pass it the name of the function you 228 | are looking for: 229 | 230 | ```r 231 | # documentation about the 'abs' function 232 | help(abs) 233 | 234 | # documentation about the 'mean' function 235 | help(mean) 236 | ``` 237 | 238 | Alternatively, you can use a shortcut using the question mark `?` followed 239 | by the name of the function: 240 | 241 | ```r 242 | # documentation about the 'abs' function 243 | ?abs 244 | 245 | # documentation about the 'mean' function 246 | ?mean 247 | ``` 248 | 249 | - How to read the manual documentation: 250 | - Title 251 | - Description 252 | - Usage of function 253 | - Arguments 254 | - Details 255 | - See Also 256 | - Examples!!! 257 | 258 | `help()` only works if you know the name of the function you are looking for.
259 | Sometimes, however, you don't know the name but you may know some keywords. 260 | To look for related functions associated with a keyword, use `help.search()` or 261 | simply `??`: 262 | 263 | ```r 264 | # search for 'absolute' 265 | help.search("absolute") 266 | 267 | # alternatively you can also search like this: 268 | ??absolute 269 | ``` 270 | 271 | Notice the use of quotes surrounding the input name inside `help.search()`. 272 | 273 | 274 | ### Installing Packages 275 | 276 | R comes with a large set of functions and packages. A package is a collection 277 | of functions that have been designed for a specific purpose. One of the great 278 | advantages of R is that many analysts, scientists, programmers, and users 279 | can create their own packages and make them available for everybody to use. 280 | R packages can be shared in different ways. The most common way to share a 281 | package is to submit it to what is known as __CRAN__, the 282 | _Comprehensive R Archive Network_. 283 | 284 | You can install a package using the `install.packages()` function. 285 | Just give it the name of a package, surrounded by quotes, and R will look for 286 | it in CRAN, and if it finds it, R will download it to your computer. 287 | 288 | ```r 289 | # installing 290 | install.packages("knitr") 291 | ``` 292 | 293 | You can also install a bunch of packages at once: 294 | 295 | ```r 296 | install.packages(c("readr", "ggplot2")) 297 | ``` 298 | 299 | The installation of a package needs to be done only once. 300 | After a package has been installed, you can start using its functions 301 | by _loading_ the package with the function `library()`: 302 | 303 | ```r 304 | library(knitr) 305 | ``` 306 | 307 | 308 | ### Your turn 309 | 310 | - Install packages `"stringr"`, `"RColorBrewer"` 311 | - Calculate: $3x^2 + 4x + 8$ when $x = 2$ 312 | - Look for the manual (i.e.
help) documentation of the function `exp` 313 | - Find out how to look for information about binary operators 314 | like `+` or `^` 315 | - There are several tabs in the pane `Files, Plots, Packages, Help, Viewer`. 316 | Find out what the tab __Files__ is good for. 317 | 318 | ----- 319 | 320 | ## Introduction to Rmd files 321 | 322 | Besides using R script files to write source code, you will be using another 323 | type of source file known as an _R markdown_ file, simply called an `Rmd` file. 324 | These files use a special syntax called 325 | [markdown](https://en.wikipedia.org/wiki/Markdown). 326 | 327 | 328 | ### Get to know the `Rmd` files 329 | 330 | In the menu bar of RStudio, click on __File__, then __New File__, 331 | and choose __R Markdown__. Select the default option "Document" (HTML output), 332 | and click __Ok__. 333 | 334 | __Rmd__ files are a special type of file, referred to as a _dynamic document_, 335 | that allows you to combine narrative (text) with R code. It is extremely 336 | important that you quickly become familiar with this resource. One reason is 337 | that you can use Rmd files to write your homework assignments and convert them 338 | to HTML, Word, or PDF files. 339 | 340 | Locate the button __Knit__ (the one with a knitting icon) and click on it 341 | so you can see how `Rmd` files are rendered and displayed as HTML documents. 342 | 343 | 344 | ### Yet Another Syntax to Learn 345 | 346 | R markdown (`Rmd`) files use [markdown](https://daringfireball.net/projects/markdown/) 347 | as the main syntax to write content that is not R code. Markdown is a very 348 | lightweight type of markup language, and it is relatively easy to learn.
349 | 350 | 351 | ### Your turn 352 | 353 | If you are new to Markdown, please take a look at the following tutorials: 354 | 355 | - [www.markdown-tutorial.com](http://www.markdown-tutorial.com) 356 | - [www.markdowntutorial.com](http://www.markdowntutorial.com) 357 | 358 | ----- 359 | 360 | ### Rmd basics 361 | 362 | - YAML header: 363 | - title 364 | - author 365 | - date 366 | - output: `html_document`, `word_document`, `pdf_document` 367 | - Code Chunks: 368 | - syntax 369 | - chunk options 370 | - graphics 371 | - Math notation: 372 | - inline `$z^2 = x^2 + y^2$` 373 | - paragraph `$$z^2 = x^2 + y^2$$` 374 | 375 | Example of inline equation: $z^2 = x^2 + y^2$ 376 | 377 | Example of equation in its own paragraph: 378 | $$ 379 | z^2 = x^2 + y^2 380 | $$ 381 | 382 | RStudio has a basic tutorial about R Markdown files: 383 | [Rstudio markdown tutorial](http://rmarkdown.rstudio.com/) 384 | 385 | Rmd files are able to render math symbols and expressions written using LaTeX 386 | notation. There are dozens of online resources to learn about math notation and 387 | equations in LaTeX. 
Here's some documentation from [www.sharelatex.com/learn](https://www.sharelatex.com/learn/) 388 | 389 | - [Mathematical expressions](https://www.sharelatex.com/learn/Mathematical_expressions) 390 | - [Subscripts and superscripts](https://www.sharelatex.com/learn/Subscripts_and_superscripts) 391 | - [Brackets and Parentheses](https://www.sharelatex.com/learn/Brackets_and_Parentheses) 392 | - [Fractions and Binomials](https://www.sharelatex.com/learn/Fractions_and_Binomials) 393 | - [Integrals, sums and limits](https://www.sharelatex.com/learn/Integrals,_sums_and_limits) 394 | - [List of Greek letters and math symbols](https://www.sharelatex.com/learn/List_of_Greek_letters_and_math_symbols) 395 | - [Operators](https://www.sharelatex.com/learn/Operators) 396 | -------------------------------------------------------------------------------- /scripts/01-R-introduction.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/01-R-introduction.pdf -------------------------------------------------------------------------------- /scripts/02-data-variables.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data and Variables in R" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof. Gaston Sanchez" 5 | header-includes: \usepackage{float} 6 | output: html_document 7 | urlcolor: blue 8 | --- 9 | 10 | > ### Learning Objectives 11 | > 12 | > - Basics of vectors 13 | > - Variables (as vectors and factors) 14 | > - Quantitative variables as numeric vectors 15 | > - Qualitative variables (as factors) 16 | > - Manipulating vectors 17 | 18 | 19 | ```{r setup, include=FALSE} 20 | knitr::opts_chunk$set(echo = TRUE) 21 | ``` 22 | 23 | ## NBA Data 24 | 25 | In this Rmd script we'll consider some NBA data from the website 26 | _Basketball Reference_. 
More specifically, let's look at the Western Conference 27 | Standings (season 2015-2016) shown in the following screenshot: 28 | 29 | ```{r out.width='60%', echo = FALSE, fig.align='center'} 30 | knitr::include_graphics('images/western-conference-standings-2016.png') 31 | ``` 32 | 33 | source: [http://www.basketball-reference.com/leagues/NBA_2016.html#all_confs_standings_E](http://www.basketball-reference.com/leagues/NBA_2016.html#all_confs_standings_E) 34 | 35 | The above table contains 15 rows with 8 columns. The first column contains the 36 | names of the teams in the Western Conference, and the rest of the columns are: 37 | 38 | - _W_: wins 39 | - _L_: losses 40 | - _W/L%_: win-loss percentage 41 | - _GB_: games behind (the top team) 42 | - _PS/G_: points per game 43 | - _PA/G_: opponent points per game 44 | - _SRS_: simple rating system 45 | 46 | From the statistical standpoint, we say that the table has 8 variables measured 47 | (or observed) on 15 individuals. In this case the "individuals" are the basketball 48 | teams. 49 | 50 | 51 | ## Basics of vectors 52 | 53 | In order to use R as the computational tool in this course, you need to learn 54 | how to input data. Before describing how to read in tables in R (we'll cover 55 | that later), we must talk about vectors. 56 | 57 | Vectors are the most basic structure for storing data in R. Virtually all other 58 | data structures in R are based on or derived from vectors. Using a vector is also 59 | the most basic way to manually input data. 60 | 61 | You can create vectors in several ways. The most common option is with the 62 | function `c()` (combine). Simply pass a series of values separated by commas.
63 | Here is how to create a vector `wins` with the first five values from the column 64 | _W_ of the conference standings table: 65 | 66 | ```{r} 67 | wins = c(73, 67, 55, 53, 44) 68 | ``` 69 | 70 | Likewise, we can create a vector `losses` like this: 71 | 72 | ```{r} 73 | losses = c(9, 15, 27, 29, 38) 74 | ``` 75 | 76 | Having the vectors `wins` and `losses`, we can use them to create another 77 | vector `win_loss_perc` for the column _W/L%_ (win-loss percentage): 78 | 79 | ```{r} 80 | win_loss_perc = wins / (wins + losses) 81 | win_loss_perc 82 | ``` 83 | 84 | You can think of vectors as variables. The previous vectors `wins`, `losses`, 85 | and `win_loss_perc` are what are known as __quantitative__ variables. This 86 | means that each value in those variables (the numbers) reflects a quantity. 87 | 88 | Not all variables are quantitative. For instance, the first column of the table 89 | does not contain numbers but names. The name of a basketball team is referred 90 | to as a __qualitative__ variable. 91 | 92 | In R you can create a vector of names using a character vector. Again, we use 93 | the `c()` function and we pass it names surrounded by either single or double 94 | quotes. Here's how to create a vector `teams` with the names of the first five 95 | teams in the standings table: 96 | 97 | ```{r} 98 | teams = c('GSW', 'SAS', 'OCT', 'LAC', 'PTB') 99 | ``` 100 | 101 | The vector `teams` is referred to in R as a __character vector__ because it 102 | is formed by characters. 103 | 104 | 105 | ## Manipulating Vectors: Subsetting 106 | 107 | In addition to creating variables, you should also learn how to do some basic 108 | manipulation of vectors. The most common type of manipulation is called 109 | _subsetting_, which refers to extracting elements of a vector (or another R object). 110 | To do so, you use what is known as __bracket notation__. This means using 111 | (square) brackets `[ ]` to get access to the elements of a vector.
Inside the 112 | brackets you can specify one or more numeric values that correspond to the 113 | position(s) of the vector element(s): 114 | 115 | ```r 116 | # first element of 'wins' 117 | wins[1] 118 | 119 | # third element of 'losses' 120 | losses[3] 121 | 122 | # last element of teams 123 | teams[5] 124 | ``` 125 | 126 | Some common functions that you can use on vectors are: 127 | 128 | - `length()` gives the number of values 129 | - `sort()` sorts the values in increasing or decreasing ways 130 | - `rev()` reverses the values 131 | 132 | ```r 133 | length(teams) 134 | teams[length(teams)] 135 | sort(wins, decreasing = TRUE) 136 | rev(wins) 137 | ``` 138 | 139 | 140 | 141 | ### Subsetting with Logical Indices 142 | 143 | In addition to using numbers inside the brackets, you can also do 144 | _logical subsetting_. This type of subsetting involves using a __logical__ 145 | vector inside the brackets. A logical vector is a particular type of vector 146 | that takes the special values `TRUE` and `FALSE`, as well as `NA` 147 | (Not Available). 148 | 149 | This type of subsetting is very powerful because it allows you to 150 | extract elements based on some logical condition. 151 | Here are some examples of logical subsetting: 152 | 153 | ```r 154 | # wins of Golden State Warriors 155 | wins[teams == 'GSW'] 156 | 157 | # teams with wins > 40 158 | teams[wins > 40] 159 | 160 | # name of teams with losses between 10 and 29 161 | teams[losses >= 10 & losses <= 29] 162 | ``` 163 | 164 | 165 | ## Factors and Qualitative Variables 166 | 167 | As mentioned before, vectors are the most essential type of data structure 168 | in R. Related to vectors, there is another important data structure in R called 169 | __factor__. Factors are data structures exclusively designed to handle 170 | qualitative or categorical data. 
171 | 172 | The term _factor_ as used in R for handling categorical variables, comes from 173 | the terminology used in _Analysis of Variance_, commonly referred to as ANOVA. 174 | In this statistical method, a categorical variable is commonly referred to as 175 | _factor_ and its categories are known as _levels_. 176 | 177 | To create a factor you use the homonym function `factor()`, which takes a 178 | vector as input. The vector can be either numeric, character or logical. 179 | 180 | ```{r} 181 | # numeric vector 182 | num_vector <- c(1, 2, 3, 1, 2, 3, 2) 183 | 184 | # creating a factor from num_vector 185 | first_factor <- factor(num_vector) 186 | 187 | first_factor 188 | ``` 189 | 190 | You can take the `teams` vector and convert it as a factor: 191 | 192 | ```{r} 193 | teams = factor(teams) 194 | teams 195 | ``` 196 | 197 | 198 | 199 | ## Sequences 200 | 201 | It is very common to generate sequences of numbers. For that R provides: 202 | 203 | - the colon operator `":"` 204 | - sequence function `seq()` 205 | 206 | ```r 207 | # colon operator 208 | 1:5 209 | 1:10 210 | -3:7 211 | 10:1 212 | ``` 213 | 214 | ```r 215 | # sequence function 216 | seq(from = 1, to = 10) 217 | seq(from = 1, to = 10, by = 1) 218 | seq(from = 1, to = 10, by = 2) 219 | seq(from = -5, to = 5, by = 1) 220 | ``` 221 | 222 | 223 | ### Repeated Vectors 224 | 225 | There is a function `rep()`. It takes a vector as the main input, and then it 226 | optionally takes various arguments: `times`, `length.out`, and `each`. 
227 | 228 | ```{r} 229 | rep(1, times = 5) # repeat 1 five times 230 | rep(c(1, 2), times = 3) # repeat 1 2 three times 231 | rep(c(1, 2), each = 2) 232 | rep(c(1, 2), length.out = 5) 233 | ``` 234 | 235 | Here are some more complex examples: 236 | 237 | ```r 238 | rep(c(3, 2, 1), times = 3, each = 2) 239 | ``` 240 | 241 | 242 | ## From vectors to data frames 243 | 244 | Now that we've seen how to create some vectors and do some basic manipulation, 245 | we can describe how to combine them in a table in R. The standard tabular 246 | structure in R is a __data frame__. To manually create a data frame you use 247 | the function `data.frame()` and you pass it one or more vectors. Here's how 248 | to create a small data frame `dat` with the vectors `teams`, `wins`, `losses`, 249 | and `win_loss_perc`: 250 | 251 | ```{r} 252 | dat = data.frame( 253 | Teams = teams, 254 | Wins = wins, 255 | Losses = losses, 256 | WLperc = win_loss_perc 257 | ) 258 | 259 | dat 260 | ``` 261 | 262 | Manipulating data frames is more complex than manipulating vectors. However, 263 | manipulating the column of a data frame is essentially the same as manipulating 264 | a vector. 265 | 266 | There are a couple of ways to "select" a column of a data frame. One option 267 | consists of using the dollar `$` operator. This involves typing the name of 268 | the data frame, followed by the `$`, followed by the name of the column. 
269 | For instance, to extract the values in column `Teams` simply type: 270 | 271 | ```{r} 272 | dat$Teams 273 | ``` 274 | 275 | Moreover, you can use bracket notation on the extracted column like with any 276 | type of vector: 277 | 278 | ```{r} 279 | dat$Wins[1] 280 | dat$Wins[5] 281 | ``` 282 | 283 | Likewise, you can do logical subsetting: 284 | 285 | ```r 286 | # wins of Golden State Warriors 287 | dat$Wins[dat$Teams == 'GSW'] 288 | 289 | # teams with wins > 40 290 | dat$Teams[dat$Wins > 40] 291 | 292 | # name of teams with losses between 10 and 29 293 | dat$Teams[dat$Losses >= 10 & dat$Losses <= 29] 294 | ``` 295 | 296 | 297 | ## Your Turn 298 | 299 | Refer to the table of Western Conference Standings shown at the beginning of 300 | this document. Your mission consists of creating a data frame `standings`. 301 | In order to create such data frame, you will have to first create the following 302 | eight vectors: 303 | 304 | - `teams` 305 | - `wins` 306 | - `losses` 307 | - `win_loss_perc` 308 | - `games_behind` 309 | - `points_scored` 310 | - `points_against` 311 | - `rating` 312 | 313 | You can create the vector `games_behind` by taking the won games of Golden 314 | State Warriors and subtracting the wins of the rest of the teams, that is: 315 | 316 | ```r 317 | wins[1] - wins 318 | ``` 319 | 320 | Once you have the previous listed vectors, use the function `data.frame()` 321 | to build `standings`. 322 | 323 | Select the _Points Scored_ from the table `standings` and sort it both in 324 | increasing as well as decreasing order. 
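As a hint for the last task, here is a minimal sketch of sorting one column of a data frame in both directions. It uses a reduced version of the five-team data frame `dat` from earlier in this script, not the full `standings` table, and the names `wins_increasing` and `wins_decreasing` are only illustrative.

```r
# re-create part of the five-team example so the snippet is self-contained
dat <- data.frame(
  Teams = c('GSW', 'SAS', 'OCT', 'LAC', 'PTB'),
  Wins = c(73, 67, 55, 53, 44)
)

# increasing order (the default behavior of sort)
wins_increasing <- sort(dat$Wins)

# decreasing order
wins_decreasing <- sort(dat$Wins, decreasing = TRUE)

wins_increasing
wins_decreasing
```

The same two calls should work on any numeric column of `standings` once you have built it.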
325 | 326 | -------------------------------------------------------------------------------- /scripts/02-data-variables.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/02-data-variables.pdf -------------------------------------------------------------------------------- /scripts/03-histograms.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/03-histograms.pdf -------------------------------------------------------------------------------- /scripts/04-measures-center.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Measures of Center" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof. Gaston Sanchez" 5 | output: html_document 6 | urlcolor: blue 7 | --- 8 | 9 | > ### Learning Objectives 10 | > 11 | > - Compute the average 12 | > - Become familiar with the function `mean()` 13 | > - Interpret the average as the balancing point 14 | 15 | ```{r setup, include=FALSE} 16 | knitr::opts_chunk$set(echo = TRUE) 17 | ``` 18 | 19 | As we mentioned in the previous script, the first part of the course has to do 20 | with __Descriptive Statistics__. The main idea is to make a "large" or 21 | "complicated" dataset more compact and easier to understand by using three 22 | major tools: 23 | 24 | - summary and frequency tables 25 | - charts and graphics 26 | - key numeric summaries 27 | 28 | In this script we will focus on various numeric summaries that are typically 29 | used to condense information of quantitative variables. 30 | 31 | One common way to classify numeric summaries is in 1) measures of center, and 32 | 2) measures of spread or variability. 
33 | The idea of both types of measures is to obtain one or more numeric values that 34 | reflect a "central" value, and the amount of "spread". 35 | 36 | - Measures of Center 37 | + average or mean 38 | + median 39 | - Measures of Spread 40 | + range 41 | + interquartile range 42 | + standard deviation (and variance) 43 | 44 | 45 | 46 | ## The Average 47 | 48 | Perhaps the most common type of measure of center is the average or mean. 49 | Consider a list of numbers formed by: 0, 1, 2, 3, 5, and 7. The average is 50 | calculated as the sum of all values divided by the number of values: 51 | 52 | $$ 53 | average = \frac{0 + 1 + 2 + 3 + 5 + 7}{6} = 3 54 | $$ 55 | 56 | You can use R to compute the previous average: 57 | 58 | ```{r} 59 | (0 + 1 + 2 + 3 + 5 + 7) / 6 60 | ``` 61 | 62 | Algebraically, we typically denote a set of values by $x_1, x_2, \dots, x_n$, 63 | in which the index $n$ represents the total number of values. Using this 64 | notation, the formula of the average is expressed as: 65 | 66 | $$ 67 | average = \frac{x_1 + x_2 + \dots + x_n}{n} 68 | $$ 69 | 70 | Using summation notation, the average can be compactly expressed as: 71 | 72 | $$ 73 | average = \sum_{i=1}^{n} \frac{x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i 74 | $$ 75 | 76 | Summation notation uses the uppercase Greek letter $\Sigma$ (sigma) 77 | as an abbreviation for the phrase "the sum of". So, in place of 78 | $x_1 + x_2 + \dots + x_n$, we can write $\sum_{i=1}^{n} x_i$ and read it as "the sum of the 79 | observations of the variable $x$."
80 | 81 | In R, you can create a vector `x` to store the previous numbers: 82 | 83 | ```{r} 84 | x = c(0, 1, 2, 3, 5, 7) 85 | ``` 86 | 87 | Then, you can use the function `sum()` to add all the values in `x`, and 88 | compute the average as: 89 | 90 | ```{r} 91 | sum(x) / length(x) 92 | ``` 93 | 94 | An alternative way to compute the average in R is using the `mean()` function: 95 | 96 | ```{r} 97 | mean(x) 98 | ``` 99 | 100 | 101 | ### The Average is the balancing point 102 | 103 | Usually, the average of a set of $n$ values $x_1, x_2, \dots, x_n$ is expressed as 104 | $\bar{x}$ (pronounced _x-bar_). 105 | 106 | To understand how the average is a type of central or mid-value, we need to 107 | talk about __deviations__. A deviation is the difference between an observed 108 | value $x_i$ and another value of reference $ref$, that is, $(x_i - ref)$. 109 | 110 | Taking the average value $\bar{x}$ as a reference value, we can calculate the 111 | deviations of all observations from the average: $(x_i - \bar{x})$ 112 | 113 | Given a reference value $ref$, we can also compute the sum of all deviations 114 | around such value: 115 | 116 | $$ 117 | \sum_{i=1}^{n} (x_i - ref) 118 | $$ 119 | 120 | It turns out that the average is the ONLY reference value such that the sum 121 | of deviations around it becomes zero: 122 | 123 | $$ 124 | \sum_{i=1}^{n} (x_i - \bar{x}) = 0 125 | $$ 126 | 127 | Let's verify that in R 128 | 129 | ```{r} 130 | avg = mean(x) 131 | deviations = x - avg 132 | deviations 133 | ``` 134 | 135 | The sum of the deviations around the mean should be zero: 136 | 137 | ```{r} 138 | sum(deviations) 139 | ``` 140 | 141 | This is the reason why we say that the average is one type of center or mid-value. 142 | In simpler terms, you can think of the average as the balance point of a 143 | distribution. The average is that point that cancels out the sum of deviations 144 | around it. 
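To see why the average is the only such balancing point, it can help to try a few other reference values. The following is a minimal sketch using the same vector `x`; the candidate values in `refs` and the name `sum_dev` are arbitrary choices made for illustration.

```r
x <- c(0, 1, 2, 3, 5, 7)

# candidate reference values; 3 is the average of x
refs <- c(1, 2, 3, 4, 5)

# sum of the deviations of x around each reference value
sum_dev <- sapply(refs, function(ref) sum(x - ref))

data.frame(ref = refs, sum_of_deviations = sum_dev)
```

Only the row where `ref` equals `mean(x)` has a sum of zero. This follows from the algebra: the sum of deviations around $ref$ equals $\sum_{i=1}^{n} x_i - n \times ref$, which is zero exactly when $ref = \bar{x}$.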
145 | 146 | 147 | 148 | ### Your turn 149 | 150 | We know that the average of `x` is `r mean(x)`. What happens to this average if: 151 | 152 | - you add a constant $b$ to all values in `x`? 153 | - you multiply the values in `x` by a constant $a$? 154 | 155 | For instance, let's add 2 to all values in `x`: 156 | 157 | ```{r} 158 | mean(x + 2) 159 | ``` 160 | 161 | Now, let's multiply all values in `x` by 2: 162 | 163 | ```{r} 164 | mean(x * 2) 165 | ``` 166 | 167 | Spend some time in R to examine what happens to the average of $x + k$ and 168 | $k \times x$ with several choices of $k$, e.g. -2, 5, 100. 169 | 170 | Now, let's see what happens to the average when you multiply all values in `x` 171 | by some constant $a$, and then add a constant $b$: 172 | 173 | ```{r} 174 | mean(x) 175 | a = 2 176 | b = 3 177 | mean(a*x + b) 178 | ``` 179 | 180 | Again, spend some time in R trying different values for `a` and `b`. 181 | What's your conclusion? 182 | 183 | 184 | ## The Median 185 | 186 | Another common type of measure of center is the __median__. The median is the 187 | literal middle value of an ordered distribution. By _middle value_ we mean that 188 | half of the observations are below the median, and the other half of the observations 189 | are above it. 190 | 191 | The easiest way to calculate the median in R is with the function of the same name, 192 | `median()`. Consider again the numbers in the vector `x`; the median of 193 | this set of values is: 194 | 195 | ```{r} 196 | x = c(0, 1, 2, 3, 5, 7) 197 | 198 | median(x) 199 | ``` 200 | 201 | 202 | How the median is obtained depends on the number of values. If you have a variable with an 203 | even number of values, then the median is the average of the two middle values. 204 | If you have a variable with an odd number of values, then the median is the 205 | middle value.
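
To see the even/odd rule in action, here is a minimal sketch that finds the median by hand: sort the values, then take the middle one (or average the two middle ones). The helper name `median_by_hand` and the vector `w` are made up for this illustration; in practice you would simply call `median()`.

```{r}
# hypothetical helper, written only to illustrate the rule
median_by_hand = function(v) {
  s = sort(v)
  n = length(s)
  if (n %% 2 == 1) {
    # odd number of values: the single middle value
    s[(n + 1) / 2]
  } else {
    # even number of values: average of the two middle values
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

# six values (even)
x = c(0, 1, 2, 3, 5, 7)
median_by_hand(x)
median(x)

# five values (odd)
w = c(0, 1, 2, 3, 5)
median_by_hand(w)
median(w)
```

Both approaches should agree for each vector.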
206 | 207 | 208 | 209 | ## More numeric summaries 210 | 211 | Another interesting function in R that you can use to obtain descriptive 212 | information about a variable is `summary()`. When you use this function on a 213 | numeric vector (i.e. quantitative variable), the returned output includes: 214 | 215 | - `Min.`: minimum 216 | - `1st Qu.`: first quartile 217 | - `Median`: median 218 | - `Mean`: average 219 | - `3rd Qu.`: third quartile 220 | - `Max.`: maximum 221 | 222 | ```{r} 223 | x = c(0, 1, 2, 3, 5, 7) 224 | 225 | summary(x) 226 | ``` 227 | 228 | 229 | ## Average -vs- Median 230 | 231 | Consider a new vector `x` that contains 25 numbers: five 1's, five 2's, five 3's, 232 | five 4's, and five 5's: 233 | 234 | ```{r} 235 | x = rep(1:5, each = 5) 236 | ``` 237 | 238 | As you can tell, all values in `x` occur with the same frequency. And if you 239 | make a histogram, R will plot all bars with the same height: 240 | 241 | ```{r out.width='50%', fig.align='center'} 242 | hist(x, breaks = c(0, 1, 2, 3, 4, 5), las = 1, col = 'gray80') 243 | ``` 244 | 245 | In this data, the average and the median are the same. In fact, this happens 246 | whenever you have a perfectly symmetric distribution: 247 | 248 | ```{r} 249 | mean(x) 250 | median(x) 251 | ``` 252 | 253 | 254 | Now let's add one more observation to `x` with a value of 10, and obtain the 255 | average and the median: 256 | 257 | ```{r} 258 | y = c(x, 10) 259 | mean(y) 260 | median(y) 261 | ``` 262 | 263 | Note that the average increased from `r mean(x)` to `r round(mean(y), 2)`, 264 | while the median remained unchanged. 265 | 266 | Let's make it more extreme and instead of adding a value of 10 let's add a 267 | value of 100 to `x`.
The average and the median are: 268 | 269 | ```{r} 270 | z = c(x, 100) 271 | mean(z) 272 | median(z) 273 | ``` 274 | 275 | You can look at the distributions of `x`, `y`, and `z` using the default plots 276 | produced by `hist()`: 277 | 278 | ```{r out.width='90%', fig.height=3, echo = FALSE, fig.align='center'} 279 | op = par(mfrow = c(1, 3)) 280 | hist(x, las = 1, col = 'gray80') 281 | hist(y, las = 1, col = 'gray80') 282 | hist(z, las = 1, col = 'gray80') 283 | par(op) 284 | ``` 285 | 286 | This is a toy example that illustrates one difference between the median and the 287 | average. The median is resistant (or robust) to extreme values, 288 | but the average is not. Extremely small or large values pull the average of a distribution toward them. 289 | 290 | 291 | ### Example 292 | 293 | Here's one more example that shows you how to use R to solve a typical textbook 294 | exercise. The average and median of the first 99 values of a data set of 198 295 | values are all equal to 120. If the average and median of the final 99 values 296 | are all equal to 100, what can you say about the average of the entire data set? 297 | What can you say about the median? 298 | 299 | You can solve this type of question analytically, or you can use R. Here's how. 300 | The problem deals with a data set of 198 values formed by two 301 | sets of numbers: the first 99 values are all equal to 120, the final 99 values 302 | are all equal to 100. You can create two R vectors to build the two sets of 303 | 99 values.
This is achieved with the function `rep()` that allows 304 | you to __repeat__ one or more numeric values given a number of times: 305 | 306 | ```{r} 307 | # first 99 values equal to 120 308 | first_values = rep(120, times = 99) 309 | 310 | # final 99 values equal to 100 311 | final_values = rep(100, times = 99) 312 | 313 | # all values 314 | all_values = c(first_values, final_values) 315 | ``` 316 | 317 | Having defined `first_values` and `final_values`, we build the entire list of 318 | 198 values by combining them in the vector `all_values`. The next step involves 319 | finding the average and the median: 320 | 321 | ```{r} 322 | # average 323 | mean(all_values) 324 | 325 | # median 326 | median(all_values) 327 | ``` 328 | 329 | 330 | 331 | -------------------------------------------------------------------------------- /scripts/04-measures-center.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/04-measures-center.pdf -------------------------------------------------------------------------------- /scripts/05-measures-spread.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Measures of Spread" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof. Gaston Sanchez" 5 | output: html_document 6 | urlcolor: blue 7 | --- 8 | 9 | > ### Learning Objectives 10 | > 11 | > - Becoming familiar with various measures of spread 12 | > - Intro to the functions `range()`, `IQR()`, and `sd()` 13 | > - Understand the concept of r.m.s. size of a list of numbers 14 | > - Be aware of the difference between SD and SD+ 15 | 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = TRUE) 19 | ``` 20 | 21 | ## Introduction 22 | 23 | Quantitative variables can be summarized using two groups of measures: 24 | 1) center, and 2) spread. 
Just like there are various measures of center 25 | (e.g. average, median, mode), we also have several measures of spread or 26 | variability: 27 | 28 | - range 29 | - interquartile range 30 | - standard deviation (and variance) 31 | 32 | 33 | ## Range 34 | 35 | The most basic type of measure of spread is the __range__. The range is obtained 36 | as the difference between the maximum value and the minimum value. 37 | 38 | For example, let's consider the values 0, 5, -8, 7, and -3 used in the textbook 39 | (page 66). To find the range, you need to determine the largest and smallest 40 | values, which in this case are 7 and -8, respectively, and then obtain the 41 | difference: 42 | 43 | $$ 44 | range = 7 - (-8) = 15 45 | $$ 46 | 47 | For illustration purposes, let's implement this minimalist example in R. First 48 | we create a vector `x` with the five values. You can use the functions `max()` 49 | and `min()` to get the largest and smallest values in `x`: 50 | 51 | ```{r} 52 | x = c(0, 5, -8, 7, -3) 53 | maximum = max(x) 54 | minimum = min(x) 55 | 56 | # range 57 | maximum - minimum 58 | ``` 59 | 60 | Actually, there is a `range()` function in R, which gives you the minimum and 61 | maximum values (but not their difference): 62 | 63 | ```{r} 64 | # range: min value, and max value 65 | range(x) 66 | ``` 67 | 68 | The range is one type of measure of variability. It tells you the _length_ 69 | of the scatter in the data. The issue with the range is that extreme values 70 | may have a considerable effect on it. For example, if you add a value of 20 to 71 | `x` the new range becomes: 72 | 73 | ```{r} 74 | y = c(x, 20) 75 | 76 | # range 77 | max(y) - min(y) 78 | ``` 79 | 80 | The presence of outliers will affect the magnitude of the range. 81 | 82 | 83 | ## Interquartile Range (IQR) 84 | 85 | To overcome the limitations of the range we can use a different type of range 86 | called the __interquartile range__ or _IQR_.
This is a range based not on the 87 | minimum and maximum values but on the first and third quartiles. 88 | 89 | One way to compute quartiles in R is with the function `quantile()`. There are 90 | slightly different formulas to compute quartiles. To find the quartiles as 91 | discussed in most introductory statistics books, you need to use the argument 92 | `type = 2` inside the `quantile()` function: 93 | 94 | ```{r} 95 | x = c(0, 5, -8, 7, -3) 96 | 97 | # 1st quartile 98 | Q1 = quantile(x, probs = 0.25, type = 2) 99 | 100 | # 3rd quartile 101 | Q3 = quantile(x, probs = 0.75, type = 2) 102 | 103 | # IQR 104 | Q3 - Q1 105 | ``` 106 | 107 | You can also use the dedicated function `IQR()` to compute the interquartile 108 | range: 109 | 110 | ```{r} 111 | IQR(x, type = 2) 112 | ``` 113 | 114 | Compared to the classic range, the IQR is more resistant to outliers because 115 | it does not consider the entire set of values, just those between the first 116 | and third quartile. If we add an extreme negative value of -50, and an extreme 117 | positive value of 40 to `x`, the IQR should be affected much less than the range: 118 | 119 | ```{r} 120 | y = c(x, -50, 40) 121 | 122 | # IQR 123 | IQR(y, type = 2) 124 | ``` 125 | 126 | 127 | ## The Root Mean Square (RMS) 128 | 129 | Another measure of spread is the Standard Deviation (SD). However, in order to 130 | talk about the SD, I will follow the same approach as the FPP book and I will first talk about the __Root Mean Square__ or RMS. 131 | 132 | The values in our toy example are 0, 5, -8, 7, and -3. To find a central value 133 | we can use either the average or the median: 134 | 135 | ```{r results='hide'} 136 | x = c(0, 5, -8, 7, -3) 137 | 138 | mean(x) 139 | median(x) 140 | ``` 141 | 142 | What about a measure of _size_? In other words, how would you find a measure 143 | of how small or how big the values in `x` are?
Is it possible to obtain a 144 | quantity that tells you something about the representative _magnitude_ of values 145 | in `x`? 146 | 147 | To answer this question about a typical magnitude of values we need to ignore 148 | the signs. One way to do this is by looking at the absolute values, and then 149 | compute the average: 150 | 151 | ```{r} 152 | abs(x) 153 | 154 | mean(abs(x)) 155 | ``` 156 | 157 | For convenience reasons (e.g. algebraic manipulation and nice mathematical properties), statisticians prefer to square the values instead of using the 158 | absolute values. And then compute the average of such squares: 159 | 160 | ```{r} 161 | # square value 162 | x^2 163 | 164 | # average of square values 165 | sum(x^2) / length(x) 166 | ``` 167 | 168 | The issue with using square values is that now you end up working with 169 | square units, and with a larger number that has little to do with a typical 170 | magnitude of the original values. To tackle this problem, we take the square 171 | root: 172 | 173 | ```{r} 174 | # root-mean-square (r.m.s) 175 | sqrt(sum(x^2) / length(x)) 176 | 177 | # equivalent to 178 | sqrt(mean(x^2)) 179 | ``` 180 | 181 | The value `r round(sqrt(mean(x^2)), 2)` is referred to as the _r.m.s. size_ of 182 | the numbers in `x`. The RMS size provides a numeric summary for the magnitude 183 | of the data. It is not really the average magnitude, but you can think of it 184 | as such. 185 | 186 | 187 | ## Standard Deviation (SD) 188 | 189 | Now that we have introduced the concept of r.m.s. size of a list of numbers, 190 | we can talk about a third measure of spread known as the 191 | __Standard Deviation__ (SD). Simply put, the Standard Deviation is a measure 192 | of spread that quantifies the amount of variation around the average. 193 | 194 | A keyword is the term __deviation__. In the previous script tutorial---about measures of center---I introduced the concept of _deviations_. 
195 | If we denote a set of $n$ values with $x_1, x_2, \dots, x_n$, and a reference 196 | value by $ref$, a deviation is the difference between an observed 197 | value $x_i$ and the value of reference $ref$, that is, $(x_i - ref)$. 198 | 199 | A special type of deviation is when the reference value becomes the average. 200 | If $avg$ represents the average of $x_1, x_2, \dots, x_n$, we can calculate the 201 | deviations of all observations from the average: $(x_i - avg)$. 202 | 203 | The Standard Deviation is based on these deviations. To be more precise, it 204 | is based on the R.M.S. size of deviations from the average: 205 | 206 | $$ 207 | SD = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - avg)^2 } 208 | $$ 209 | 210 | The SD says how far away numbers $x_1, x_2, \dots, x_n$ are from their average. 211 | In this sense, you can think of the SD as the typical magnitude of scatter 212 | around the average. 213 | 214 | 215 | ### The `sd()` function 216 | 217 | All statistical packages come with a function that allows you to calculate 218 | the Standard Deviation. In R, there is the function `sd()`. However, the way 219 | `sd()` works is by using a slightly different formula: 220 | 221 | $$ 222 | SD^{+} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - avg)^2 } 223 | $$ 224 | 225 | Note that $SD^{+}$ divides by $n-1$ instead of $n$. When the number of values 226 | $n$ is big, $\sqrt{n-1}$ is very close to $\sqrt{n}$. However, for relatively 227 | small values of $n$, the difference between $\sqrt{n-1}$ and $\sqrt{n}$ can 228 | be considerable.
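
To get a feel for how quickly the two formulas converge, here is a small sketch that computes the ratio $SD / SD^{+} = \sqrt{(n-1)/n}$, which follows from comparing the two formulas above, for a few values of $n$. The sample sizes in `sizes` are arbitrary choices made for illustration.

```{r}
# a few sample sizes, from very small to large
sizes = c(2, 5, 10, 100, 1000)

# ratio SD / SD+ for each sample size
ratio = sqrt((sizes - 1) / sizes)
round(ratio, 3)
```

The ratio is noticeably below 1 for small $n$ and approaches 1 as $n$ grows, which is why the distinction matters mostly for short lists of numbers.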
229 | 230 | If you want to use `sd()` to obtain $SD$, you need to multiply the output by a 231 | correction factor of $\sqrt{\frac{n-1}{n}}$: 232 | 233 | ```{r} 234 | x = c(0, 5, -8, 7, -3) 235 | n = length(x) 236 | 237 | # SD 238 | sqrt((n-1)/n) * sd(x) 239 | 240 | # SD+ 241 | sd(x) 242 | ``` 243 | 244 | 245 | -------------------------------------------------------------------------------- /scripts/05-measures-spread.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/05-measures-spread.pdf -------------------------------------------------------------------------------- /scripts/06-normal-curve.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "The Normal Curve" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof. Gaston Sanchez" 5 | output: html_document 6 | urlcolor: blue 7 | --- 8 | 9 | > ### Learning Objectives 10 | > 11 | > - Becoming familiar with the normal curve 12 | > - Intro to the functions `dnorm()`, `pnorm()`, and `qnorm()` 13 | > - How to find areas under the normal curve using R 14 | > - Converting values to standard units 15 | 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = TRUE) 19 | ``` 20 | 21 | 22 | ## Introduction 23 | 24 | Let's look at the distributions of some variables in the data of NBA players: 25 | 26 | ```{r} 27 | # assembling the URL of the CSV file 28 | repo = 'https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/' 29 | datafile = 'master/data/nba_players.csv' 30 | url = paste0(repo, datafile) 31 | # read in data set 32 | nba = read.csv(url) 33 | ``` 34 | 35 | More specifically, let's take a peek at the histograms of the variables `height`, 36 | `weight`, `age`, and `points2_percent`: 37 | 38 | ```{r echo = FALSE, out.width='95%', fig.align='center'} 39 | variables = c('height', 'weight',
'age', 'points2_percent') 40 | op = par(mfrow = c(2, 2)) 41 | for (i in variables) { 42 | hist(nba[ ,i], xlab = i, 43 | col = 'gray80', las = 1, 44 | main = paste('Histogram of', i)) 45 | } 46 | par(op) 47 | ``` 48 | 49 | - `height` seems to have a slightly left skewed distribution. 50 | - `weight` looks roughly symmetric. 51 | - `age` has a right skewed distribution. 52 | - `points2_percent` appears to be fairly symmetric. 53 | 54 | These distributions are examples of some of the possible patterns that 55 | you will find when describing data in real life. If you are lucky, you may 56 | even get to see a perfect symmetric distribution one day. 57 | 58 | Among the wide range of distribution shapes that we encounter when looking 59 | at data, one special pattern has received most of the attention: the so-called 60 | _symmetric bell-shaped_ or mound-shaped distribution, like that of 61 | `points2_percent` and `weight`. It is true that these two histograms are far 62 | from perfect symmetry, but we can put them within the _fairly_ bell-shaped 63 | category. 64 | 65 | 66 | ## Normal Curve 67 | 68 | It turns out that there is one mathematical function that fits (density) 69 | histograms having a symmetric bell-shaped pattern: the famous __normal curve__ 70 | given by the following equation 71 | 72 | $$ 73 | y = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} x^2} 74 | $$ 75 | 76 | This equation, also known as the Laplace-Gaussian curve, was first discovered by 77 | Abraham de Moivre (circa 1720) while working on the first problems about 78 | probability. However, his work around the normal equation went unnoticed for 79 | many years. By the time historians realized he had been the first person to 80 | come up with the normal equation, most people had attributed authorship to 81 | either French scholar Pierre-Simon Laplace and/or German mathematician 82 | Carl Friedrich Gauss. 
83 | 84 | In the past, before the 1880s, the curve was referred to as the _Error curve_, 85 | because of its application around the errors from measurements in astronomy. 86 | The name _normal_ appeared around the late 1870s and early 1880s, where British 87 | biometricians like Francis Galton, and later on his disciple Karl Pearson, 88 | together with Ronald Fisher, popularized the word _normal_. Galton never 89 | explained why he used the term "normal" although it seems that he was implying 90 | the sense of conforming to a norm (i.e. a standard, model, pattern, type). 91 | 92 | 93 | ### Plotting the Normal Curve in R 94 | 95 | You can use R to obtain a graph of the normal curve. One approach is to generate 96 | values for the x-axis, and then use the equation of the normal curve to obtain 97 | values for the y-axis: 98 | 99 | ```{r out.width='60%', fig.align='center'} 100 | x = seq(from = -3, to = 3, by = 0.01) 101 | y = (1/sqrt(2 * pi)) * exp(-(x^2)/2) 102 | 103 | plot(x, y, type = "l", lwd = 3, col = "blue") 104 | ``` 105 | 106 | First we generate a vector `x` with some values for the x-axis ranging from 107 | -3 to 3. Then we use `x` to find the heights of the `y` variable. Finally, 108 | we use the values in `x` and `y` as coordinates of the `plot()`. The argument 109 | `type = 'l'` is used to graph a line instead of dots. The argument `lwd` allows 110 | you to define the width of the line. And `col` lets you define a color. 111 | 112 | 113 | ## Normal Distribution Functions 114 | 115 | Instead of working with the equation `y = (1/sqrt(2 * pi)) * exp(-(x^2)/2)`, 116 | R has a family of four functions dedicated to the normal curve: 117 | 118 | - `dnorm()` density function 119 | - `pnorm()` distribution function 120 | - `qnorm()` quantile function 121 | - `rnorm()` random number generator function 122 | 123 | 124 | ### Heights of the curve with `dnorm()` 125 | 126 | The function `dnorm()` is the __density__ function. 
This is actually the function 127 | that lets you find the height of the curve (i.e. $y$ values). Instead of 128 | manually coding the normal equation, you can use `dnorm()` and get the 129 | previously obtained graph like this: 130 | 131 | ```{r out.width='60%', fig.align='center'} 132 | x = seq(from = -3, to = 3, by = 0.01) 133 | y = dnorm(x) 134 | 135 | plot(x, y, type = "l", lwd = 3, col = "blue") 136 | ``` 137 | 138 | 139 | ### Areas under the curve with `pnorm()` 140 | 141 | The function `pnorm()` is the distribution function. By default, `pnorm()` 142 | returns the area under the curve to the __left__ of a specified `x` value. For 143 | instance, the area to the left of 0 is 0.5 or 50%: 144 | 145 | ```{r} 146 | pnorm(0) 147 | ``` 148 | 149 | Try `pnorm()` with these values: 150 | 151 | ```{r eval = FALSE} 152 | pnorm(-2) 153 | pnorm(-1) 154 | pnorm(1) 155 | pnorm(2) 156 | ``` 157 | 158 | You can also use `pnorm()` to find areas under the normal curve to the __right__ 159 | of a specific `x` value. This is done by using the argument `lower.tail = FALSE`: 160 | 161 | ```{r} 162 | # area to the right of 1 163 | pnorm(1, lower.tail = FALSE) 164 | ``` 165 | 166 | Try finding the areas to the right of: 167 | 168 | ```{r eval = FALSE} 169 | pnorm(-2.5, lower.tail = FALSE) 170 | pnorm(-2, lower.tail = FALSE) 171 | pnorm(0.5, lower.tail = FALSE) 172 | pnorm(1.5, lower.tail = FALSE) 173 | ``` 174 | 175 | 176 | Sometimes you need to find areas in between two $z$ values. For instance, the 177 | area between -1 and 1 (which is about 68%). Finding this type of area involves 178 | subtracting the smaller area to the left of -1 from the larger area to the 179 | left of 1: 180 | 181 | ```{r} 182 | # area between -1 and 1 183 | pnorm(1) - pnorm(-1) 184 | ``` 185 | 186 | What about the area between -2 and 2?
187 | 188 | ```{r} 189 | # area between -2 and 2 190 | pnorm(2) - pnorm(-2) 191 | ``` 192 | 193 | 194 | 195 | ### Z values of a given area with `qnorm()` 196 | 197 | The function `qnorm()` is the quantile function. You can think of this function 198 | as the inverse of `pnorm()`. That is, for a given area under the curve, use 199 | `qnorm()` to find the corresponding `z` value (i.e. value on the x-axis): 200 | 201 | ```{r} 202 | # z-value such that the area to its left is 0.5 203 | qnorm(0.5) 204 | 205 | # z-value such that the area to its left is 0.3 206 | qnorm(0.3) 207 | ``` 208 | 209 | Likewise, you can use the argument `lower.tail = FALSE` to find values given 210 | a right-tail area: 211 | 212 | ```{r} 213 | # z-value such that the area to its right is 0.5 214 | qnorm(0.5, lower.tail = FALSE) 215 | 216 | # z-value such that the area to its right is 0.3 217 | qnorm(0.3, lower.tail = FALSE) 218 | ``` 219 | 220 | 221 | 222 | ## Standard Units 223 | 224 | In real life, most variables will be measured in some scale: `height` measured 225 | in inches, `weight` measured in pounds, `age` measured in years, 226 | `points2_percent` measured in percentage. To be able to use the normal curve 227 | as an approximation for symmetric bell-shaped distributions, you will need 228 | to convert the original units into __standard units__ (SU). 229 | 230 | Recall that the conversion formula from $x$ to standard units is: 231 | 232 | $$ 233 | SU = \frac{x - avg}{SD} 234 | $$ 235 | 236 | Let's see how you could convert `weight` values to SU using R.
First we need 237 | to find the average and standard deviation of `weight`: 238 | 239 | ```{r} 240 | # average weight 241 | avg_weight = mean(nba$weight) 242 | avg_weight 243 | 244 | # SD weight 245 | # (remember to use correction factor) 246 | n = nrow(nba) 247 | sd_weight = sqrt((n-1)/n) * sd(nba$weight) 248 | sd_weight 249 | ``` 250 | 251 | To convert the weights of the players to standard units, subtract the average 252 | and then divide by the SD: 253 | 254 | ```{r} 255 | # weight in SU 256 | su_weight = (nba$weight - avg_weight) / sd_weight 257 | 258 | # weights in SU of first 5 players 259 | su_weight[1:5] 260 | ``` 261 | 262 | 263 | What does the histogram for `su_weight` look like? 264 | 265 | ```{r out.width='60%', fig.align='center'} 266 | # density histogram 267 | hist(su_weight, las = 1, col = 'gray80', probability = TRUE, 268 | ylim = c(0, 0.5), xlim = c(-3.5, 3.5), 269 | main = 'Histogram of Weight in SU', xlab = 'standard units') 270 | ``` 271 | 272 | 273 | An alternative picture of the distribution of `su_weight` can be obtained by 274 | plotting a kernel density curve: 275 | 276 | ```{r out.width='60%', fig.align='center'} 277 | dens_weight = density(su_weight) 278 | plot(dens_weight, axes = FALSE, ylim = c(0, 0.5), xlim = c(-3.5, 3.5), 279 | main = 'Density Curve', xlab = 'standard units', lwd = 2, col = 'blue') 280 | # x-axis 281 | axis(side = 1) 282 | # y-axis 283 | axis(side = 2, las = 1) 284 | ``` 285 | 286 | Looking at both the histogram and the kernel density curve, the shape of the 287 | distribution is symmetric but it does not have a peak around zero. 288 | You can say that it has more of a plateau or flat peak. 289 | 290 | 291 | ## Using Normal Approximation 292 | 293 | Although `weight` does not have the central peak, we can try to see how well 294 | the normal curve approximates its distribution. From the attributes of the 295 | normal curve, we know that 50% of players should have a weight below 296 | `avg_weight`.
We can directly check the proportion of players below 297 | `avg_weight`: 298 | 299 | ```{r} 300 | # proportion of players below average weight 301 | sum(nba$weight <= avg_weight) / n 302 | ``` 303 | 304 | This is consistent with `weight` (and `su_weight`) having a symmetric shape. 305 | 306 | From the empirical 68-95-99.7 rule, we know that about 68% of players should 307 | have weights within `r round(avg_weight, 2)` plus-minus 308 | `r round(sd_weight, 2)`, that is, between `r round(avg_weight - sd_weight, 2)` 309 | and `r round(avg_weight + sd_weight, 2)`: 310 | 311 | ```{r} 312 | weight_minus = avg_weight - sd_weight 313 | weight_plus = avg_weight + sd_weight 314 | 315 | # proportion of players within 1 SD from average weight 316 | sum(nba$weight <= weight_plus & nba$weight >= weight_minus) / n 317 | ``` 318 | 319 | As you can tell, the proportion of players within 1 SD of the average is not 68% but 65%. 320 | However, the difference of 3 percentage points is not that big. 321 | 322 | 323 | ### Assuming Normality ... 324 | 325 | Let's pretend that `weight` does have a symmetric bell-shaped distribution, 326 | and that we are interested in finding the proportion of players with weights 327 | below 200 pounds. 328 | 329 | You can use `pnorm()` to find such a proportion without having to convert 330 | to standard units.
All you need to do is specify the `mean` and `sd` arguments 331 | with the corresponding average and SD values, respectively: 332 | 333 | ```{r} 334 | # proportion of players with weight below 200 pounds 335 | pnorm(200, mean = avg_weight, sd = sd_weight) 336 | ``` 337 | 338 | To find the proportion of players with weights above 230 pounds, include the 339 | argument `lower.tail = FALSE`: 340 | 341 | ```{r} 342 | # proportion of players with weight above 230 pounds 343 | pnorm(230, mean = avg_weight, sd = sd_weight, lower.tail = FALSE) 344 | ``` 345 | 346 | You can also use `qnorm()` to find the weight 347 | such that 60% of players are below it: 348 | 349 | ```{r} 350 | qnorm(0.6, mean = avg_weight, sd = sd_weight) 351 | ``` 352 | 353 | Or the weight such that 35% of players are above it: 354 | 355 | ```{r} 356 | qnorm(0.35, mean = avg_weight, sd = sd_weight, lower.tail = FALSE) 357 | ``` 358 | -------------------------------------------------------------------------------- /scripts/06-normal-curve.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/06-normal-curve.pdf -------------------------------------------------------------------------------- /scripts/07-scatter-diagrams.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Scatter Diagrams" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof.
Gaston Sanchez" 5 | output: html_document 6 | urlcolor: blue 7 | --- 8 | 9 | > ### Learning Objectives 10 | > 11 | > - How to use `plot()` to create scatter diagrams 12 | > - Adding points with `points()` 13 | > - Adding lines with `abline()` 14 | > - How to use `ggplot()` to create scatter diagrams 15 | 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = TRUE) 19 | ``` 20 | 21 | ## Introduction 22 | 23 | The easiest way to plot scatter diagrams in R is with the `plot()` function. 24 | I should say that `plot()` produces different kinds of plots depending on the 25 | type of input(s) that you pass to it. 26 | 27 | If you pass two numeric variables (i.e. two R vectors) `x` and `y`, `plot()` 28 | will produce a scatter diagram. For example, consider the `height` and `weight` 29 | variables of the following toy data table: 30 | 31 | ```{r echo = FALSE} 32 | library(xtable) 33 | dat = data.frame( 34 | name = c('Luke', 'Leia', 'Obi-Wan', 'Yoda', 'Chebacca'), 35 | sex = c('male', 'female', 'male', 'male', 'male'), 36 | height = c(172, 150, 182, 66, 228), 37 | weight = c(77, 49, 44, 78, 112) 38 | ) 39 | ``` 40 | 41 | ```{r, echo=FALSE, results='asis', message=FALSE} 42 | xtb <- xtable(dat, digits = 2) 43 | print(xtb, comment = FALSE, type = 'latex', 44 | include.rownames = FALSE) 45 | ``` 46 | 47 | To make a scatter diagram with `height` and `weight`, you can create two 48 | vectors and pass them to `plot()`: 49 | 50 | ```{r out.width='50%', fig.align='center', fig.width=3, fig.height=3.5} 51 | height = c(172, 150, 182, 66, 228) # in centimeters 52 | weight = c(77, 49, 44, 78, 112) # in kilograms 53 | 54 | # default scatter diagram 55 | plot(height, weight) 56 | ``` 57 | 58 | If you pass a factor to `plot()` it will produce a bar-chart: 59 | 60 | ```{r out.width='50%', fig.align='center', fig.width=2.5, fig.height=3.5} 61 | # qualitative variable (as an R factor) 62 | sex = factor(c('male', 'female', 'male', 'male', 'male')) 63 | 64 | # default scatter 
diagram 65 | plot(sex) 66 | ``` 67 | 68 | 69 | Note that `plot()` displays a very simple, and kind of ugly, scatter diagram. 70 | This is not an accident. In fact, the basic plots in R follow a "quick and dirty" 71 | approach. They are not publication quality, but that is OK. The default display 72 | of `plot()` was not designed to produce pretty graphics, but rather to produce 73 | visualizations that quickly allow you to explore the data, identify patterns, 74 | help you ask new research questions, and then move on with more visualizations 75 | or to the next analytical stages. 76 | 77 | Although `plot()` produces a basic graph, you can use several arguments, or 78 | graphical parameters, to obtain a nicer chart. To find more information about 79 | the available graphical parameters for `plot()`, take a look at the documentation 80 | provided by `help(plot)` and `help(par)`. 81 | 82 | The following code uses various graphical parameters to display a more visually 83 | appealing scatter diagram: 84 | 85 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5} 86 | # nicer scatter diagram 87 | plot(height, weight, 88 | las = 1, # orientation of y-axis tick marks 89 | pch = 19, # filled dots 90 | col = '#598CDD', # color of dots 91 | xlab = 'Height (cm)', # x-axis label 92 | ylab = 'Weight (kg)', # y-axis label 93 | main = 'Height -vs- Weight scatter diagram') 94 | ``` 95 | 96 | 97 | ## Adding points and lines 98 | 99 | Often, you may want to add more points and/or line(s) to a given plot. When 100 | you use `plot()`, you add points with `points()`, and lines with `abline()`. 101 | 102 | For example, say you want to add the point of averages.
First, get the 103 | averages: 104 | 105 | ```{r} 106 | avg_height = mean(height) 107 | avg_weight = mean(weight) 108 | ``` 109 | 110 | Once you have the coordinates of the point of averages, you can redraw the 111 | scatter diagram with `plot()`, adding the point of averages with `points()`: 112 | 113 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5} 114 | # scatter diagram 115 | plot(height, weight, 116 | las = 1, # orientation of y-axis tick marks 117 | pch = 19, # filled dots 118 | col = '#598CDD', # color of dots 119 | xlab = 'Height (cm)', # x-axis label 120 | ylab = 'Weight (kg)', # y-axis label 121 | main = 'Height -vs- Weight scatter diagram') 122 | # point of averages 123 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato") 124 | ``` 125 | 126 | 127 | Another common task involves adding one or more lines to a scatter diagram 128 | produced by `plot()`. One option to achieve this task is via the `abline()` 129 | function. Here's an example showing the previous scatter diagram, with two 130 | guide lines corresponding to the point of averages: 131 | 132 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5} 133 | # scatter diagram 134 | plot(height, weight, 135 | las = 1, # orientation of y-axis tick marks 136 | pch = 19, # filled dots 137 | col = '#598CDD', # color of dots 138 | xlab = 'Height (cm)', # x-axis label 139 | ylab = 'Weight (kg)', # y-axis label 140 | main = 'Height -vs- Weight scatter diagram') 141 | # guide lines for point of avgs 142 | abline(h = avg_weight, v = avg_height, col = "tomato") 143 | # point of averages 144 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato") 145 | ``` 146 | 147 | The argument `h` is used to specify the y-value for _horizontal_ lines; 148 | the argument `v` is used to specify the x-value for _vertical_ lines.
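To see how `h` and `v` behave, here is a minimal sketch that reuses the toy data from this script. Both arguments accept vectors, so a single call to `abline()` can draw several lines. The reference weights of 60 and 100 kg are made-up values, included only for illustration.

```r
# toy data from this script
height = c(172, 150, 182, 66, 228)  # in centimeters
weight = c(77, 49, 44, 78, 112)     # in kilograms

# coordinates of the point of averages
avg_height = mean(height)
avg_weight = mean(weight)

# basic scatter diagram
plot(height, weight, pch = 19, las = 1)

# one vertical line at the average height
abline(v = avg_height, col = "tomato")

# three horizontal lines in one call: the average weight plus two
# arbitrary reference values (hypothetical, for illustration only)
abline(h = c(60, avg_weight, 100), col = "gray50", lty = 2)
```

Like `points()`, `abline()` draws on top of the most recent plot, so it has to be called after `plot()`.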
149 | 150 | If what you want is to specify a line with intercept `a` and slope `b`, then 151 | specify these arguments inside `abline()`: 152 | 153 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5} 154 | # scatter diagram 155 | plot(height, weight, 156 | las = 1, # orientation of y-axis tick marks 157 | pch = 19, # filled dots 158 | col = '#598CDD', # color of dots 159 | xlab = 'Height (cm)', # x-axis label 160 | ylab = 'Weight (kg)', # y-axis label 161 | main = 'Height -vs- Weight scatter diagram') 162 | # guide lines for point of avgs 163 | abline(h = avg_weight, v = avg_height, col = "tomato") 164 | # line with intercept and slope 165 | abline(a = 40, b = 0.3, col = "gray50", lty = 2, lwd = 2) 166 | # point of averages 167 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato") 168 | ``` 169 | 170 | 171 | 172 | ## Scatter diagrams with `ggplot2` 173 | 174 | Another approach to create scatter diagrams in R is to use functions from the 175 | package `"ggplot2"`. This package provides a different philosophy to define 176 | graphs, and it also produces plots with visual attributes carefully chosen 177 | to provide prettier plots. 178 | 179 | You should have the package `"ggplot2"` already installed, since you were 180 | supposed to use it for HW02.
Assuming that this is the case, you need to load 181 | `"ggplot2"` with the function `library()` in order to start using its functions: 182 | 183 | ```{r warning=FALSE, message=FALSE} 184 | # load ggplot2 185 | library(ggplot2) 186 | ``` 187 | 188 | One of the major differences between basic plots---like those produced by `plot()`---and graphics with `ggplot()`, is that the latter requires the data 189 | to be in the form of a data frame: 190 | 191 | ```{r} 192 | dat = data.frame( 193 | name = c('Luke', 'Leia', 'Obi-Wan', 'Yoda', 'Chewbacca'), 194 | sex = c('male', 'female', 'male', 'male', 'male'), 195 | height = c(172, 150, 182, 66, 228), 196 | weight = c(77, 49, 44, 78, 112) 197 | ) 198 | ``` 199 | 200 | To create a scatter diagram with `"ggplot2"`, type the following commands: 201 | 202 | ```{r out.width='40%', fig.align='center', fig.width=3, fig.height=3} 203 | ggplot(data = dat, aes(x = height, y = weight)) + 204 | geom_point() 205 | ``` 206 | 207 | - The main input of `ggplot()` is `data` which takes the name of the data 208 | frame containing the variables. 209 | - The `aes()` function---inside `ggplot()`---allows you to specify which 210 | variables will be used for the `x` and `y` positions. 211 | - The `+` operator is used to add a _layer_, in this case, the layer corresponds 212 | to `geom_point()` 213 | - The function `geom_point()` specifies the type of geometric object to be 214 | displayed: points (since we want a scatter diagram with dots). 215 | 216 | As you can tell, the default chart produced by `ggplot()` is nicer than the 217 | one produced with `plot()`. 
You can customize the previous graph to add more 218 | details: 219 | 220 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4} 221 | ggplot(data = dat, aes(x = height, y = weight)) + 222 | geom_point(size = 3) + 223 | theme_bw() + 224 | ggtitle("Height -vs- Weight scatter diagram") 225 | ``` 226 | 227 | Here's another example of a scatter diagram that includes labels for each dot: 228 | 229 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4} 230 | ggplot(data = dat, aes(x = height, y = weight)) + 231 | geom_point(size = 3) + 232 | geom_text(aes(label = name), hjust=0, vjust=0) + 233 | xlim(0, 300) + 234 | theme_bw() + 235 | ggtitle("Height -vs- Weight scatter diagram") 236 | ``` 237 | 238 | Adding specific points with `ggplot()` is a bit trickier. This is because 239 | you need to provide data to `ggplot()` in the form of a data frame. In order 240 | to plot the point of averages with `ggplot()`, we need to create a data frame 241 | for such a point: 242 | 243 | ```{r} 244 | # data frame for the point of averages 245 | avgs = data.frame(height = avg_height, weight = avg_weight) 246 | avgs 247 | ``` 248 | 249 | One way to add the point of averages is to use `geom_point()` twice: once for 250 | the heights and weights of the individuals, and a second time for the 251 | point of averages: 252 | 253 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4} 254 | ggplot(data = dat, aes(x = height, y = weight)) + 255 | geom_point(size = 3) + 256 | geom_point(data = avgs, aes(x = height, y = weight), 257 | col = "tomato", size = 4) + 258 | geom_text(aes(label = name), hjust=0, vjust=0) + 259 | xlim(0, 300) + 260 | theme_bw() + 261 | ggtitle("Height -vs- Weight scatter diagram") 262 | ``` 263 | 264 | Finally, here's how to add guide lines for the point of averages: 265 | 266 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4} 267 | ggplot(data = dat, aes(x = height, y = weight)) + 268 |
geom_point(size = 3) + 269 | geom_point(data = avgs, aes(x = height, y = weight), 270 | col = "tomato", size = 4) + 271 | geom_vline(xintercept = avg_height, col = 'tomato') + 272 | geom_hline(yintercept = avg_weight, col = 'tomato') + 273 | geom_text(aes(label = name), hjust=0, vjust=0) + 274 | xlim(0, 300) + 275 | theme_bw() + 276 | ggtitle("Height -vs- Weight scatter diagram") 277 | ``` 278 | -------------------------------------------------------------------------------- /scripts/07-scatter-diagrams.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/07-scatter-diagrams.pdf -------------------------------------------------------------------------------- /scripts/08-correlation.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Correlation Coefficient" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof. Gaston Sanchez" 5 | output: html_document 6 | urlcolor: blue 7 | --- 8 | 9 | > ### Learning Objectives 10 | > 11 | > - Using scatter diagrams to visualize association of two variables 12 | > - Using R to "manually" compute the correlation coefficient 13 | > - Getting to know the function `cor()` 14 | > - Understanding how change of scales affect the correlation 15 | 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = TRUE) 19 | ``` 20 | 21 | ## Introduction 22 | 23 | In the previous script we talked about how to plot scatter diagrams in R using 24 | two different approaches: 1) the basic `plot()` function, and 2) the more 25 | advanced graphics package `"ggplot2"`. 26 | Knowing how to create scatter diagrams will help us introduce the ideas that 27 | have to do with the analysis of two quantitative variables. 
28 | 29 | Describing and summarizing a single (quantitative) variable is usually the 30 | first step of any data analysis. This should allow you to get to know the data 31 | by looking at the distributions of the variables, and reducing the numerical 32 | information in the data to a set of measures of center and spread. 33 | 34 | After performing a univariate analysis, the next step will usually consist of 35 | exploring how two variables may be associated: determining the type of association, 36 | how strong the association is (if any), and how to summarize such association. 37 | 38 | 39 | ## Anscombe Data Set 40 | 41 | In this tutorial we are going to use a special data set known as the _Anscombe_ 42 | data or _Anscombe's Quartet_. This data was created by Francis Anscombe in 43 | the early 1970s to illustrate statistical similarities and differences between 44 | four pairs of $x-y$ values. This is one of the many data sets that come 45 | in R, and it is available in the object `anscombe` 46 | 47 | ```{r} 48 | # Anscombe's Quartet 49 | anscombe 50 | ``` 51 | 52 | The data frame `anscombe` contains 8 variables: 4 `x`'s and 4 `y`'s. The way you should handle these variables is: `x1` with `y1`, `x2` with `y2`, and so on. 53 | 54 | 55 | ### Histograms 56 | 57 | Let's begin a univariate analysis by looking at the histograms of the `x` 58 | variables: 59 | 60 | ```{r xhistograms, eval = FALSE} 61 | # histograms of x-variables in 2x2 layout 62 | op = par(mfrow = c(2, 2)) 63 | hist(anscombe$x1, col = 'gray80', las = 1) 64 | hist(anscombe$x2, col = 'gray80', las = 1) 65 | hist(anscombe$x3, col = 'gray80', las = 1) 66 | hist(anscombe$x4, col = 'gray80', las = 1) 67 | par(op) 68 | ``` 69 | 70 | ```{r xhistograms, echo = FALSE, fig.height=6} 71 | ``` 72 | 73 | Note that `x1`, `x2`, and `x3` have the exact same histogram. If you look 74 | at the data frame, this is explained by the fact that these variables have the 75 | same values.
In contrast, `x4` has almost all of its values equal to 8, 76 | except for one value of 19. 77 | 78 | Now let's look at the histograms of the `y` variables: 79 | 80 | ```{r yhistograms, eval = FALSE} 81 | # histograms of y-variables in 2x2 layout 82 | op = par(mfrow = c(2, 2)) 83 | hist(anscombe$y1, col = 'gray80', las = 1) 84 | hist(anscombe$y2, col = 'gray80', las = 1) 85 | hist(anscombe$y3, col = 'gray80', las = 1) 86 | hist(anscombe$y4, col = 'gray80', las = 1) 87 | par(op) 88 | ``` 89 | 90 | ```{r yhistograms, echo = FALSE, fig.height=6} 91 | ``` 92 | 93 | ### Measures of Center and Spread 94 | 95 | To get various summary statistics, you can use the function `summary()` 96 | 97 | ```{r} 98 | # basic summary of x-variables 99 | summary(anscombe[ ,1:4]) 100 | 101 | # SD+ of x-variables 102 | apply(anscombe[, 1:4], MARGIN = 2, FUN = sd) 103 | ``` 104 | 105 | Again, note that the `summary()` output for `x1`, `x2`, and `x3` is the same. 106 | As for the standard deviation ($SD^+$), all `x`-variables have identical values. 107 | To calculate all the standard deviations at once, we are using the function 108 | `apply()`. This function allows you to _apply_ a function, e.g. `sd()`, to the 109 | columns (`MARGIN = 2`) of the input data `anscombe[, 1:4]`. 110 | 111 | Now let's get the summary indicators and standard deviation for the `y` variables: 112 | 113 | ```{r} 114 | # basic summary of y-variables 115 | summary(anscombe[ ,5:8]) 116 | 117 | # SD+ of y-variables 118 | apply(anscombe[, 5:8], MARGIN = 2, FUN = sd) 119 | ``` 120 | 121 | Can you notice anything special? Here's a hint: look at the averages and SDs. 122 | All four `y` variables have pretty much the same averages and SDs. But they 123 | have different ranges, quartiles, and medians. And if you take a peek at their 124 | histograms, their distributions also have different shapes.
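To make that comparison easier to see, the following sketch collects the averages, the $SD^+$ values, and the medians of the four `y` variables in a single small table. The object names `y_avgs`, `y_sds`, and `y_meds` are introduced here only for illustration.

```r
# averages of the four y-variables
y_avgs = colMeans(anscombe[, 5:8])

# SD+ of the four y-variables
y_sds = apply(anscombe[, 5:8], MARGIN = 2, FUN = sd)

# medians of the four y-variables (for contrast)
y_meds = apply(anscombe[, 5:8], MARGIN = 2, FUN = median)

# one row per statistic, one column per variable
round(rbind(average = y_avgs, SD = y_sds, median = y_meds), 2)
```

Reading across each row, the averages and SDs barely change from one variable to the next, while the medians do.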
125 | 126 | 127 | ## Scatter Diagrams 128 | 129 | The real interest in the Anscombe data set has to do with studying the 130 | association between each pair of $x-y$ values. The best way to start exploring 131 | pairwise associations is by looking at the scatter diagrams of each 132 | pair of points. How would you describe the shapes and patterns in each plot? 133 | 134 | ```{r scatterplots, eval = FALSE} 135 | # scatter diagrams in 2x2 layout 136 | op = par(mfrow = c(2, 2), mar = c(4.5, 4, 1, 1)) 137 | plot(anscombe$x1, anscombe$y1, pch = 20) 138 | plot(anscombe$x2, anscombe$y2, pch = 20) 139 | plot(anscombe$x3, anscombe$y3, pch = 20) 140 | plot(anscombe$x4, anscombe$y4, pch = 20) 141 | par(op) 142 | ``` 143 | 144 | ```{r scatterplots, fig.height=5.5, echo = FALSE} 145 | ``` 146 | 147 | - The first set `x1` and `y1` shows some degree of linear association. Although 148 | the dots do not lie on a line, we can say that they follow a linear pattern. 149 | 150 | - The second set clearly has a non-linear pattern; instead, the dots follow 151 | some type of curve (perhaps quadratic) or a polynomial of degree greater than 1. 152 | 153 | - The third set is almost perfectly linear except for the observation 154 | corresponding to $x = 13$ which falls outside the pattern of the rest of $y$ 155 | values. 156 | 157 | - The fourth set is similar to the third one in the sense that there is one 158 | observation (an outlier?) that does not follow the pattern of the other values. 159 | Most dots follow a vertical line at $x=8$ except for the dot at $x=19$. 160 | 161 | 162 | ## Correlation Coefficient 163 | 164 | In addition to the visual inspection of the scatter diagrams, statisticians 165 | use a summary measure to quantify the degree of _linear association_ between 166 | two quantitative variables: the __coefficient of correlation__. 
167 | 168 | One way to obtain the correlation coefficient of two variables $x$ and $y$ is 169 | as the average of the product of $x$ and $y$ in standard units. 170 | 171 | Let's consider `x1` and `y1` from the `anscombe` data set, and use R to "manually" 172 | calculate the correlation coefficient. This involves obtaining the average and 173 | the standard deviation $SD$, and then converting values to standard units: 174 | 175 | ```{r} 176 | # number of observations 177 | n = nrow(anscombe) 178 | 179 | # x1 in SU 180 | x1_avg = mean(anscombe$x1) 181 | x1_sd = sqrt((n-1)/n) * sd(anscombe$x1) 182 | x1su = (anscombe$x1 - x1_avg) / x1_sd 183 | 184 | # y1 in SU 185 | y1_avg = mean(anscombe$y1) 186 | y1_sd = sqrt((n-1)/n) * sd(anscombe$y1) 187 | y1su = (anscombe$y1 - y1_avg) / y1_sd 188 | 189 | # correlation: average of products 190 | mean(x1su * y1su) 191 | ``` 192 | 193 | Here's some good news. You don't really need to "manually" calculate the 194 | correlation coefficient. R actually has a function to compute the correlation 195 | of two variables: `cor()` 196 | 197 | ```{r} 198 | # correlation coefficient 199 | cor(anscombe$x1, anscombe$y1) 200 | ``` 201 | 202 | Now let's get the correlation coefficients for all four pairs of variables: 203 | 204 | ```{r} 205 | cor(anscombe$x1, anscombe$y1) 206 | cor(anscombe$x2, anscombe$y2) 207 | cor(anscombe$x3, anscombe$y3) 208 | cor(anscombe$x4, anscombe$y4) 209 | ``` 210 | 211 | Any surprises? As you can tell, all four pairs of $x,y$ variables have basically 212 | the same correlation of `r round(cor(anscombe$x1, anscombe$y1), 3)`. 213 | But not all of them have scatter diagrams in which the points are clustered around 214 | a line. 215 | 216 | The take-home message is that the correlation coefficient can be misleading in 217 | the presence of outliers or non-linear association.
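As a rough illustration of the effect of a single outlier, the sketch below recomputes the correlation for the third pair after leaving out the observation at $x = 13$, the one that falls outside the pattern in the scatter diagram. The names `r_all`, `r_without`, and `out` are introduced here for illustration, and dropping a point is done only to see its influence, not as a recommended way to clean data.

```r
# third pair of Anscombe's Quartet
x3 = anscombe$x3
y3 = anscombe$y3

# correlation using all 11 observations
r_all = cor(x3, y3)

# position of the observation that falls outside the pattern (x = 13)
out = which(x3 == 13)

# correlation after leaving that observation out
r_without = cor(x3[-out], y3[-out])

# compare both correlations
round(c(all = r_all, without_outlier = r_without), 3)
```

A single observation is enough to move the correlation away from a nearly perfect linear association.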
218 | 219 | 220 | ## Properties of the Correlation Coefficient 221 | 222 | One of the properties of the correlation coefficient is that it is a symmetric 223 | measure. By this we mean that the order of the variables is not important. 224 | You can interchange $x$ and $y$, and the correlation between them 225 | is unchanged: 226 | 227 | $$ 228 | cor(x,y) = cor(y,x) 229 | $$ 230 | 231 | To illustrate this property, let's create two variables: 232 | 233 | ```{r} 234 | # two variables 235 | x = c(1, 3, 4, 5, 7, 6) 236 | y = c(5, 9, 7, 8, 9, 10) 237 | ``` 238 | 239 | ```{r scatterdiags, eval = FALSE} 240 | op = par(mfrow = c(1,2)) 241 | plot(x, y, pch = 20, col = "blue", las = 1, cex = 1.5) 242 | plot(y, x, pch = 20, col = "blue", las = 1, cex = 1.5) 243 | par(op) 244 | ``` 245 | 246 | ```{r scatterdiags, out.width='80%', fig.align='center', fig.width = 8, fig.height=4} 247 | ``` 248 | 249 | The scatter diagram changes depending on what variable is on each axis. 250 | However, the correlation coefficient in both cases is the same: 251 | 252 | ```{r} 253 | # symmetric 254 | cor(x, y) 255 | cor(y, x) 256 | ``` 257 | 258 | 259 | ### Change of Scale 260 | 261 | The other properties of the correlation coefficient have to do with what the 262 | FPP book calls _change of scale_. To be more precise, the changes of scale 263 | considered here are __linear__ changes of scale (i.e. linear transformations).
Typical operations that result in a linear change of scale are: 265 | 266 | - Adding a scalar: $x + 3, y$ 267 | - Multiplying by a positive scalar: $2x, y$ 268 | - Multiplying by a negative scalar: $-2x, y$ 269 | - Adding and multiplying: $2x + 3, y$ 270 | 271 | ```{r change-scale, eval = FALSE} 272 | # scatter diagrams in 2x2 layout 273 | op = par(mfrow = c(2, 2), mar = c(4.5, 4, 1, 1)) 274 | plot(x + 3, y, pch = 20, col = "orange", las = 1, cex = 1.5) 275 | plot(2 * x, y, pch = 20, col = "green3", las = 1, cex = 1.5) 276 | plot((-2) * x, y, pch = 20, col = "violet", las = 1, cex = 1.5) 277 | plot(2 * x + 3, y, pch = 20, col = "red", las = 1, cex = 1.5) 278 | par(op) 279 | ``` 280 | 281 | ```{r change-scale, echo = FALSE, fig.height=6} 282 | ``` 283 | 284 | ```{r correlations, eval = FALSE} 285 | cor(x, y) 286 | cor(x + 3, y) 287 | cor(2 * x, y) 288 | cor(-2 * x, y) 289 | cor(2 * x + 3, y) 290 | ``` 291 | 292 | ```{r correlations, echo = FALSE} 293 | ``` 294 | 295 | What can you conclude from the change of scales? In which cases is the correlation 296 | coefficient affected by such changes?
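Once you have thought about these questions, one possible way to check your answers numerically is sketched below. It compares each transformed correlation against the original one with `all.equal()`, which tolerates tiny floating-point differences. The names `r` and `checks` are introduced here only for illustration.

```r
# same two variables as above
x = c(1, 3, 4, 5, 7, 6)
y = c(5, 9, 7, 8, 9, 10)

# original correlation
r = cor(x, y)

# compare each transformed correlation with the original one
checks = c(
  add_scalar       = isTRUE(all.equal(cor(x + 3, y), r)),
  times_positive   = isTRUE(all.equal(cor(2 * x, y), r)),
  times_negative   = isTRUE(all.equal(cor(-2 * x, y), -r)),  # note the sign
  add_and_multiply = isTRUE(all.equal(cor(2 * x + 3, y), r))
)
checks
```

Note that the third comparison is made against `-r` rather than `r`.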
297 | -------------------------------------------------------------------------------- /scripts/08-correlation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/08-correlation.pdf -------------------------------------------------------------------------------- /scripts/09-regression-line.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/09-regression-line.pdf -------------------------------------------------------------------------------- /scripts/10-prediction-and-errors-in-regression.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Predictions and Errors in Regression" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof. Gaston Sanchez" 5 | output: html_document 6 | fontsize: 11pt 7 | urlcolor: blue 8 | --- 9 | 10 | > ### Learning Objectives 11 | > 12 | > - Calculating predicted values with the regression method 13 | > - Looking at the regression residuals 14 | > - Calculating r.m.s. error for regression 15 | 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = TRUE) 19 | ``` 20 | 21 | 22 | ## Introduction 23 | 24 | In the previous script, you learned about the function `lm()` to obtain a simple linear regression model. Specifically, we looked at the regression `coefficients`: the intercept and the slope.
You also learned how to plot a scatter diagram with the regression line, via the `abline()` function, as well as how to "manually" calculate the intercept and slope with the formulas: 25 | 26 | $$ 27 | slope = r \times \frac{SD_y}{SD_x} 28 | $$ 29 | 30 | In turn, Chapter 12 presents the formula of the intercept as: 31 | 32 | $$ 33 | intercept = avg_y - slope \times avg_x 34 | $$ 35 | 36 | 37 | ## Regression with Height Data Set 38 | 39 | To continue our discussion, we'll keep using the data set in the CSV file `pearson.csv` (in the github repository): 40 | 41 | ```{r} 42 | # assembling the URL of the CSV file 43 | # (otherwise it won't fit within the margins of this document) 44 | repo = 'https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/' 45 | datafile = 'master/data/pearson.csv' 46 | url = paste0(repo, datafile) 47 | 48 | # read in data set 49 | dat = read.csv(url) 50 | ``` 51 | 52 | The data frame `dat` contains `r nrow(dat)` rows, and `r ncol(dat)` columns: 53 | 54 | - `Father`: height of the father (in inches) 55 | - `Son`: height of the son (in inches) 56 | 57 | Here's a reminder on how to use the function `lm()` to regress `Son` on `Father`: 58 | 59 | ```{r} 60 | # run regression analysis 61 | reg = lm(Son ~ Father, data = dat) 62 | reg 63 | ``` 64 | 65 | You can compare the coefficients given by `lm()` with your own calculated 66 | $b_1$ and $b_0$ according to the previous formulas.
First let's get the main 67 | ingredients: 68 | 69 | ```{r} 70 | # number of values (to be used for correcting SD+) 71 | n = nrow(dat) 72 | 73 | # averages 74 | avg_x = mean(dat$Father) 75 | avg_y = mean(dat$Son) 76 | 77 | # SD (corrected SD+) 78 | sd_x = sqrt((n-1)/n) * sd(dat$Father) 79 | sd_y = sqrt((n-1)/n) * sd(dat$Son) 80 | 81 | # correlation coefficient 82 | r = cor(dat$Father, dat$Son) 83 | ``` 84 | 85 | Now let's compute the slope and intercept, and compare them with 86 | `reg$coefficients` 87 | 88 | ```{r} 89 | # slope 90 | b1 = r * (sd_y / sd_x) 91 | b1 92 | 93 | # intercept 94 | b0 = avg_y - (b1 * avg_x) 95 | b0 96 | 97 | # compared with coeffs 98 | reg$coefficients 99 | ``` 100 | 101 | 102 | ## Predicting Values 103 | 104 | As I mentioned in the last tutorial, regression tools are 105 | mainly used for prediction purposes. This means that we can use the estimated 106 | regression line $\mathtt{Son} \approx b_0 + b_1 \mathtt{Father}$, to predict 107 | the height of Son given a particular Father's height. 108 | 109 | For example, if a father has a height of 71 inches, what is the predicted 110 | son's height? 111 | 112 | __Option a)__ One way to answer this question is with the regression method described in chapter 10 of FPP. The first step consists of converting $x$ to standard units, then multiplying by $r$ to get the predicted $\hat{y}$ in standard units, and finally rescaling the predicted value to the original units.
113 | 114 | ```{r} 115 | # height of father in standard units 116 | height = 71 117 | height_su = (height - avg_x) / sd_x 118 | height_su 119 | ``` 120 | 121 | ```{r} 122 | # predicted Son's height in standard units 123 | prediction_su = r * height_su 124 | prediction_su 125 | ``` 126 | 127 | ```{r} 128 | # rescaled to original units 129 | prediction = prediction_su * sd_y + avg_y 130 | prediction 131 | ``` 132 | 133 | 134 | __Option b)__ Another way to find the predicted son's height when the height of the father is 71 is by using the equation of the regression line: 135 | 136 | ```{r} 137 | # predict height of son with a 71 in. tall father 138 | b0 + b1 * 71 139 | ``` 140 | 141 | __Option c)__ A third option is with the `predict()` function. The first 142 | argument must be an `"lm"` object; the second argument must be a data frame 143 | containing the values for `Father`: 144 | 145 | ```{r} 146 | # new data (must be a data frame) 147 | newdata = data.frame(Father = 71) 148 | 149 | # predict son's height 150 | predict(reg, newdata) 151 | ``` 152 | 153 | If you want to know the predicted values based on several `Father`'s heights, 154 | then do something like this: 155 | 156 | ```{r} 157 | more_data = data.frame(Father = c(65, 66.7, 67, 68.5, 70.5, 71.3)) 158 | 159 | predict(reg, more_data) 160 | ``` 161 | 162 | 163 | ## R.M.S. Error for Regression 164 | 165 | The predictions given by the regression line will tend to be off. There is 166 | usually some difference between the observed values $y$ and the predicted 167 | values $\hat{y}$. This difference is called a __residual__. The residuals are 168 | part of the `"lm"` object `reg`. 169 | You can take a peek at such residuals with `head()` 170 | 171 | ```{r} 172 | # first six residuals 173 | head(reg$residuals) 174 | ``` 175 | 176 | By how much will the predicted values be off? 177 | To find the answer, you need to calculate the _Root Mean Square_ (RMS) error 178 | for regression.
In other words, you need to take the residuals 179 | (i.e. difference between actual values and predicted values), and get the 180 | square root of the average of their squares. 181 | 182 | ```{r} 183 | # r.m.s. error for regression 184 | rms = sqrt(mean(reg$residuals^2)) 185 | rms 186 | ``` 187 | 188 | The r.m.s. value tells you the typical size of the residuals. This means that 189 | the predicted heights of sons will typically be off by about `r round(rms, 2)` 190 | inches. 191 | 192 | 193 | ## Are residuals homoscedastic? 194 | 195 | As you know, the main assumption in a simple regression analysis is that $X$ 196 | and $Y$ are approximately linearly related. This means that we can 197 | use a line as a good summary for the cloud of points. For a line to be able to do 198 | a good summarizing job, the amount of spread around the regression line should 199 | be roughly the same (i.e. constant). This requirement has a very 200 | specific---and rather ugly---name: __homoscedasticity__, which simply means 201 | "same scatter". Visually, homoscedasticity comes in the form of the so-called 202 | football-shaped cloud of points or, in a more geometric sense, a cloud of points 203 | with a chiefly elliptical shape. 204 | 205 | The `"lm"` object `reg` contains the vector of residuals (see `reg$residuals`). 206 | The residuals from the regression line must average out to 0. To confirm this, 207 | let's get their average: 208 | 209 | ```{r} 210 | mean(reg$residuals) 211 | ``` 212 | 213 | You can take a look at the _residual plot_ by running this command: 214 | 215 | ```{r out.width='60%', fig.align='center', fig.width=6, fig.height=5} 216 | # residuals plot 217 | plot(reg, which = 1) 218 | ``` 219 | 220 | which is roughly equivalent to this other command: 221 | 222 | ```{r eval = FALSE} 223 | # roughly equivalent (without title and smoothed curve) 224 | plot(reg$fitted.values, reg$residuals) 225 | ``` 226 | 227 | This residual plot is not exactly the same as the one the book describes (pages 187-188).
228 | To plot the residuals like the book does, you would need to use the `Father` 229 | variable in the x-axis: 230 | 231 | ```{r out.width='60%', fig.align='center', fig.width=6, fig.height=5} 232 | # residuals plot (as in FPP) 233 | plot(dat$Father, reg$residuals) 234 | abline(h = 0, lty = 2) # horizontal dashed line 235 | ``` 236 | 237 | The difference is only in the scale of the horizontal axis. But the important 238 | part in both plots is the shape of the cloud. 239 | As you look across the residual plot, there is no systematic tendency for the 240 | points to drift up or down. The red line displayed by `plot(reg, which = 1)` 241 | is a smoothed curve (a lowess fit) through the residuals. When the residuals show no 242 | systematic pattern, this curve is basically a horizontal line. This is what you want to see when 243 | inspecting the residual plot. Why? Because it supports the appropriate use of 244 | the regression line. 245 | 246 | 247 | ## Summary output 248 | 249 | `reg` is an object of class `"lm"`---linear model. For this type of R object, 250 | you can use the `summary()` function to get additional information and diagnostics: 251 | 252 | ```{r} 253 | # summarized linear model 254 | sum_reg = summary(reg) 255 | sum_reg 256 | ``` 257 | 258 | The information displayed by `summary()` is the typical output that most 259 | statistical programs provide about a simple linear regression model. There 260 | are four major parts: 261 | 262 | - `Call`: the command used when invoking `lm()`. 263 | - `Residuals`: summary indicators of the residuals. 264 | - `Coefficients`: table of regression coefficients. 265 | - Additional statistics: more diagnostic tools. 266 | 267 | In the same way that `lm()` produces `"lm"` objects, `summary()` of an `"lm"` 268 | object produces a `"summary.lm"` object. This type of object also contains 269 | more information than what is displayed by default.
To see the list of all the 270 | components in `sum_reg`, you can use again the function `names()`: 271 | 272 | ```{r} 273 | names(sum_reg) 274 | ``` 275 | 276 | -------------------------------------------------------------------------------- /scripts/10-prediction-and-errors-in-regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/10-prediction-and-errors-in-regression.pdf -------------------------------------------------------------------------------- /scripts/11-binomial-formula.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/11-binomial-formula.pdf -------------------------------------------------------------------------------- /scripts/12-chance-process.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Chance Processes and Variability" 3 | subtitle: "Intro to Stats, Spring 2017" 4 | author: "Prof. Gaston Sanchez" 5 | output: html_document 6 | fontsize: 11pt 7 | urlcolor: blue 8 | --- 9 | 10 | > ### Learning Objectives 11 | > 12 | > - How to use R to simulate chance processes 13 | > - Getting to know the function `sample()` 14 | > - Simulate flipping a coin 15 | > - Simulate rolling a die 16 | > - Simulate drawing tickets from a box 17 | 18 | 19 | ```{r setup, include=FALSE} 20 | knitr::opts_chunk$set(echo = TRUE) 21 | ``` 22 | 23 | ## Introduction 24 | 25 | In this tutorial we will see how to use R to simulate basic chance processes 26 | like tossing a coin, rolling a die, or drawing tickets from a box. 
The aim is 27 | to give you some tools that allow you to better understand and visualize 28 | fundamental concepts such as the law of large numbers, the law of averages, 29 | and the central limit theorem. 30 | 31 | 32 | ## Coins, Dice, and Boxes with Tickets 33 | 34 | Chance processes, also referred to as chance experiments, have to do with 35 | actions in which the resulting outcome turns out to be different in each 36 | occurrence. 37 | 38 | Typical examples of basic chance processes are tossing one or more coins, 39 | rolling one or more dice, selecting one or more cards from a deck of cards, 40 | and in general, things that can be framed in terms of drawing tickets out of 41 | a box (or any other type of container: bag, urn, etc.). 42 | 43 | You can use your computer, and R in particular, to simulate chance processes. 44 | In order to do that, the first step consists of learning how to create a 45 | virtual coin, or die, or box with tickets. 46 | 47 | 48 | ### Creating a coin 49 | 50 | The simplest way to create a coin with two sides, `"heads"` and `"tails"`, is 51 | with an R vector via the _combine_ function `c()` 52 | 53 | ```{r} 54 | coin = c("heads", "tails") 55 | ``` 56 | 57 | You can also create a _numeric_ coin that shows `0` and `1` instead of 58 | `"heads"` and `"tails"`: 59 | 60 | ```{r} 61 | coin = c(0, 1) 62 | ``` 63 | 64 | 65 | ### Creating a die 66 | 67 | What about simulating a die in R? Pretty much the same way you create a coin: 68 | simply define a vector with numbers representing the number of spots in a die. 69 | 70 | ```{r} 71 | die = c(1, 2, 3, 4, 5, 6) 72 | 73 | # equivalent 74 | die = 1:6 75 | ``` 76 | 77 | 78 | ### Creating a box with tickets 79 | 80 | Likewise, you can create a general box with tickets.
For instance, say you have 81 | a box with tickets labeled 1, 2, 3 and 4; this can be implemented in R as: 82 | 83 | ```{r} 84 | tickets = c(1, 2, 3, 4) 85 | ``` 86 | 87 | 88 | 89 | ## Drawing tickets with `sample()` 90 | 91 | Once you have an object that represents the _box with tickets_, the next step 92 | involves learning how to draw tickets from the box. One way to simulate drawing 93 | tickets from a box in R is with the function `sample()` which lets you draw 94 | random samples, with or without replacement, from an input vector. 95 | 96 | For example, consider a "box" with tickets 1, 2, 3. 97 | To draw one ticket, use `sample()` like this: 98 | 99 | ```{r} 100 | # box with tickets 101 | tickets = c(1, 2, 3) 102 | 103 | # draw one ticket 104 | sample(tickets, size = 1) 105 | ``` 106 | 107 | By default, `sample()` draws each ticket with the same probability. In other 108 | words, each ticket is assigned the same probability of being chosen. Another 109 | default behavior of `sample()` is to take a sample of the specified `size` 110 | without replacement. If `size = 1`, it does not really matter whether the sampling 111 | is done with or without replacement. 112 | 113 | To draw two tickets WITHOUT replacement, use `sample()` like this: 114 | 115 | ```{r} 116 | # draw 2 tickets without replacement 117 | sample(tickets, size = 2) 118 | ``` 119 | 120 | To draw two tickets WITH replacement, use `sample()` and specify its argument 121 | `replace = TRUE`, like this: 122 | 123 | ```{r} 124 | # draw 2 tickets with replacement 125 | sample(tickets, size = 2, replace = TRUE) 126 | ``` 127 | 128 | The way `sample()` works is by taking a random sample from the input vector. 129 | This means that every time you invoke `sample()` you will likely get a different 130 | output. 131 | 132 | In order to make the examples replicable (so you can get the same output as me), 133 | you need to specify what is called a __random seed__. This is done with the 134 | function `set.seed()`. 
By setting a _seed_ right before 135 | you use one of the random generator functions, like `sample()`, you will get 136 | the same values every time you run the code. 137 | 138 | ```{r} 139 | # set random seed 140 | set.seed(1257) 141 | 142 | # draw 4 tickets with replacement 143 | sample(tickets, size = 4, replace = TRUE) 144 | ``` 145 | 146 | Try the code above. You should get the exact same sample. 147 | 148 | Last but not least, `sample()` comes with the argument `prob` which allows you 149 | to provide specific probabilities for each element in the input vector. 150 | 151 | By default, `prob = NULL`, which means that every element has the same 152 | probability of being drawn. In the example of tossing a coin, the command 153 | `sample(coin)` is equivalent to `sample(coin, prob = c(0.5, 0.5))`. In the 154 | latter case we explicitly specify a 50% chance of heads, and a 155 | 50% chance of tails: 156 | 157 | ```{r echo = FALSE} 158 | # tossing a fair coin 159 | coin = c("heads", "tails") 160 | 161 | sample(coin) 162 | sample(coin, prob = c(0.5, 0.5)) 163 | ``` 164 | 165 | However, you can provide different probabilities for each of the elements in 166 | the input vector. For instance, to simulate a __loaded__ coin with chance of 167 | heads 20%, and chance of tails 80%, set `prob = c(0.2, 0.8)` like so: 168 | 169 | ```{r} 170 | # tossing a loaded coin (20% heads, 80% tails) 171 | sample(coin, size = 5, replace = TRUE, prob = c(0.2, 0.8)) 172 | ``` 173 | 174 | 175 | ----- 176 | 177 | 178 | ## Simulating tossing a coin 179 | 180 | Now that we've talked about `sample()`, let's use R to implement code that 181 | simulates tossing a fair coin one or more times. 182 | 183 | __Recap.__ To toss a coin using R, we first need an object that plays the role 184 | of a coin. A simple way to create a `coin` is using a vector with two elements: 185 | `"heads"` and `"tails"`. Then, to simulate tossing a coin one or more times, 186 | we use the `sample()` function. 
187 | Here's how to simulate a coin toss using `sample()` to take a random sample of 188 | size 1 from `coin`: 189 | 190 | ```{r coin-vector} 191 | # coin object 192 | coin <- c("heads", "tails") 193 | 194 | # one toss 195 | sample(coin, size = 1) 196 | ``` 197 | 198 | To simulate multiple tosses, just change the `size` argument, and specify 199 | sampling with replacement (`replace = TRUE`): 200 | 201 | ```{r various-tosses} 202 | # 3 tosses 203 | sample(coin, size = 3, replace = TRUE) 204 | 205 | # 6 tosses 206 | sample(coin, size = 6, replace = TRUE) 207 | ``` 208 | 209 | 210 | ### Coin Simulations 211 | 212 | Now that we have all the elements to toss a coin with R, let's simulate flipping 213 | a coin 100 times, and use the function `table()` to count the resulting number 214 | of `"heads"` and `"tails"`: 215 | 216 | ```{r} 217 | # number of flips 218 | num_flips = 100 219 | 220 | # flips simulation 221 | coin = c('heads', 'tails') 222 | flips = sample(coin, size = num_flips, replace = TRUE) 223 | 224 | # number of heads and tails 225 | freqs = table(flips) 226 | freqs 227 | ``` 228 | 229 | In my case, I got `r freqs[1]` heads and `r freqs[2]` tails. Your results will 230 | probably be different than mine. Some of you will get more `"heads"`, some of 231 | you will get more `"tails"`, and some will get exactly 50 `"heads"` and 50 232 | `"tails"`. 233 | 234 | Run another series of 100 flips, and find the frequency of `"heads"` and `"tails"`: 235 | 236 | ```{r} 237 | flips = sample(coin, size = num_flips, replace = TRUE) 238 | freqs = table(flips) 239 | freqs 240 | ``` 241 | 242 | Let's make things a little bit more complex but also more interesting. The idea 243 | is to repeat 100 flips 1000 times. To carry out this simulation, we are going 244 | to use a programming structure called a `for` loop. This is one way to tell 245 | the computer to repeat the same action a given number of times. 246 | Don't worry about this. 
Just execute the following lines of code: 247 | 248 | ```{r} 249 | # total number of repetitions 250 | times = 1000 251 | 252 | # vectors of zeros to store number of heads and tails in each repetition 253 | heads = numeric(times) 254 | tails = numeric(times) 255 | 256 | # 100 flips of a coin, repeated 1000 times 257 | for (i in 1:times) { 258 | flips = sample(coin, size = 100, replace = TRUE) 259 | freqs = table(flips) 260 | heads[i] = freqs["heads"] 261 | tails[i] = freqs["tails"] 262 | } 263 | ``` 264 | 265 | What the code above is doing is simulating 100 flips of a coin, not once, 266 | not twice, but 1000 times. In each repetition, we count how many `"heads"` 267 | and how many `"tails"`, and store those counts in the vectors `heads` and 268 | `tails`, respectively. 269 | 270 | Each vector, `heads` and `tails`, contains 1000 values. We can draw 271 | a bar chart to see the empirical relative frequencies: 272 | 273 | ```{r fig.align='center', out.width='75%', fig.height=4.5} 274 | barplot(table(heads)/1000, las = 1, cex.names = 0.5, border = NA, 275 | main = "Frequency of number of heads in 100 flips") 276 | ``` 277 | 278 | ```{r fig.align='center', out.width='75%', fig.height=4.5} 279 | barplot(table(tails)/1000, las = 1, cex.names = 0.5, border = NA, 280 | main = "Frequency of number of tails in 100 flips") 281 | ``` 282 | 283 | 284 | 285 | ## Frequencies 286 | 287 | Typical probability problems that have to do with coin tossing require 288 | computing the total proportion of `"heads"` and `"tails"`: 289 | 290 | ```{r five-tosses} 291 | # five tosses 292 | five <- sample(coin, size = 5, replace = TRUE) 293 | 294 | # proportion of heads 295 | sum(five == "heads") / 5 296 | 297 | # proportion of tails 298 | sum(five == "tails") / 5 299 | ``` 300 | 301 | It is also customary to compute the relative frequencies of `"heads"` and 302 | `"tails"` in a series of tosses: 303 | 304 | ```{r relative-freqs} 305 | # relative frequencies of heads 306 | cumsum(five == "heads") / 
1:length(five) 307 | 308 | # relative frequencies of tails 309 | cumsum(five == "tails") / 1:length(five) 310 | ``` 311 | 312 | Likewise, it is common to look at how the relative frequencies of heads or 313 | tails change over a series of tosses: 314 | 315 | ```{r plot-freqs} 316 | set.seed(5938) 317 | hundreds <- sample(coin, size = 500, replace = TRUE) 318 | head_freqs = cumsum(hundreds == "heads") / 1:500 319 | 320 | plot(1:500, head_freqs, type = "l", ylim = c(0, 1), las = 1, 321 | col = "#3989f8", lwd = 2, 322 | xlab = 'number of tosses', 323 | ylab = 'frequency of heads') 324 | # reference line at 0.5 325 | abline(h = 0.5, col = 'gray50', lwd = 1.5, lty = 2) 326 | ``` 327 | 328 | So far we have written code in R that simulates tossing a coin one or more 329 | times. We have included commands to compute the proportion of heads and tails, 330 | as well as the relative frequencies of heads (or tails) in a series of tosses. 331 | In addition, we have produced a plot of the relative frequencies and seen 332 | how, as the number of tosses increases, the frequency of heads (and tails) 333 | approaches 0.5. 
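
To look at the same idea numerically, here is a minimal sketch (not part of the original script) that compares the proportion of heads for a few different numbers of tosses. The helper `toss_proportion()`, the seed value, and the chosen numbers of tosses are assumptions made only for illustration:

```{r}
# hypothetical helper (an assumption for this sketch):
# toss a fair coin n times and return the proportion of heads
coin <- c("heads", "tails")

toss_proportion <- function(n) {
  tosses <- sample(coin, size = n, replace = TRUE)
  mean(tosses == "heads")
}

# arbitrary seed, used only to make the sketch replicable
set.seed(2017)

# numbers of tosses to compare (illustrative choices)
sizes <- c(10, 100, 1000, 10000)
props <- sapply(sizes, toss_proportion)

# proportion of heads and its distance from 0.5 for each number of tosses
data.frame(tosses = sizes, prop_heads = props, abs_error = abs(props - 0.5))
```

The `abs_error` column shows how far each proportion is from 0.5. It tends to shrink as the number of tosses grows, although in any single run it does not have to decrease at every step.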
334 | 335 | 336 | ----- 337 | 338 | ## Simulating rolling a die 339 | 340 | Now that you know how to simulate flipping a coin one or more times, you can 341 | do the same to simulate rolling a die: 342 | 343 | ```{r} 344 | die = 1:6 345 | 346 | # rolling a die once 347 | sample(die, size = 1) 348 | 349 | # rolling a pair of dice 350 | sample(die, size = 2, replace = TRUE) 351 | 352 | # rolling a die 5 times 353 | sample(die, size = 5, replace = TRUE) 354 | ``` 355 | 356 | 357 | -------------------------------------------------------------------------------- /scripts/12-chance-process.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/12-chance-process.pdf -------------------------------------------------------------------------------- /scripts/Makefile: -------------------------------------------------------------------------------- 1 | # input files 2 | RMDS = $(wildcard *.Rmd) 3 | 4 | # output files 5 | PDFS = $(patsubst %.Rmd, %.pdf, $(RMDS)) 6 | HTMLS = $(patsubst %.Rmd, %.html, $(RMDS)) 7 | 8 | 9 | .PHONY: all htmls clean 10 | 11 | 12 | all: $(PDFS) 13 | 14 | 15 | htmls: $(HTMLS) 16 | 17 | 18 | %.pdf: %.Rmd 19 | Rscript -e "library(rmarkdown); render('$<', output_format = 'pdf_document')" 20 | 21 | 22 | %.html: %.Rmd 23 | Rscript -e "library(rmarkdown); render('$<', output_format = 'html_document')" 24 | 25 | 26 | clean: 27 | rm -rf *.pdf *.html 28 | -------------------------------------------------------------------------------- /scripts/README.md: -------------------------------------------------------------------------------- 1 | # Intro Stats Scripts 2 | 3 | This folder contains the Rmd scripts used in lecture as well as out-of-class for the introductory courses to Probability and Statistics at UC Berkeley. 
4 | 5 | 6 | -------------------------------------------------------------------------------- /scripts/images/karl-pearson.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/images/karl-pearson.jpg -------------------------------------------------------------------------------- /scripts/images/western-conference-standings-2016.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/images/western-conference-standings-2016.png -------------------------------------------------------------------------------- /syllabus/README.md: -------------------------------------------------------------------------------- 1 | ## Syllabus 2 | 3 | > - [Stat 20](syllabus-stat20.md) 4 | > - [Stat 131A](syllabus-stat131A.md) 5 | 6 | ![](mrs-mutner-rules.jpg) -------------------------------------------------------------------------------- /syllabus/mrs-mutner-rules.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/syllabus/mrs-mutner-rules.jpg -------------------------------------------------------------------------------- /syllabus/syllabus-stat131A.md: -------------------------------------------------------------------------------- 1 | ## Course Syllabus Stat 131A 2 | 3 | Stat 131A: Introduction to Probability and Statistics for Life Scientists, Spring 2017 4 | 5 | - __Instructor:__ Gaston Sanchez, gaston.stat[at]gmail.com 6 | - __Class Time:__ MWF 2-3pm in 50 Birge 7 | - __Session Dates:__ 01/18/17 - 05/05/17 8 | - __Code #:__ 23461 9 | - __Units:__ 4 (more info [here](http://classes.berkeley.edu/content/2017-spring-stat-131A-001-lec-001)) 10 | 
- __Office Hours:__ TuTh 11:30am-12:30pm in 309 Evans (or by appointment) 11 | - __Text:__ Statistics, 4th edition (by Freedman, Pisani, and Purves) 12 | - __Final:__ Tue, May-09, 11:30am-2:30pm 13 | - __GSIs:__ Office hours of the GSIs will be posted on the bCourses page. You can go to the office hours of __any__ GSI, not just your own. 14 | 15 | | Discussion | Date | Room | GSI | 16 | |------------|-----------|------------------|--------------| 17 | | 101 | MW 3-4pm | 250 Sutardja Dai | Shuhui Huang | 18 | | 102 | MW 3-4pm | B51 Hildebrand | Andy Mao | 19 | | 103 | MW 4-5pm | 9 Evans | Shuhui Huang | 20 | | 104 | MW 5-6pm | 70 Evans | Andy Mao | 21 | 22 | 23 | ### Description 24 | 25 | __Statistics 131A__ is a course designed primarily as an introductory course for statistical thinking. You do need to be comfortable with math at the level of intermediate algebra (e.g., the equation of a straight line, plotting points, taking powers and roots, percentages). 26 | 27 | The emphasis of the course is critical thinking about quantitative evidence. Topics include reasoning and fallacies, descriptive statistics, association, correlation, regression, elements of probability, set theory, chance variability, random variables, expectation, standard error, sampling, hypothesis tests, confidence intervals, experiments and observational studies. 28 | 29 | 30 | ### Methods of Instruction 31 | 32 | Using a combination of lecture and student participation, each class session will focus on learning the fundamentals. The required textbook is the classic __Statistics__ 4th edition by Freedman, Pisani and Purves. 33 | 34 | I firmly believe that one cannot do statistical computations without the help of good statistical software. In this course, you will be asked to do various assignments and practical work using the [statistical software R](https://www.r-project.org/). The main idea is to use R as a supporting tool to help you apply the key concepts of the textbook. 
35 | 36 | 37 | ### Homework Assignments 38 | 39 | - Homework assignments will be assigned almost every week (about 13 HW). 40 | - You will submit your homework via bCourses electronically (as a Word, text, pdf, or html file). 41 | - I will drop your lowest HW score. 42 | - Don't wait until the last hour to do an assignment. Plan ahead and pace yourself. 43 | - Don't wait until the last minute to submit your assignment. 44 | - __No late assignments__ will be accepted, for any reason, including, but not limited to, theft or any extraordinary circumstances (e.g. illness, exhaustion, mourning, loss of internet connection, bCourses is down, broken computer). 45 | - Note that answers to non-review questions are in the back of the book so you can check that your answer is correct. 46 | - Solutions to the review exercises will be posted on bCourses. 47 | - If you collaborate with other students when working on a HW assignment, please include the names of those students in your submission. 48 | - You must write your own answers (using your own words). Copying and plagiarism will not be tolerated (see _Academic Honesty_ policy). 49 | 50 | 51 | ### Discussion 52 | 53 | - Discussion is an important part of the class and is meant to supplement lecture. 54 | - Your GSI will review and expand on concepts introduced in lecture and encourage you to problem solve in groups. 55 | - There will be about 4 or 5 short quizzes given in discussion to test your understanding. 56 | - Your quiz scores __will NOT__ be part of your grade. 57 | - Students must attend the discussion group they are officially registered in. 58 | 59 | 60 | ### Exams 61 | 62 | - There will be two 50-minute in-class midterms, and one 3-hour final exam. 63 | - The tentative dates of the midterms are Friday Feb-24, and Friday Apr-07. 64 | - The final exam is currently scheduled for Tuesday, May 9th from 11:30am-2:30pm (classroom to be announced). 65 | - If you do not take the final, you will NOT pass the class. 
66 | - There will be __no early or makeup exams__. 67 | - ~~To ask for regrading, you must answer a test using pen. Tests answered with pencil will not be accepted for regrading.~~ 68 | - We will use _Gradescope_ to grade your tests (so you can use pen or pencil). 69 | - When asking for regrading, please clearly state the reasons that make you think you deserve a higher score. 70 | - You will have one full week after grades are published on Gradescope to request a regrade. 71 | - After the regrade deadline, no requests will be considered. 72 | 73 | 74 | ### Grading Structure 75 | 76 | - 20% homework (lowest 1 dropped) 77 | - 25% midterm 1 78 | - 25% midterm 2 79 | - 30% final 80 | 81 | No individual letter grades will be given for the midterms or the final. You will get a letter grade for the course that is based on your overall score. Your final grade will be assigned on a 30/30/30/10 (A/B/C/DF) scale. 82 | 83 | 84 | ### Calculator Policy 85 | 86 | - You will need a calculator that adds, subtracts, multiplies, divides, takes square roots, raises numbers to a power, and preferably also computes factorials. Statistical calculators are unnecessary. 87 | - However, no graphing calculators, phone calculators or tablet calculators are allowed. 88 | - If you do not bring a calculator to a midterm or the final, do your computations by hand (you won't be allowed to borrow someone else's calculator). 89 | 90 | 91 | ### Academic Honesty 92 | 93 | I (Gaston Sanchez) expect you to do your own work and to uphold the standards of intellectual integrity. Collaborating on homework is fine and I encourage you to work together---but copying is not, nor is having somebody else submit assignments for you. Cheating will not be tolerated. Anyone found cheating will receive an F and will be reported to the [Center for Student Conduct](http://sa.berkeley.edu/conduct). 
If you are having trouble with an assignment or studying for an exam, or if you are uncertain about permissible and impermissible conduct or collaboration, please come see me with your questions. 94 | 95 | 96 | ### Email Policy 97 | 98 | - You should try to use email as a tool to set up a one-on-one meeting with me if office hours conflict with your schedule. 99 | - Use the subject line __Stat 131 Meeting Request__. 100 | - Your message should include at least two times when you would like to meet and a brief (one-two sentence) description of the reason for the meeting. 101 | - Do NOT expect me to reply right away (I may not reply on time). 102 | - If you have an emergency, talk to me later during class or office hours. 103 | - I strongly encourage you to ask questions about the syllabus, covered material, and assignments during class time or lab discussions. 104 | - I prefer to have conversations in person rather than via email, thus allowing us to get to know each other better and fostering a more collegial learning atmosphere. 105 | 106 | 107 | ### Accommodation Policy 108 | 109 | Students needing accommodations for any physical, psychological, or learning disability, should speak with me during the first two weeks of the semester, either after class or during office hours and see [http://dsp.berkeley.edu](http://dsp.berkeley.edu) to learn about Berkeley’s policy. If you are a DSP student, please contact me at least three weeks prior to a midterm or final so that we can work out acceptable accommodations via the DSP Office. 110 | 111 | If you are an athlete or Cal band member, please check your calendar and come see me as soon as possible to OH during the first two weeks of the semester. Please try your best to be present at each of the midterms as I cannot guarantee accommodation for a late exam. 
112 | 113 | 114 | ### Safe, Supportive, and Inclusive Environment 115 | 116 | Whenever a faculty member, staff member, post-doc, or GSI is responsible for 117 | the supervision of a student, a personal relationship between them of a 118 | romantic or sexual nature, even if consensual, is against university policy. 119 | Any such relationship jeopardizes the integrity of the educational process. 120 | 121 | Although faculty and staff can act as excellent resources for students, you 122 | should be aware that they are required to report any violations of this campus 123 | policy. If you wish to have a confidential discussion on matters related to this 124 | policy, you may contact the _Confidential Care Advocates_ on campus for support 125 | related to counseling or sensitive issues. Appointments can be 126 | made by calling (510) 642-1988. 127 | 128 | The classroom, lab, and work place should be safe and inclusive environments 129 | for everyone. The _Office for the Prevention of Harassment and Discrimination_ 130 | (OPHD) is responsible for ensuring the University provides an environment for 131 | faculty, staff and students that is free from discrimination and harassment on 132 | the basis of categories including race, color, national origin, age, sex, gender, 133 | gender identity, and sexual orientation. Questions or concerns? 134 | Call (510) 643-7985, email ask_ophd@berkeley.edu, or go to 135 | [http://survivorsupport.berkeley.edu/](http://survivorsupport.berkeley.edu/). 136 | 137 | 138 | ### Incomplete Policy 139 | 140 | Under emergency/special circumstances, students may petition me to receive an Incomplete grade. By University policy, for a student to get an Incomplete requires (i) that the student was performing passing-level work until the time that (ii) something happened that---through no fault of the student---prevented the student from completing the coursework. 
If you take the final, you completed the course, even if you took it while ill, exhausted, mourning, etc. The time to talk to me about incomplete grades is BEFORE you take the final, when the situation that prevents you from finishing presents itself. Please clearly state your reasoning in your comments to me. 141 | 142 | It is your responsibility to develop good time management skills, good studying habits, know your limits, and learn to ask for professional help. 143 | Life happens. Social, family, cultural, academic, and individual circumstances can affect your performance (both positively and negatively). If you find yourself in a situation that raises concerns about passing the course, please come see me as soon as possible. Above all, please do not wait till the end of the semester to share your concerns about passing the course because it will be too late by then. 144 | 145 | 146 | ### Letters of Recommendation 147 | 148 | Unless I have known you for at least one year, and we have developed a good collegial relationship, I do not provide letters of recommendation. 149 | 150 | 151 | ### Additional Course Policies 152 | 153 | - Be sure to pay attention to deadlines. 154 | - Out of consideration for everybody in the classroom, please turn off your cell phone during class and lab time. 155 | 156 | 157 | 158 | ### Fine Print 159 | 160 | The course deadlines, assignments, exam times and material are subject to change at the whim of the professor. 
161 | 162 | -------------------------------------------------------------------------------- /syllabus/syllabus-stat20.md: -------------------------------------------------------------------------------- 1 | ## Course Syllabus Stat 20 2 | 3 | Stat 20: Introduction to Probability and Statistics, Spring 2017 4 | 5 | - __Instructor:__ Gaston Sanchez, gaston.stat[at]gmail.com 6 | - __Class Time:__ MWF 12-1pm in 2050 VLSB 7 | - __Session Dates:__ 01/18/17 - 05/05/17 8 | - __Code #:__ 23407 9 | - __Units:__ 4 (more info [here](http://classes.berkeley.edu/content/2017-spring-stat-20-001-lec-001)) 10 | - __Office Hours:__ TuTh 11:30am-12:30pm in 309 Evans (or by appointment) 11 | - __Text:__ Statistics, 4th edition (by Freedman, Pisani, and Purves) 12 | - __Final:__ Wed, May-10, 3:00-6:00pm 13 | - __GSIs:__ Office hours of the GSIs will be posted on the bCourses page. You can go to the office hours of __any__ GSI, not just your own. 14 | 15 | | Discussion | Date | Room | GSI | 16 | |------------|--------------|--------------|-----------------| 17 | | 101 | TuTh 9-10A | 332 Evans | Yoni Ackerman | 18 | | 102 | TuTh 9-10A | 334 Evans | Yizhou Zhao | 19 | | 103 | TuTh 10-11A | 332 Evans | Yoni Ackerman | 20 | | 104 | TuTh 10-11A | 334 Evans | Yizhou Zhao | 21 | | 105 | TuTh 11-12P | 332 Evans | Mingjia Chen | 22 | | 106 | TuTh 11-12P | 334 Evans | Jill Berkin | 23 | | 107 | TuTh 12-1P | 332 Evans | Mingjia Chen | 24 | | 108 | TuTh 1-2P | 332 Evans | Yanli Fan | 25 | | 109 | TuTh 2-3P | 334 Evans | Yanli Fan | 26 | | 110 | TuTh 2-3P | 340 Evans | Rohit Bahirwani | 27 | | 111 | TuTh 3-4P | 334 Evans | Shalika Gupta | 28 | | 112 | TuTh 3-4P | 340 Evans | Rohit Bahirwani | 29 | | 113 | TuTh 4-5P | 334 Evans | Calvin Chi | 30 | | 114 | TuTh 5-6P | 334 Evans | Calvin Chi | 31 | | 115 | TuTh 5-6P | 205 Dwinelle | Jill Berkin | 32 | | 116 | TuTh 5-6P | 187 Dwinelle | Shalika Gupta | 33 | 34 | 35 | ### Description 36 | 37 | __Statistics 20__ is a course designed primarily as an 
introductory course for statistical thinking. You do need to be comfortable with math at the level of intermediate algebra (e.g., the equation of a straight line, plotting points, taking powers and roots, percentages). 38 | 39 | The emphasis of the course is critical thinking about quantitative evidence. Topics include reasoning and fallacies, descriptive statistics, association, correlation, regression, elements of probability, set theory, chance variability, random variables, expectation, standard error, sampling, hypothesis tests, confidence intervals, experiments and observational studies. 40 | 41 | 42 | ### Methods of Instruction 43 | 44 | Using a combination of lecture and student participation, each class session will focus on learning the fundamentals. The required textbook is the classic __Statistics__ 4th edition by Freedman, Pisani and Purves. 45 | 46 | I firmly believe that one cannot do statistical computations without the help of good statistical software. In this course, you will be asked to do various assignments and practical work using the [statistical software R](https://www.r-project.org/). The main idea is to use R as a supporting tool to help you apply the key concepts of the textbook. 47 | 48 | 49 | ### Homework Assignments 50 | 51 | - Homework assignments will be assigned almost every week (about 13 HW). 52 | - You will submit your homework via bCourses electronically (as a Word, text, pdf, or html file). 53 | - I will drop your lowest HW score. 54 | - Don't wait until the last hour to do an assignment. Plan ahead and pace yourself. 55 | - Don't wait until the last minute to submit your assignment. 56 | - __No late assignments__ will be accepted, for any reason, including, but not limited to, theft or any extraordinary circumstances (e.g. illness, exhaustion, mourning, loss of internet connection, bCourses is down, broken computer). 
57 | - Note that answers to non-review questions are in the back of the book so you can check that your answer is correct. 58 | - Solutions to the review exercises will be posted on bCourses. 59 | - If you collaborate with other students when working on a HW assignment, please include the names of those students in your submission. 60 | - You must write your own answers (using your own words). Copying and plagiarism will not be tolerated (see _Academic Honesty_ policy). 61 | 62 | 63 | ### Discussion 64 | 65 | - Discussion is an important part of the class and is meant to supplement lecture. 66 | - Your GSI will review and expand on concepts introduced in lecture and encourage you to problem solve in groups. 67 | - There will be about 4 or 5 short quizzes given in discussion to test your understanding. 68 | - Your quiz scores __will NOT__ be part of your grade. 69 | - Students must attend the discussion group they are officially registered in. 70 | 71 | 72 | ### Exams 73 | 74 | - There will be two 50-minute in-class midterms, and one 3-hour final exam. 75 | - The tentative dates of the midterms are Friday Feb-24, and Friday Apr-07. 76 | - The final exam is currently scheduled for Wednesday, May 10th from 3:00pm-6:00pm (classroom to be announced). 77 | - If you do not take the final, you will NOT pass the class. 78 | - There will be __no early or makeup exams__. 79 | - ~~To ask for regrading, you must answer a test using pen. Tests answered with pencil will not be accepted for regrading.~~ 80 | - We will use _Gradescope_ to grade your tests (so you can use pen or pencil). 81 | - When asking for regrading, please clearly state the reasons that make you think you deserve a higher score. 82 | - You will have one full week after grades are published on Gradescope to request a regrade. 83 | - After the regrade deadline, no requests will be considered. 
84 | 85 | 86 | 87 | ### Grading Structure 88 | 89 | - 20% homework (lowest 1 dropped) 90 | - 25% midterm 1 91 | - 25% midterm 2 92 | - 30% final 93 | 94 | No individual letter grades will be given for the midterms or the final. You will get a letter grade for the course that is based on your overall score. Your final grade will be assigned on a 30/30/30/10 (A/B/C/DF) scale. 95 | 96 | 97 | ### Calculator Policy 98 | 99 | - You will need a calculator that adds, subtracts, multiplies, divides, takes square roots, raises numbers to a power, and preferably also computes factorials. Statistical calculators are unnecessary. 100 | - However, no graphing calculators, phone calculators or tablet calculators are allowed. 101 | - If you do not bring a calculator to a midterm or the final, do your computations by hand (you won't be allowed to borrow someone else's calculator). 102 | 103 | 104 | ### Academic Honesty 105 | 106 | I (Gaston Sanchez) expect you to do your own work and to uphold the standards of intellectual integrity. Collaborating on homework is fine and I encourage you to work together---but copying is not, nor is having somebody else submit assignments for you. Cheating will not be tolerated. Anyone found cheating will receive an F and will be reported to the [Center for Student Conduct](http://sa.berkeley.edu/conduct). If you are having trouble with an assignment or studying for an exam, or if you are uncertain about permissible and impermissible conduct or collaboration, please come see me with your questions. 107 | 108 | 109 | ### Email Policy 110 | 111 | - You should try to use email as a tool to set up a one-on-one meeting with me if office hours conflict with your schedule. 112 | - Use the subject line __Stat 20 Meeting Request__. 113 | - Your message should include at least two times when you would like to meet and a brief (one-two sentence) description of the reason for the meeting. 114 | - Do NOT expect me to reply right away (I may not reply on time). 
115 | - If you have an emergency, talk to me later during class or office hours. 116 | - I strongly encourage you to ask questions about the syllabus, covered material, and assignments during class time or lab discussions. 117 | - I prefer to have conversations in person rather than via email, thus allowing us to get to know each other better and fostering a more collegial learning atmosphere. 118 | 119 | 120 | ### Accommodation Policy 121 | 122 | Students needing accommodations for any physical, psychological, or learning disability, should speak with me during the first two weeks of the semester, either after class or during office hours and see [http://dsp.berkeley.edu](http://dsp.berkeley.edu) to learn about Berkeley’s policy. If you are a DSP student, please contact me at least three weeks prior to a midterm or final so that we can work out acceptable accommodations via the DSP Office. 123 | 124 | If you are an athlete or Cal band member, please check your calendar and come see me as soon as possible to OH during the first two weeks of the semester. Please try your best to be present at each of the midterms as I cannot guarantee accommodation for a late exam. 125 | 126 | 127 | ### Safe, Supportive, and Inclusive Environment 128 | 129 | Whenever a faculty member, staff member, post-doc, or GSI is responsible for 130 | the supervision of a student, a personal relationship between them of a 131 | romantic or sexual nature, even if consensual, is against university policy. 132 | Any such relationship jeopardizes the integrity of the educational process. 133 | 134 | Although faculty and staff can act as excellent resources for students, you 135 | should be aware that they are required to report any violations of this campus 136 | policy. If you wish to have a confidential discussion on matters related to this 137 | policy, you may contact the _Confidential Care Advocates_ on campus for support 138 | related to counseling or sensitive issues. 
Appointments can be 139 | made by calling (510) 642-1988. 140 | 141 | The classroom, lab, and workplace should be safe and inclusive environments 142 | for everyone. The _Office for the Prevention of Harassment and Discrimination_ 143 | (OPHD) is responsible for ensuring the University provides an environment for 144 | faculty, staff and students that is free from discrimination and harassment on 145 | the basis of categories including race, color, national origin, age, sex, gender, 146 | gender identity, and sexual orientation. Questions or concerns? 147 | Call (510) 643-7985, email ask_ophd@berkeley.edu, or go to 148 | [http://survivorsupport.berkeley.edu/](http://survivorsupport.berkeley.edu/). 149 | 150 | 151 | ### Incomplete Policy 152 | 153 | Under emergency/special circumstances, students may petition me to receive an Incomplete grade. By University policy, an Incomplete requires that (i) the student was performing passing-level work until (ii) something happened that---through no fault of the student---prevented the student from completing the coursework. If you take the final, you have completed the course, even if you took it while ill, exhausted, mourning, etc. The time to talk to me about incomplete grades is BEFORE you take the final, when the situation that prevents you from finishing presents itself. Please clearly state your reasoning in your comments to me. 154 | 155 | It is your responsibility to develop good time management skills and good study habits, to know your limits, and to learn to ask for professional help. 156 | Life happens. Social, family, cultural, academic, and individual circumstances can affect your performance (both positively and negatively). If you find yourself in a situation that raises concerns about passing the course, please come see me as soon as possible. Above all, please do not wait till the end of the semester to share your concerns about passing the course because it will be too late by then.
157 | 158 | 159 | ### Letters of Recommendation 160 | 161 | Unless I have known you for at least one year, and we have developed a good collegial relationship, I do not provide letters of recommendation. 162 | 163 | 164 | ### Additional Course Policies 165 | 166 | - Be sure to pay attention to deadlines. 167 | - Out of consideration for everybody in the classroom, please turn off your cell phone during class and lab time. 168 | 169 | 170 | 171 | ### Fine Print 172 | 173 | The course deadlines, assignments, exam times and material are subject to change at the whim of the professor. 174 | 175 | --------------------------------------------------------------------------------