├── .gitignore
├── README.md
├── apps
    ├── README.md
    ├── ch03-histograms
    │   ├── README.md
    │   └── app.R
    ├── ch08-corr-coeff-diagrams
    │   ├── README.md
    │   └── app.R
    ├── ch10-heights-data
    │   ├── README.md
    │   ├── app.R
    │   └── helpers.R
    ├── ch11-regression-residuals
    │   ├── README.md
    │   └── app.R
    ├── ch11-regression-strips
    │   ├── README.md
    │   ├── app.R
    │   └── helpers.R
    ├── ch16-chance-error
    │   ├── README.md
    │   └── app.R
    ├── ch17-demere-games
    │   ├── README.md
    │   └── app.R
    ├── ch17-expected-value-std-error
    │   ├── README.md
    │   ├── app.R
    │   └── helpers.R
    ├── ch18-coin-tossing
    │   ├── README.md
    │   └── app.R
    ├── ch18-roll-dice-product
    │   ├── README.md
    │   ├── app.R
    │   └── helpers.R
    ├── ch18-roll-dice-sum
    │   ├── README.md
    │   ├── app.R
    │   └── helpers.R
    ├── ch20-sampling-men
    │   ├── README.md
    │   └── app.R
    ├── ch21-accuracy-percentages
    │   ├── README.md
    │   ├── app.R
    │   └── helpers.R
    └── ch23-accuracy-averages
    │   ├── README.md
    │   ├── app.R
    │   └── helpers.R
├── data
    ├── abalone.csv
    ├── distributions.csv
    ├── galton.csv
    ├── nba_players.csv
    ├── pearson.csv
    ├── stock-earnings-prices.csv
    └── vegetables-smoking.csv
├── hw
    ├── README.md
    ├── hw01-questions.pdf
    ├── hw02-questions.pdf
    ├── hw03-questions.pdf
    ├── hw04-questions.pdf
    ├── hw05-questions.pdf
    ├── hw06-questions.pdf
    ├── hw07-questions.pdf
    ├── hw08-questions.pdf
    ├── hw09-questions.pdf
    ├── hw10-questions.pdf
    ├── hw11-questions.pdf
    └── hw12-questions.pdf
├── labs
    └── README.md
├── lectures
    └── README.md
├── other
    ├── Karl-Pearson-and-the-origins-of-modern-statistics.pdf
    ├── Quetelet-and-the-emergence-of-the-behavioral-sciences.pdf
    ├── The-strange-science-of-Francis-Galton.pdf
    ├── formula-sheet-final.pdf
    ├── formula-sheet-midterm1.pdf
    ├── formula-sheet-midterm2.pdf
    ├── standard-normal-table.pdf
    ├── t-table.pdf
    └── z-table.pdf
├── scripts
    ├── 01-R-introduction.Rmd
    ├── 01-R-introduction.pdf
    ├── 02-data-variables.Rmd
    ├── 02-data-variables.pdf
    ├── 03-histograms.Rmd
    ├── 03-histograms.pdf
    ├── 04-measures-center.Rmd
    ├── 04-measures-center.pdf
    ├── 05-measures-spread.Rmd
    ├── 05-measures-spread.pdf
    ├── 06-normal-curve.Rmd
    ├── 06-normal-curve.pdf
    ├── 07-scatter-diagrams.Rmd
    ├── 07-scatter-diagrams.pdf
    ├── 08-correlation.Rmd
    ├── 08-correlation.pdf
    ├── 09-regression-line.Rmd
    ├── 09-regression-line.pdf
    ├── 10-prediction-and-errors-in-regression.Rmd
    ├── 10-prediction-and-errors-in-regression.pdf
    ├── 11-binomial-formula.Rmd
    ├── 11-binomial-formula.pdf
    ├── 12-chance-process.Rmd
    ├── 12-chance-process.pdf
    ├── Makefile
    ├── README.md
    └── images
    │   ├── karl-pearson.jpg
    │   └── western-conference-standings-2016.png
└── syllabus
    ├── README.md
    ├── mrs-mutner-rules.jpg
    ├── syllabus-stat131A.md
    └── syllabus-stat20.md


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Mac specific
 2 | *.DS_Store
 3 | 
 4 | # latex specific
 5 | *.aux
 6 | *.log
 7 | 
 8 | # files in labs/
 9 | labs/.DS_Store
10 | labs/*.html
11 | 
12 | # files in data/
13 | data/.DS_Store
14 | data/.Rhistory
15 | 
16 | # files in scripts/
17 | scripts/.DS_Store
18 | scripts/.Rhistory
19 | 
20 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## About
 2 | 
 3 | This repository holds the course materials for the Spring 2017 edition of:
 4 | 
 5 | - Stat 20 __Introduction to Probability and Statistics__ at UC Berkeley.
 6 | - Stat 131A __Introduction to Probability and Statistics for Life Scientists__ at UC Berkeley.
 7 | 
 8 | 
 9 | ## Contents
10 | 
11 | - [Syllabus](syllabus): Course logistics and policies.
12 | - [Lectures](lectures): Calendar of weekly topics and lecture materials.
13 | - [HW Assignments](hw): Weekly assignments.
14 | - [Labs](labs): Topics from textbook for lab.
15 | - [Scripts](scripts): Tutorial R scripts.
16 | - [Apps](apps): Shiny apps used in lecture demos.
17 | - [Data](data): Data sets.
18 | - [Other](other): Other resources (e.g. tables, articles).
19 | 
20 | 
21 | ## R and RStudio
22 | 
23 | We will use the statistical software __[R](https://www.r-project.org/)__ and the 
24 | [IDE](https://en.wikipedia.org/wiki/Integrated_development_environment) 
25 | __[RStudio](https://www.rstudio.com/)__ as computational tools to
26 | practice and apply the key concepts of the course.
27 | 
28 | Both R and RStudio are free, and are available for Mac OS X, Windows, and Linux. 	
29 | 
30 | To install R (Binary version):
31 | 
32 | - Mac: [https://cran.cnr.berkeley.edu/bin/macosx/](https://cran.cnr.berkeley.edu/bin/macosx/)
33 | - Windows: [https://cran.cnr.berkeley.edu/bin/windows/](https://cran.cnr.berkeley.edu/bin/windows/)
34 | 
35 | To install RStudio (free desktop version): 
36 | 
37 | - RStudio Desktop version [https://www.rstudio.com/products/rstudio/download/](https://www.rstudio.com/products/rstudio/download/)
38 | 
39 | 
40 | -----
41 | 
42 | ### License
43 | 
44 | <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
45 | 
46 | Author: [Gaston Sanchez](http://gastonsanchez.com)
47 | 


--------------------------------------------------------------------------------
/apps/README.md:
--------------------------------------------------------------------------------
 1 | # Shiny Apps
 2 | 
 3 | This is a collection of Shiny apps to be used mainly during lecture to illustrate some of the concepts in the textbook _Statistics_ (FPP) 4th edition.
 4 | 
 5 | 
 6 | ## Running the apps
 7 | 
 8 | The easiest way to run an app is with the `runGitHub()` function from the `"shiny"` package. Please make sure you have installed the package `"shiny"`. In case of doubt, run:
 9 | 
10 | ```R
11 | install.packages("shiny")
12 | ```
13 | 
14 | 
15 | For instance, to run the app contained in the [ch03-histograms](ch03-histograms) folder, run the following code in R:
16 | 
17 | ```R
18 | library(shiny)
19 | 
20 | # Run an app from a subdirectory in the repo
21 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch03-histograms")
22 | ```
23 | 


--------------------------------------------------------------------------------
/apps/ch03-histograms/README.md:
--------------------------------------------------------------------------------
 1 | # Histograms for NBA players data
 2 | 
 3 | This is a Shiny app that generates histograms using data of NBA players from the 2015-2016 season.
 4 | 
 5 | 
 6 | ## Motivation
 7 | 
 8 | The goal is to provide examples of histograms and distributions of quantitative variables, as described in __Statistics, Chapter 3: The Histogram__ (pages 32-56):
 9 | 
10 | - A _variable_ is a characteristic of the subjects in a study. It can be either qualitative or quantitative.
11 | - A _histogram_ is a visual display used to look at the distribution of a quantitative variable.
12 | - A _histogram_ represents percents by area. It consists of a set of blocks. The area of each block represents the percentage of cases in the corresponding class interval.
13 | - With the _density scale_, the height of each block equals the percentage of cases in the corresponding class interval, divided by the length of that interval.
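The density-scale rule above can be checked numerically. The following is a minimal sketch with made-up values (the vector `x` and the `breaks` are illustrative assumptions, not taken from `nba_players.csv`); it uses only base R's `hist()`.

```R
# Illustrative data (assumed values, not taken from nba_players.csv)
x <- c(70, 72, 73, 75, 75, 76, 78, 79, 80, 84)
breaks <- c(70, 75, 80, 85)

# Compute the histogram without drawing it; class intervals are
# right-closed and the lowest value (70) falls in the first interval
h <- hist(x, breaks = breaks, plot = FALSE)

# Percentage of cases in each class interval
percents <- 100 * h$counts / sum(h$counts)

# Density scale: height = percentage / length of the interval
heights <- percents / diff(h$breaks)

# Area of each block (height times width) recovers the percentage
areas <- heights * diff(h$breaks)

print(heights)     # percent per unit on the horizontal axis
print(sum(areas))  # the areas add up to 100 percent
```

Note that `hist()` reports `h$density` as a proportion per unit, so the heights computed here are `100 * h$density`.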
14 | 
15 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
16 | 
17 | 
18 | ## Data
19 | 
20 | The data set is in the `nba_players.csv` file (see `data/` folder) which contains 528 rows and 39 columns, although this app only uses quantitative variables.
21 | 
22 | 
23 | ## How to run it?
24 | 
25 | 
26 | ```R
27 | library(shiny)
28 | 
29 | # Easiest way is to use runGitHub
30 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch03-histograms")
31 | ```
32 | 
33 | 


--------------------------------------------------------------------------------
/apps/ch03-histograms/app.R:
--------------------------------------------------------------------------------
 1 | # Title: Histograms with NBA Players Data
 2 | # Description: this app uses data of NBA players to show various histograms
 3 | # Author: Gaston Sanchez
 4 | 
 5 | library(shiny)
 6 | 
 7 | # data set
 8 | nba <- read.csv('../../data/nba_players.csv', header = TRUE)
 9 | 
10 | # quantitative variables
11 | quantitative <- c(
12 |   "height","weight","salary","experience","age","games","games_started",
13 |   "minutes_played","field_goals","field_goal_attempts","field_goal_percent",
14 |   "points3","points3_attempts","points3_percent","points2","points2_attempts",
15 |   "points2_percent","effective_field_goal_percent","free_throws",
16 |   "free_throw_attempts","free_throw_percent","offensive_rebounds",
17 |   "defensive_rebounds","total_rebounds","assists","steals","blocks",
18 |   "turnovers","fouls","points")
19 | 
20 | # select just quantitative variables
21 | dat <- nba[ ,quantitative]
22 | 
23 | 
24 | # Define UI for application that draws a histogram
25 | ui <- fluidPage(
26 |    
27 |    # Application title
28 |    titlePanel("NBA Players"),
29 |    
30 |    # Sidebar with a slider input for number of bins 
31 |    sidebarLayout(
32 |       sidebarPanel(
33 |          selectInput("variable", "Select a Variable", 
34 |                     choices = colnames(dat), selected = 'height'),
35 |         
36 |          sliderInput("bins",
37 |                      "Number of bins:",
38 |                      min = 1,
39 |                      max = 50,
40 |                      value = 10),
41 |          
42 |          checkboxInput('density', label = strong('Use density scale'))
43 |       ),
44 |       
45 |       # Show a plot of the generated distribution
46 |       mainPanel(
47 |          plotOutput("histogram")
48 |       )
49 |    )
50 | )
51 | 
52 | 
53 | # Define server logic required to draw a histogram
54 | server <- function(input, output) {
55 |    
56 |    output$histogram <- renderPlot({
57 |       # generate bins based on input$bins from ui.R
58 |       x    <- na.omit(dat[ ,input$variable])
59 |       bins <- seq(min(x), max(x), length.out = input$bins + 1)
60 | 
61 |       histogram <- hist(x, breaks = bins, 
62 |            probability = input$density,
63 |            col = 'gray80', border = 'white', las = 1, 
64 |            axes = FALSE, xlab = "",
65 |            main = paste("Histogram of", input$variable))
66 |       axis(side = 2, las = 1)
67 |       axis(side = 1, at = bins, labels = round(bins, 2))
68 |       
69 |    })
70 | }
71 | 
72 | # Run the application 
73 | shinyApp(ui = ui, server = server)
74 | 
75 | 


--------------------------------------------------------------------------------
/apps/ch08-corr-coeff-diagrams/README.md:
--------------------------------------------------------------------------------
 1 | # Correlation Coefficient Diagrams
 2 | 
 3 | This is a Shiny app that generates scatter diagrams based on the specified correlation coefficient.
 4 | 
 5 | 
 6 | ## Motivation
 7 | 
 8 | The goal is to provide examples of scatter diagrams like those displayed in the FPP book. See Chapter 8: The Correlation Coefficient (page 127).
 9 | 
10 | The scatter diagrams are based on randomly generated data following a multivariate normal distribution.
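As a rough illustration of how such data can be generated, here is a minimal sketch using `mvrnorm()` from the `MASS` package, the same function the app calls. The names `rho` and `sigma` and the sample size are assumptions made for this example.

```R
library(MASS)  # provides mvrnorm()

set.seed(1234)
rho <- 0.7
# Correlation matrix of two standardized variables (assumed setup)
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# Random draws: the sample correlation is close to, but not exactly, rho
xy <- mvrnorm(n = 500, mu = c(0, 0), Sigma = sigma)
r_sample <- cor(xy[, 1], xy[, 2])

# With empirical = TRUE, mvrnorm() rescales the draws so that the
# sample covariance matrix matches Sigma
xy_exact <- mvrnorm(n = 500, mu = c(0, 0), Sigma = sigma, empirical = TRUE)
r_exact <- cor(xy_exact[, 1], xy_exact[, 2])

print(c(r_sample, r_exact))
```

The app uses the default `empirical = FALSE`, so the correlation seen in a given diagram varies slightly around the value chosen on the slider.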
11 | 
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
13 | 
14 | 
15 | ## How to run it?
16 | 
17 | 
18 | ```R
19 | library(shiny)
20 | 
21 | # Easiest way is to use runGitHub
22 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch08-corr-coeff-diagrams")
23 | ```
24 | 
25 | 


--------------------------------------------------------------------------------
/apps/ch08-corr-coeff-diagrams/app.R:
--------------------------------------------------------------------------------
 1 | #
 2 | # This is a Shiny web application. You can run the application by clicking
 3 | # the 'Run App' button above.
 4 | #
 5 | # Find out more about building applications with Shiny here:
 6 | #
 7 | #    http://shiny.rstudio.com/
 8 | #
 9 | 
10 | library(shiny)
11 | library(MASS)
12 | 
13 | 
14 | # Define UI for application that draws a scatter diagram
15 | ui <- fluidPage(
16 |    
17 |    # Application title
18 |    titlePanel("Scatter Diagrams and Correlation"),
19 |    
20 |    # Sidebar with inputs for the seed, correlation, and point appearance
21 |    sidebarLayout(
22 |       sidebarPanel(
23 |         numericInput("seed",
24 |                      "Random seed",
25 |                      min = 100,
26 |                      max = 99999,
27 |                      value = 1234),
28 |         sliderInput("corr",
29 |                      "Correlation Coefficient",
30 |                      min = -1,
31 |                      max = 1,
32 |                      step = 0.05,
33 |                      value = 0.7),
34 |          sliderInput("size",
35 |                       "Number of points",
36 |                       min = 10,
37 |                       max = 5000,
38 |                       step = 5,
39 |                       value = 500),
40 |          sliderInput("cex",
41 |                      "Size of points",
42 |                      min = 0,
43 |                      max = 5,
44 |                      step = 0.1,
45 |                      value = 1),
46 |         sliderInput("alpha",
47 |                     "Transparency of points",
48 |                     min = 0,
49 |                     max = 1,
50 |                     step = 0.01,
51 |                     value = 0.8)
52 |       ),
53 |       
54 |       # Show a plot of the generated distribution
55 |       mainPanel(
56 |          plotOutput("scatterplot")
57 |       )
58 |    )
59 | )
60 | 
61 | # Define server logic required to draw the scatter diagram
62 | server <- function(input, output) {
63 |    
64 |    output$scatterplot <- renderPlot({
65 |      # generate bivariate normal data with the requested correlation
66 |      set.seed(input$seed)
67 |      cor_matrix = matrix(c(1, input$corr, input$corr, 1), 2)
68 |      xy = mvrnorm(input$size, c(0, 0), cor_matrix)
69 |      plot(xy, type = "n", axes=FALSE, xlab="", ylab="",
70 |           xlim=c(-3, 3), ylim=c(-3, 3))
71 |      abline(h=0, v=0, col="gray80", lwd = 2)
72 |      points(xy[,1], xy[,2], pch=20, cex=input$cex,
73 |             col=rgb(0.45, 0.59, 0.84, alpha = input$alpha))
74 |    })
75 | }
76 | 
77 | # Run the application 
78 | shinyApp(ui = ui, server = server)
79 | 
80 | 


--------------------------------------------------------------------------------
/apps/ch10-heights-data/README.md:
--------------------------------------------------------------------------------
 1 | # Regression Scatterplot for Pearson's Height data
 2 | 
 3 | This is a Shiny app that generates a scatter diagram to illustrate the regression method using Pearson's heights data set.
 4 | 
 5 | 
 6 | ## Motivation
 7 | 
 8 | The goal is to provide a visual display of some of the concepts from __Statistics, Chapter 10: Regression__ (pages 158-165):
 9 | 
10 | - Point of averages
11 | - SD line
12 | - Graph of averages
13 | - Regression line
14 | - Correlation coefficient
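The relationship between these quantities can be sketched in a few lines of R. The data below are simulated (an assumption for illustration, not the Pearson data set), and the variable names are hypothetical.

```R
# Simulated father/son heights in inches (assumed values for illustration)
set.seed(10)
father <- rnorm(200, mean = 68, sd = 2.7)
son <- 35 + 0.5 * father + rnorm(200, mean = 0, sd = 2.4)

# Point of averages
avg_x <- mean(father)
avg_y <- mean(son)

# Correlation coefficient and SDs
r <- cor(father, son)
sd_x <- sd(father)
sd_y <- sd(son)

# SD line: through the point of averages, slope +/- SD_y / SD_x
sd_slope <- sign(r) * sd_y / sd_x

# Regression line: through the point of averages, slope r * SD_y / SD_x
reg_slope <- r * sd_y / sd_x
reg_intercept <- avg_y - reg_slope * avg_x

# Compare against the least-squares fit from lm()
fit <- lm(son ~ father)
print(coef(fit))
print(c(reg_intercept, reg_slope))
```

R's `sd()` divides by n - 1 while the textbook SD divides by n, but the ratio of the two SDs is the same either way, so both slopes are unaffected.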
15 | 
16 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
17 | 
18 | 
19 | ## Data
20 | 
21 | This app uses Pearson's height data. The data is in the `pearson.csv` file (see `data/` folder), which contains 1078 rows and 2 columns: 
22 | 
23 | - `Father`: The father's height, in inches
24 | - `Son`: The height of the son, in inches
25 | 
26 | The app uses both variables: `Father` and `Son`.
27 | 
28 | Original source: [http://www.math.uah.edu/stat/data/Pearson.csv](http://www.math.uah.edu/stat/data/Pearson.csv)
29 | 
30 | 
31 | ## How to run it?
32 | 
33 | There are many ways to download the app and run it:
34 | 
35 | ```R
36 | library(shiny)
37 | 
38 | # Easiest way is to use runGitHub
39 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch10-heights-data")
40 | ```
41 | 


--------------------------------------------------------------------------------
/apps/ch10-heights-data/app.R:
--------------------------------------------------------------------------------
  1 | #
  2 | # This is a Shiny web application. You can run the application by clicking
  3 | # the 'Run App' button above.
  4 | #
  5 | # Find out more about building applications with Shiny here:
  6 | #
  7 | #    http://shiny.rstudio.com/
  8 | #
  9 | 
 10 | library(shiny)
 11 | source('helpers.R')
 12 | 
 13 | # read the data set (its column names populate the variable selectors)
 14 | dat <- read.csv('../../data/pearson.csv')
 15 | 
 16 | # Define UI for application that draws a scatter diagram
 17 | ui <- fluidPage(
 18 |    
 19 |    # Application title
 20 |    titlePanel("Pearson's Height Data Set"),
 21 |    
 22 |    # Define the sidebar with one input
 23 |    sidebarPanel(
 24 |      selectInput("xvar", "X-axis variable", 
 25 |                  choices = colnames(dat), selected = 'Father'),
 26 |      selectInput("yvar", "Y-axis variable", 
 27 |                  choices = colnames(dat), selected = 'Son'),
 28 |      sliderInput("cex", 
 29 |                  label = "Size of points",
 30 |                  min = 0, max = 3, value = 2, step = 0.1),
 31 |      checkboxInput('reg_line', label = strong('Regression line')),
 32 |      checkboxInput('point_avgs', label = strong('Point of Averages')),
 33 |      checkboxInput('sd_line', label = strong('SD line')),
 34 |      checkboxInput('sd_guides', label = strong('SD guides')),
 35 |      sliderInput("breaks", 
 36 |                  label = "Graph of Averages",
 37 |                  min = 0, max = 10, value = 0, step = 1),
 38 |      hr(),
 39 |      helpText('Correlation:'),
 40 |      verbatimTextOutput("correlation")
 41 |    ),
 42 |    
 43 |    # Show a plot of the generated distribution
 44 |    mainPanel(
 45 |      plotOutput("datPlot")
 46 |    )
 47 | )
 48 | 
 49 | 
 50 | # Define server logic required to draw the scatter diagram
 51 | server <- function(input, output) {
 52 |    
 53 |   # Correlation
 54 |   output$correlation <- renderPrint({ 
 55 |     cor(dat[,input$xvar], dat[,input$yvar])
 56 |   })
 57 |   
 58 |   # Fill in the spot we created for a plot
 59 |   output$datPlot <- renderPlot({
 60 |     # standard deviations
 61 |     sdx <- sd(dat[,input$xvar])
 62 |     sdy <- sd(dat[,input$yvar])
 63 |     avgx <- mean(dat[,input$xvar])
 64 |     avgy <- mean(dat[,input$yvar])
 65 |     
 66 |     # Render scatterplot
 67 |     plot(dat[,input$xvar], dat[,input$yvar],
 68 |          main = 'scatter diagram', type = 'n', axes = FALSE,
 69 |          xlab = paste(input$xvar, "height (in)"), 
 70 |          ylab = paste(input$yvar, "height (in)"))
 71 |     box()
 72 |     axis(side = 1)
 73 |     axis(side = 2, las = 1)
 74 |     points(dat[,input$xvar], dat[,input$yvar],
 75 |            pch = 21, col = 'white', bg = '#777777aa',
 76 |            lwd = 2, cex = input$cex)
 77 |     # Point of Averages
 78 |     if (input$point_avgs) {
 79 |       points(avgx, avgy, 
 80 |              pch = 21, col = 'white', bg = 'tomato',
 81 |              lwd = 3, cex = 3)
 82 |     }
 83 |     # SD line
 84 |     if (input$sd_line) {
 85 |       cor_xy <- cor(dat[,input$xvar], dat[,input$yvar])
 86 |       if (cor_xy >= 0) {
 87 |         sd_line <- line_equation(avgx - sdx, avgy - sdy, avgx + sdx, avgy + sdy)
 88 |         abline(a = sd_line$intercept, b = sd_line$slope, 
 89 |                lwd = 4, lty = 2, col = 'orange')
 90 |       } else {
 91 |         sd_line <- line_equation(avgx + sdx, avgy - sdy, avgx - sdx, avgy + sdy)
 92 |         abline(a = sd_line$intercept, b = sd_line$slope, 
 93 |                lwd = 4, lty = 2, col = 'orange')
 94 |       }
 95 |     }
 96 |     # SD guides
 97 |     if (input$sd_guides) {
 98 |       abline(v = c(avgx - sdx, avgx + sdx), 
 99 |              h = c(avgy - sdy, avgy + sdy), 
100 |              lty = 1, lwd = 3, col = '#FFA600aa')
101 |     }
102 |     # Graph of averages
103 |     if (input$breaks > 1) {
104 |       graph_avgs <- averages(dat[,input$xvar], dat[,input$yvar],
105 |                              breaks = input$breaks)
106 |       points(graph_avgs$x, graph_avgs$y, pch = "+",
107 |              col = '#ff6700', cex = 3)
108 |     }    
109 |     # Regression line
110 |     if (input$reg_line) {
111 |       reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
112 |       abline(reg = reg, lwd = 4, col = '#4878DF')
113 |     }
114 |     
115 |   }, height = 650, width = 650)
116 | }
117 | 
118 | 
119 | # Run the application 
120 | shinyApp(ui = ui, server = server)
121 | 
122 | 


--------------------------------------------------------------------------------
/apps/ch10-heights-data/helpers.R:
--------------------------------------------------------------------------------
 1 | # computes the slope and intercept terms of a line between two points
 2 | line_equation <- function(x1, y1, x2, y2) {
 3 |   slope <- (y2 - y1) / (x2 - x1)
 4 |   intercept <- y1 - slope*x1
 5 |   list(intercept = intercept, slope = slope)
 6 | }
 7 | 
 8 | # computes x,y averages depending on a given number of intervals (x-axis)
 9 | # (to be used for showing graph of averages)
10 | averages <- function(x, y, breaks = 5) {
11 |   x_cut<- cut(x, breaks = breaks)
12 |   y_averages <- as.vector(tapply(y, x_cut, mean))
13 |   x_boundaries <- gsub('\\(', '', levels(x_cut))
14 |   x_boundaries <- gsub('\\]', '', x_boundaries)
15 |   x_boundaries <- strsplit(x_boundaries, ',')
16 |   x1 <- as.numeric(sapply(x_boundaries, function(u) u[1]))
17 |   x2 <- as.numeric(sapply(x_boundaries, function(u) u[2]))
18 |   x_midpoints <- x1 + (x2 - x1) / 2
19 |   list(x = x_midpoints, y = y_averages)
20 | }
21 | 


--------------------------------------------------------------------------------
/apps/ch11-regression-residuals/README.md:
--------------------------------------------------------------------------------
 1 | # Regression Residuals
 2 | 
 3 | This is a Shiny app that generates two graphs: 1) a scatterplot with a 
 4 | regression line, and 2) a residual plot from the fitted regression line.
 5 | 
 6 | 
 7 | ## Motivation
 8 | 
 9 | The goal is to illustrate the concepts of homoscedastic and heteroscedastic  
10 | residuals described in 
11 | __Statistics, Chapter 11: The R.M.S. Error for Regression__ (pages 180-201).
12 | 
13 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger 
14 | Purves (2007). Fourth Edition. Norton & Company.
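The app reports the r.m.s. error as the square root of the mean squared residual. As a small sketch on simulated data (the values and variable names are assumptions for illustration), the same number can be obtained from the correlation coefficient and the SD of y, with the SD computed using n in the denominator:

```R
# Simulated data (assumed values, not taken from nba_players.csv)
set.seed(11)
x <- runif(100, min = 70, max = 85)
y <- 3 * x + rnorm(100, sd = 10)

# r.m.s. error computed directly from the regression residuals
fit <- lm(y ~ x)
rms_error <- sqrt(mean(residuals(fit)^2))

# Shortcut: sqrt(1 - r^2) times the SD of y (SD with n in the denominator)
r <- cor(x, y)
sd_y_pop <- sqrt(mean((y - mean(y))^2))
rms_shortcut <- sqrt(1 - r^2) * sd_y_pop

print(c(rms_error, rms_shortcut))
```

The shortcut describes the overall spread around the line; a residual plot is still needed to see whether that spread is similar across the range of x.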
15 | 
16 | 
17 | ## Data
18 | 
19 | This app uses the data from NBA basketball players in the 2015-2016 season. 
20 | The csv file `nba_players.csv` is in the `data/` folder of the GitHub repository. 
21 | 
22 | 
23 | ## How to run it?
24 | 
25 | 
26 | ```R
27 | library(shiny)
28 | 
29 | # Easiest way is to use runGitHub
30 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch11-regression-residuals")
31 | ```
32 | 


--------------------------------------------------------------------------------
/apps/ch11-regression-residuals/app.R:
--------------------------------------------------------------------------------
  1 | # Title: Regression residuals
  2 | # Description: this app uses NBA players data to show a regression line and its residual plot
  3 | # Author: Gaston Sanchez
  4 | 
  5 | library(shiny)
  6 | 
  7 | # read the data set
  8 | nba <- read.csv('../../data/nba_players.csv')
  9 | 
 10 | # quantitative variables
 11 | quantitative <- c(
 12 |   "height","weight","experience","age","games","games_started",
 13 |   "minutes_played","field_goals","field_goal_attempts","field_goal_percent",
 14 |   "points3","points3_attempts","points3_percent","points2","points2_attempts",
 15 |   "points2_percent","effective_field_goal_percent","free_throws",
 16 |   "free_throw_attempts","free_throw_percent","offensive_rebounds",
 17 |   "defensive_rebounds","total_rebounds","assists","steals","blocks",
 18 |   "turnovers","fouls","points")
 19 | 
 20 | # select just quantitative variables
 21 | dat <- nba[ ,quantitative]
 22 | 
 23 | # Define UI for application that draws a scatter diagram and a residual plot
 24 | ui <- fluidPage(
 25 |   # Give the page a title
 26 |   titlePanel("NBA Players"),
 27 |   
 28 |   # Generate a row with a sidebar
 29 |   sidebarLayout(      
 30 |     
 31 |     # Define the sidebar with one input
 32 |     sidebarPanel(
 33 |       selectInput("xvar", "X-axis variable", 
 34 |                   choices = colnames(dat), selected = 'height'),
 35 |       selectInput("yvar", "Y-axis variable", 
 36 |                   choices = colnames(dat), selected = 'weight'),
 37 |       hr(),
 38 |       helpText('Correlation:'),
 39 |       verbatimTextOutput("correlation"),
 40 |       helpText('r.m.s. error:'),
 41 |       verbatimTextOutput("rms_error")
 42 |     ),
 43 |     
 44 |     # Create a spot for the two plots
 45 |     mainPanel(
 46 |       plotOutput("datPlot"),
 47 |       plotOutput("residualPlot")
 48 |     )
 49 |     
 50 |   )
 51 | )
 52 | 
 53 | 
 54 | # Define server logic required to draw both plots
 55 | server <- function(input, output) {
 56 |   
 57 |   # Correlation
 58 |   output$correlation <- renderPrint({ 
 59 |     cor(dat[,input$xvar], dat[,input$yvar])
 60 |   })
 61 |   
 62 |   # r.m.s. error
 63 |   output$rms_error <- renderPrint({
 64 |     reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
 65 |     sqrt(mean((reg$residuals)^2))
 66 |   })
 67 |   
 68 |   # Fill in the spot we created for a plot
 69 |   output$datPlot <- renderPlot({
 70 |     
 71 |     # Render a scatter diagram
 72 |     plot(dat[,input$xvar], dat[,input$yvar],
 73 |          main = 'scatter diagram', type = 'n', axes = FALSE,
 74 |          xlab = input$xvar, ylab = input$yvar)
 75 |     box()
 76 |     axis(side = 1)
 77 |     axis(side = 2, las = 1)
 78 |     points(dat[,input$xvar], dat[,input$yvar],
 79 |            pch = 21, col = 'white', bg = '#4878DFaa',
 80 |            lwd = 2, cex = 2)
 81 |     # regression line
 82 |     reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
 83 |     abline(reg = reg, lwd = 3, col = '#e35a6d')
 84 |     
 85 |   })
 86 |   
 87 |   # residual plot
 88 |   output$residualPlot <- renderPlot({
 89 |     reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
 90 |     # Render residual plot
 91 |     plot(dat[,input$xvar], reg$residuals, las = 1,
 92 |          main = 'Residual plot', xlab = input$xvar,
 93 |          ylab = 'residuals', col = '#ACB6F1', type = 'n')
 94 |     abline(h = 0, col = 'gray70', lwd = 2)
 95 |     points(dat[,input$xvar], reg$residuals,
 96 |            pch = 20, col = '#888888aa', cex = 2)
 97 |   })
 98 | }
 99 | 
100 | 
101 | 
102 | # Run the application 
103 | shinyApp(ui = ui, server = server)
104 | 
105 | 


--------------------------------------------------------------------------------
/apps/ch11-regression-strips/README.md:
--------------------------------------------------------------------------------
 1 | # Vertical Strips for Pearson's Height data
 2 | 
 3 | This is a Shiny app that generates a scatter diagram to illustrate the 
 4 | distribution of values on the y-axis within a vertical strip.
 5 | 
 6 | 
 7 | ## Motivation
 8 | 
 9 | The goal is to provide a visual display of some of the concepts from 
10 | __Statistics, Chapter 11: The R.M.S. Error for Regression__ (pages 180-201):
11 | 
12 | - Looking at vertical strips
13 | - Using the normal curve inside a vertical strip
14 | 
15 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger 
16 | Purves (2007). Fourth Edition. Norton & Company.
17 | 
18 | 
19 | ## Data
20 | 
21 | This app uses Pearson and Lee's height data as described in 
22 | [Pearson's Height Data](http://www.math.uah.edu/stat/data/Pearson.csv). 
23 | The data is in the `pearson.csv` file, available in the `data/` folder of 
24 | the GitHub repository.
25 | 
26 | 
27 | ## How to run it?
28 | 
29 | 
30 | ```R
31 | library(shiny)
32 | 
33 | # Easiest way is to use runGitHub
34 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch11-regression-strips")
35 | ```
36 | 


--------------------------------------------------------------------------------
/apps/ch11-regression-strips/app.R:
--------------------------------------------------------------------------------
 1 | # Title: Spread within vertical strips in regression
 2 | # Description: this app uses Pearson's height data 
 3 | # Author: Gaston Sanchez
 4 | 
 5 | library(shiny)
 6 | source('helpers.R')
 7 | 
 8 | # read the data set (its column names populate the variable selectors)
 9 | dat <- read.csv('../../data/pearson.csv')
10 | 
11 | # Define UI for application that draws a scatter diagram and a histogram
12 | ui <- fluidPage(
13 |   
14 |   # Application title
15 |   titlePanel("Pearson's Height Data Set"),
16 |   
17 |   # Define the sidebar with one input
18 |   sidebarPanel(
19 |     selectInput("xvar", "X-axis variable", 
20 |                 choices = colnames(dat), selected = 'Father'),
21 |     selectInput("yvar", "Y-axis variable", 
22 |                 choices = colnames(dat), selected = 'Son'),
23 |     checkboxInput('reg_line', label = strong('Regression line')),
24 |     sliderInput("cex", 
25 |                 label = "Size of points",
26 |                 min = 0, max = 3, value = 1.5, step = 0.1),
27 |     #checkboxInput('point_avgs', label = strong('Point of Averages')),
28 |     #checkboxInput('sd_line', label = strong('SD line')),
29 |     #checkboxInput('sd_guides', label = strong('SD guides')),
30 |     sliderInput("center", 
31 |                 label = "x location",
32 |                 min = 60, 
33 |                 max = 76, 
34 |                 value = 70, step = 0.25),
35 |     sliderInput("width", 
36 |                 label = "width",
37 |                 min = 0, 
38 |                 max = 4, 
39 |                 value = 0, step = 0.1),
40 |     hr(),
41 |     helpText('Correlation:'),
42 |     verbatimTextOutput("correlation")
43 |   ),
44 |   
45 |   # Show a plot of the generated distribution
46 |   mainPanel(
47 |     plotOutput("datPlot"),
48 |     plotOutput("histogram")
49 |   )
50 | )
51 | 
52 | 
53 | # Define server logic required to draw the scatter diagram and histogram
54 | server <- function(input, output) {
55 |   
56 |   # Correlation
57 |   output$correlation <- renderPrint({ 
58 |     cor(dat[,input$xvar], dat[,input$yvar])
59 |   })
60 |   
61 |   # Fill in the spot we created for a plot
62 |   output$datPlot <- renderPlot({
63 |     # Render scatterplot
64 |     plot(dat[,input$xvar], dat[,input$yvar],
65 |          main = 'scatter diagram', type = 'n', axes = FALSE,
66 |          xlab = input$xvar, ylab = input$yvar)
67 |     box()
68 |     axis(side = 1)
69 |     axis(side = 2, las = 1)
70 |     points(dat[,input$xvar], dat[,input$yvar],
71 |            pch = 21, col = 'white', bg = '#777777aa',
72 |            lwd = 2, cex = input$cex)
73 |     # vertical strips
74 |     abline(v = c(input$center - input$width, input$center + input$width),
75 |            lty = 1, lwd = 3, col = '#5A6DE3')
76 |     # Regression line
77 |     if (input$reg_line) {
78 |       reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
79 |       abline(reg = reg, lwd = 3, col = '#e35a6d')
80 |     }
81 |   })
82 |   
83 |   # histogram of the y-values inside the vertical strip
84 |   output$histogram <- renderPlot({
85 |     xmin <- input$center - input$width
86 |     xmax <- input$center + input$width
87 |     child <- dat[dat[,input$xvar] >= xmin & dat[,input$xvar] <= xmax, input$yvar]
88 |     hist(child, main = '', col = '#ACB6FF', las = 1)
89 |   })
90 |   
91 | }
92 | 
93 | # Run the application 
94 | shinyApp(ui = ui, server = server)
95 | 
96 | 


--------------------------------------------------------------------------------
/apps/ch11-regression-strips/helpers.R:
--------------------------------------------------------------------------------
 1 | line_equation <- function(x1, y1, x2, y2) {
 2 |   slope <- (y2 - y1) / (x2 - x1)
 3 |   intercept <- y1 - slope*x1
 4 |   list(intercept = intercept, slope = slope)
 5 | }
 6 | 
 7 | 
 8 | averages <- function(x, y, breaks = 5) {
 9 |   x_cut <- cut(x, breaks = breaks)
10 |   y_averages <- as.vector(tapply(y, x_cut, mean))
11 |   x_boundaries <- gsub('\\(', '', levels(x_cut))
12 |   x_boundaries <- gsub('\\]', '', x_boundaries)
13 |   x_boundaries <- strsplit(x_boundaries, ',')
14 |   x1 <- as.numeric(sapply(x_boundaries, function(u) u[1]))
15 |   x2 <- as.numeric(sapply(x_boundaries, function(u) u[2]))
16 |   x_midpoints <- x1 + (x2 - x1) / 2
17 |   list(x = x_midpoints, y = y_averages)
18 | }
19 | 
20 | 
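The `averages()` helper above recovers strip midpoints by parsing the interval labels that `cut()` produces. A minimal alternative sketch, assuming explicit break points are acceptable, computes the midpoints directly from the breaks. The data here are synthetic and the names `father`, `son`, `edges`, `x_mid`, and `y_avg` are illustrative only; none of this is part of the app.

```R
# Hypothetical standalone sketch: strip averages without parsing cut() labels
set.seed(1)
father <- rnorm(500, mean = 68, sd = 2.7)            # synthetic heights
son <- 35 + 0.5 * father + rnorm(500, sd = 2.4)      # synthetic heights

# six equally spaced edges define five vertical strips
edges <- seq(min(father), max(father), length.out = 6)
strip <- cut(father, breaks = edges, include.lowest = TRUE)

# average of y within each strip, and the midpoint of each strip
y_avg <- as.vector(tapply(son, strip, mean))
x_mid <- (head(edges, -1) + tail(edges, -1)) / 2

# graph of averages
plot(x_mid, y_avg, type = "b", las = 1,
     xlab = "father (strip midpoint)", ylab = "average of son")
```

Computing midpoints from `edges` avoids depending on how `cut()` rounds the numbers it prints in its labels.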


--------------------------------------------------------------------------------
/apps/ch16-chance-error/README.md:
--------------------------------------------------------------------------------
 1 | # Chance Error
 2 | 
 3 | This is a Shiny app that illustrates the concept of chance error when simulating tossing a coin a given number of times.
 4 | 
 5 | 
 6 | ## Motivation
 7 | 
 8 | The goal is to provide a visual display motivated by John Kerrich's coin-tossing experiment, described in __Statistics, Chapter 16: The Law of Averages__.
 9 | 
10 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
11 | 
12 | 
13 | ## Data
14 | 
15 | The data simulates tossing a coin using the random binomial generator function `rbinom()`. The input parameters are the number of tosses, and optionally, the probability of heads.
16 | 
17 | 
18 | ## Plot
19 | 
20 | There are two options for the displayed plot: 
21 | 
22 | 1. shows the chance error (i.e. number of heads minus the expected number of heads, which is half the number of tosses for a fair coin) on the y-axis, and the number of tosses on the x-axis.
23 | 2. shows the proportion of heads on the y-axis, and the number of tosses on the x-axis; the gap between this proportion and the chance of heads corresponds to the percent error.
24 | 
25 | 
26 | ## How to run it?
27 | 
28 | ```R
29 | library(shiny)
30 | 
31 | # Easiest way is to use runGitHub
32 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch16-chance-error")
33 | ```
34 | 


--------------------------------------------------------------------------------
/apps/ch16-chance-error/app.R:
--------------------------------------------------------------------------------
  1 | # Title: Chance Error and Percent Error
  2 | # Description: Chance error when tossing a coin (based on John Kerrich's experiment)
  3 | # Chapter 16: The Law of Averages, p 275-278
  4 | # Author: Gaston Sanchez
  5 | 
  6 | library(shiny)
  7 | 
  8 | # Define UI for application that draws a histogram
  9 | ui <- fluidPage(
 10 |   
 11 |   # Give the page a title
 12 |   titlePanel("Coin Tossing Experiment"),
 13 |   
 14 |   # Generate a row with a sidebar
 15 |   sidebarLayout(      
 16 |     
 17 |     # Define the sidebar with one input
 18 |     sidebarPanel(
 19 |       numericInput("seed", label = "Random Seed:", 12345, 
 20 |                    min = 10000, max = 50000, step = 1),
 21 |       sliderInput("chance", label = "Chance of heads:", 
 22 |                   min = 0, max = 1, value = 0.5, step = 0.01),
 23 |       sliderInput("tosses", label = "Number of tosses:", 
 24 |                   min = 100, max = 10000, value = 3000, step = 50),
 25 |       radioButtons("error", label = "Display",
 26 |                    choices = list("Chance error" = 1, 
 27 |                                   "Percent error" = 2), 
 28 |                    selected = 2),
 29 |       hr(),
 30 |       helpText('Total number of heads:'),
 31 |       verbatimTextOutput("num_heads"),
 32 |       helpText('Percentage of heads:'),
 33 |       verbatimTextOutput("prop_heads")
 34 |     ),
 35 |     
 36 |     # Create a spot for the barplot
 37 |     mainPanel(
 38 |       plotOutput("chancePlot")  
 39 |     )
 40 |   )
 41 | )
 42 | 
 43 | 
 44 | # Define server logic required to draw a histogram
 45 | server <- function(input, output) {
 46 |   
 47 |   seed <- reactive({
 48 |     input$seed
 49 |   })
 50 |   tosses <- reactive({
 51 |     input$tosses
 52 |   })
 53 |   chance <- reactive({
 54 |     input$chance
 55 |   })
 56 |   
 57 |   # Number of heads
 58 |   output$num_heads <- renderPrint({ 
 59 |     set.seed(seed())
 60 |     flips <- rbinom(n = tosses(), 1, prob = chance())
 61 |     sum(flips)
 62 |   })
 63 |   
 64 |   # Proportion of heads
 65 |   output$prop_heads <- renderPrint({ 
 66 |     set.seed(seed())
 67 |     flips <- rbinom(n = tosses(), 1, prob = chance())
 68 |     round(100 * sum(flips) / tosses(), 2)
 69 |   })
 70 |   
 71 |   # Fill in the spot we created for a plot
 72 |   output$chancePlot <- renderPlot({
 73 |     set.seed(input$seed)
 74 |     tosses <- input$tosses
 75 |     flips <- rbinom(n = tosses, 1, prob = chance())
 76 |     num_heads <- cumsum(flips)
 77 |     prop_heads <- (num_heads / 1:tosses)
 78 |     num_tosses <- 1:tosses
 79 | 
 80 |     # Chance error (observed minus expected heads) and running proportion of heads
 81 |     difference <- num_heads[num_tosses] - (chance() * num_tosses)
 82 |     proportion <- prop_heads[num_tosses]
 83 |     if (input$error == 1) {
 84 |       plot(num_tosses, difference, 
 85 |            col = '#627fe2', type = 'l', lwd = 2,
 86 |            xlab = "Number of tosses",
 87 |            ylab = '# of heads - expected # of heads',
 88 |            axes = FALSE, main = 'Chance Error:  # successes - # expected')
 89 |       abline(h = 0, col = '#88888855', lwd = 2, lty = 2)
 90 |       axis(side = 2, las = 1)
 91 |     } else {
 92 |       plot(num_tosses, proportion, ylim = c(0, 1),
 93 |            col = '#627fe2', type = 'l', lwd = 2,
 94 |            xlab = 'Number of tosses',
 95 |            ylab = 'Proportion of heads',
 96 |            axes = FALSE, main = 'Percent Error:  % successes - % expected')
 97 |       abline(h = chance(), col = '#88888855', lwd = 2, lty = 2)
 98 |       axis(side = 2, las = 1, at = seq(0, 1, 0.1))
 99 |     }
100 |     axis(side = 1)
101 |   })
102 |   
103 | }
104 | 
105 | # Run the application 
106 | shinyApp(ui = ui, server = server)
107 | 
108 | 


--------------------------------------------------------------------------------
/apps/ch17-demere-games/README.md:
--------------------------------------------------------------------------------
 1 | # Expected value and Standard Error with De Mere's Games
 2 | 
 3 | This is a Shiny app that illustrates the concept of Expected Value and Standard Error 
 4 | when simulating De Mere's games (100 times by default).
 5 | 
 6 | 
 7 | ## Motivation
 8 | 
 9 | The goal is to provide a visual display for the concepts in __Statistics, Chapter 17: The Expected Value and Standard Error.__
10 | 
11 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
12 | 
13 | 
14 | ## Data
15 | 
16 | The app allows you to simulate two main scenarios:
17 | 
18 | 1. __Rolling a fair die 4 times.__ This is actually done by drawing 4 tickets out of a box with six tickets.
19 | The box consists of one ticket `1` and five tickets `0`.
20 | 2. __Rolling a pair of dice 24 times.__ This is actually done by drawing 24 tickets out of a box with 36 tickets.
21 | The box consists of one ticket `1` and 35 tickets `0`.
22 | 
23 | 
24 | ## Plots
25 | 
26 | There are three displayed graphs.
27 | 
28 | 1. A probability distribution (theoretical probabilities for the number of tickets `1`).
29 | 2. An empirical Pareto chart (cumulative distribution) with the proportion of tickets `1` when 
30 | playing the game a given number of times.
31 | 3. A line chart with the empirical (net) gain when playing the game a given number of times.
32 | 
33 | 
34 | ## How to run it?
35 | 
36 | ```R
37 | library(shiny)
38 | 
39 | # Easiest way is to use runGitHub
40 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch17-demere-games")
41 | ```
42 | 
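The box models described in the Data section can be checked numerically. The sketch below is a standalone illustration under the same 0-1 box assumptions; `box_stats()` is a hypothetical helper, not a function defined in the app.

```R
# Hypothetical helper: expected value, standard error, and chance of
# at least one ticket 1 when drawing with replacement from a 0-1 box
box_stats <- function(tickets1, tickets0, draws) {
  total <- tickets1 + tickets0
  p <- tickets1 / total            # average of the box
  sd_box <- sqrt(p * (1 - p))      # SD of a 0-1 box
  list(
    ev = draws * p,
    se = sqrt(draws) * sd_box,
    p_at_least_one = 1 - dbinom(0, size = draws, prob = p)
  )
}

# game 1: at least one ace in 4 rolls of a die
game1 <- box_stats(tickets1 = 1, tickets0 = 5, draws = 4)
# game 2: at least one double-ace in 24 rolls of a pair of dice
game2 <- box_stats(tickets1 = 1, tickets0 = 35, draws = 24)

round(game1$p_at_least_one, 4)  # 0.5177
round(game2$p_at_least_one, 4)  # 0.4914
```

The first game is slightly favorable and the second slightly unfavorable, which is why the plots mark the 0.5 level with a horizontal reference line.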


--------------------------------------------------------------------------------
/apps/ch17-demere-games/app.R:
--------------------------------------------------------------------------------
  1 | # Title: More Expected Value and Standard Error
  2 | # Description: Simulation of De Mere's rolling dice games using 
  3 | # a box model for the number of "aces" (ticket 1).
  4 | # Author: Gaston Sanchez
  5 | 
  6 | library(shiny)
  7 | 
  8 | # Define UI for application that draws a histogram
  9 | ui <- fluidPage(
 10 |   
 11 |   # Give the page a title
 12 |   titlePanel("De Mere's games"),
 13 |   
 14 |   # Generate a row with a sidebar
 15 |   sidebarLayout(      
 16 |     
 17 |     # Define the sidebar with one input
 18 |     sidebarPanel(
 19 |       fluidRow(
 20 |         column(5, 
 21 |                numericInput("tickets1", "# Tickets 1", 1,
 22 |                             min = 1, max = 35, step = 1)),
 23 |         column(5,
 24 |                numericInput("tickets0", "# Tickets 0", 5,
 25 |                             min = 1, max = 35, step = 1))
 26 |       ),
 27 |       helpText('Avg of box, and SD of box'),
 28 |       verbatimTextOutput("avg_sd_box"),
 29 |       numericInput("draws", label = "Number of Draws:", value = 4,
 30 |                    min = 1, max = 100, step = 1),
 31 |       helpText('Expected Value and SE'),
 32 |       verbatimTextOutput("ev_se"),
 33 |       hr(),
 34 |       sliderInput("reps", label = "Number of games:", 
 35 |                   min = 1, max = 5000, value = 100, step = 1),
 36 |       helpText('Actual gain'),
 37 |       verbatimTextOutput("gain"),
 38 |       numericInput("seed", label = "Random Seed:", 12345, 
 39 |                    min = 10000, max = 50000, step = 1)
 40 |     ),
 41 |     
 42 |     # Create a spot for the barplot
 43 |     mainPanel(
 44 |       tabsetPanel(type = "tabs",
 45 |                   tabPanel("Sum", plotOutput("sumPlot")),
 46 |                   tabPanel("Pareto", plotOutput("paretoPlot")),
 47 |                   tabPanel("Games", plotOutput("gamesPlot"))
 48 |       )
 49 |     )
 50 |   )
 51 | )
 52 | 
 53 | 
 54 | # Define server logic required to draw a histogram
 55 | server <- function(input, output) {
 56 |   
 57 |   tickets <- reactive({
 58 |     tickets <- c(rep(1, input$tickets1), rep(0, input$tickets0))
 59 |   })
 60 |   
 61 |   avg_box <- reactive({
 62 |     mean(tickets())
 63 |   })
 64 |   
 65 |   sd_box <- reactive({
 66 |     total <- input$tickets1 + input$tickets0
 67 |     sqrt((input$tickets1 / total) * (input$tickets0 / total))
 68 |   })
 69 |   
 70 |   sum_draws <- reactive({
 71 |     set.seed(input$seed)
 72 |     samples <- 1:input$reps
 73 |     for (i in 1:input$reps) {
 74 |       samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE))
 75 |     }
 76 |     samples
 77 |   })
 78 |   
 79 |   # Average and SD of box
 80 |   output$avg_sd_box <- renderPrint({ 
 81 |     cat(avg_box(), ",  ", sd_box(), sep = '')
 82 |   })
 83 |   
 84 |   # Expected Value, and Standard Error
 85 |   output$ev_se <- renderPrint({ 
 86 |     ev = input$draws * avg_box()
 87 |     se = sqrt(input$draws) * sd_box()
 88 |     cat(ev, ",  ", se, sep = '')
 89 |   })
 90 | 
 91 |   # Probability Histogram
 92 |   output$sumPlot <- renderPlot({
 93 |     # Render a barplot
 94 |     total_tickets <- input$tickets1 + input$tickets0
 95 |     prob_ticket1 <- input$tickets1 / total_tickets
 96 |     probabilities <- dbinom(0:input$draws, size = input$draws, prob_ticket1)
 97 |     barplot(round(probabilities, 4), border = NA, las = 1, 
 98 |             names.arg = 0:input$draws,
 99 |             xlab = paste("Number of tickets 1"),
100 |             ylab = 'Probability',
101 |             main = paste("Probability Distribution\n", 
102 |                          "(# tickets 1)"))
103 |     abline(h = 0.5, col = '#EC5B5B99', lty = 2, lwd = 1.4)
104 |   })
105 |   
106 |   # Pareto chart: cumulative percentage of draws
107 |   output$paretoPlot <- renderPlot({
108 |     # Render a barplot
109 |     freqs_draws <- table(sum_draws()) / input$reps
110 |     freq_aux <- barplot(freqs_draws, plot = FALSE)
111 |     barplot(freqs_draws, 
112 |             ylim = c(0, 1.1),
113 |             border = NA, las = 1,
114 |             xlab = paste('Number of tickets 1 in', input$reps, 'games'),
115 |             ylab = 'Proportion',
116 |             main = paste("Empirical Cumulative Relative Frequency\n", 
117 |                          "(at least one ticket 1)"))
118 |     abline(h = 0.5, col = '#EC5B5B99', lty = 2, lwd = 1.4)
119 |     lines(freq_aux[-1], cumsum(freqs_draws[-1]), lwd = 3, col = "gray60")
120 |     points(freq_aux[-1], cumsum(freqs_draws[-1]), pch=19, col="gray30")
121 |     text(freq_aux[-1], cumsum(freqs_draws[-1]), 
122 |          round(cumsum(freqs_draws[-1]), 3), pos = 3)
123 |   })
124 |   
125 |   # Plot with gains
126 |   output$gamesPlot <- renderPlot({
127 |     results <- rep(-1, input$reps)
128 |     results[sum_draws() > 0] <- 1
129 |     plot(1:input$reps, cumsum(results), type = "n", axes = FALSE,
130 |          xlab = 'Number of games',
131 |          ylab = "Gained amount",
132 |          main = "Empirical Gain")
133 |     abline(h = 0, col = '#EC5B5B99', lty = 2, lwd = 1.4)
134 |     axis(side = 1)
135 |     axis(side = 2, las = 1, pos = 0)
136 |     lines(1:input$reps, cumsum(results), lwd = 1.5)
137 |   })
138 |   
139 |   # actual gain
140 |   output$gain <- renderPrint({ 
141 |     results <- rep(-1, input$reps)
142 |     results[sum_draws() > 0] <- 1
143 |     sum(results)
144 |   })
145 | }
146 | 
147 | # Run the application 
148 | shinyApp(ui = ui, server = server)
149 | 
150 | 


--------------------------------------------------------------------------------
/apps/ch17-expected-value-std-error/README.md:
--------------------------------------------------------------------------------
 1 | # Expected value and Standard Error
 2 | 
 3 | This is a Shiny app that illustrates the concept of Expected Value and Standard Error when simulating rolling a die (5 times by default).
 4 | 
 5 | 
 6 | ## Motivation
 7 | 
 8 | The goal is to provide a visual display for the concepts in __Statistics, Chapter 17: The Expected Value and Standard Error.__
 9 | 
10 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
11 | 
12 | 
13 | ## Data
14 | 
15 | The data simulates rolling a die 5 times by default.
16 | 
17 | 
18 | ## Plot
19 | 
20 | A bar-chart of frequencies for the sum of draws is displayed.
21 | 
22 | 
23 | ## How to run it?
24 | 
25 | ```R
26 | library(shiny)
27 | 
28 | # Easiest way is to use runGitHub
29 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch17-expected-value-std-error")
30 | ```
31 | 
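For the default setting of 5 rolls, the expected value and standard error of the sum can be worked out from the box `1, 2, 3, 4, 5, 6` and compared with a simulation. This is a minimal standalone sketch; the variable names and the choice of 1000 repetitions are illustrative and not taken from the app.

```R
# Hypothetical standalone sketch (not part of app.R)
die <- 1:6
n_rolls <- 5

avg_box <- mean(die)                       # 3.5
sd_box <- sqrt(mean((die - avg_box)^2))    # population SD of the box

ev_sum <- n_rolls * avg_box                # 17.5
se_sum <- sqrt(n_rolls) * sd_box

# empirical check: average and (population) SD of simulated sums
set.seed(12330)
sums <- replicate(1000, sum(sample(die, n_rolls, replace = TRUE)))
n <- length(sums)
c(ev = ev_sum, se = se_sum,
  avg = mean(sums), sd = sd(sums) * sqrt((n - 1) / n))
```

The factor `sqrt((n - 1) / n)` converts the sample SD returned by `sd()` into the population SD, which is the convention the book uses.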


--------------------------------------------------------------------------------
/apps/ch17-expected-value-std-error/app.R:
--------------------------------------------------------------------------------
 1 | # Title: Expected Value and Standard Error
 2 | # Description: EV and SE when rolling a die 5 times
 3 | # Chapter 17: The EV and SE, p 288-296
 4 | # Author: Gaston Sanchez
 5 | 
 6 | library(shiny)
 7 | source("helpers.R")
 8 | 
 9 | # Define UI for application that draws a histogram
10 | ui <- fluidPage(
11 |   
12 |   # Give the page a title
13 |   titlePanel("Rolling a Die"),
14 |   
15 |   # Generate a row with a sidebar
16 |   sidebarLayout(      
17 |     
18 |     # Define the sidebar with one input
19 |     sidebarPanel(
20 |       numericInput("dice", label = "Number of dice:", 5, 
21 |                    min = 1, max = 10, step = 1),
22 |       numericInput("seed", label = "Random Seed:", 12330, 
23 |                    min = 10000, max = 50000, step = 1),
24 |       sliderInput("reps", label = "Number of repetitions:", 
25 |                   min = 100, max = 10000, value = 100, step= 10),
26 |       hr(),
27 |       helpText('Average of sums:'),
28 |       verbatimTextOutput("num_heads"),
29 |       helpText('SD of sums:'),
30 |       verbatimTextOutput("prop_heads")
31 |     ),
32 |     
33 |     # Create a spot for the barplot
34 |     mainPanel(
35 |       plotOutput("chancePlot")  
36 |     )
37 |   )
38 | )
39 | 
40 | 
41 | # Define server logic required to draw a histogram
42 | server <- function(input, output) {
43 |   
44 |   # Empirical average of sum of draws
45 |   output$num_heads <- renderPrint({ 
46 |     set.seed(input$seed)
47 |     total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
48 |     # avg of sums
49 |     mean(total_points)
50 |   })
51 |   
52 |   # Empirical SD of sum of draws
53 |   output$prop_heads <- renderPrint({ 
54 |     set.seed(input$seed)
55 |     total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
56 |     # SD of sums (population SD: divides by n rather than n - 1)
57 |     sd(total_points) * sqrt((input$reps - 1)/input$reps)
58 |   })
59 | 
60 |     # Fill in the spot we created for a plot
61 |   output$chancePlot <- renderPlot({
62 |     set.seed(input$seed)
63 |     total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
64 |     # put in relative terms
65 |     prop_points <- 100 * table(total_points) / input$reps
66 |     ymax <- find_ymax(max(prop_points), 2)
67 |     # Render a barplot
68 |     barplot(prop_points, las = 1, border = "gray40",
69 |             space = 0, ylim = c(0, ymax),
70 |             main = sprintf("%s Repetitions", input$reps))
71 |   })
72 | }
73 | 
74 | # Run the application 
75 | shinyApp(ui = ui, server = server)
76 | 
77 | 


--------------------------------------------------------------------------------
/apps/ch17-expected-value-std-error/helpers.R:
--------------------------------------------------------------------------------
 1 | # helper functions to simulate rolling a die
 2 | # and adding the number of spots
 3 | 
 4 | # roll one die
 5 | roll_die <- function(times = 1) {
 6 |   die <- 1:6
 7 |   sample(die, times, replace = TRUE)
 8 | }
 9 | 
10 | 
11 | # roll a pair of dice
12 | roll_pair <- function() {
13 |   roll_die(2)
14 | }
15 | 
16 | # sum of spots
17 | sum_rolls <- function(times = 1) {
18 |   sum(roll_die(times))
19 | }
20 | 
21 | # product of numbers
22 | prod_rolls <- function(times = 1) {
23 |   prod(roll_die(times))
24 | }
25 | 
26 | # check whether 'x' is multiple of 'num'
27 | is_multiple <- function(x, num) {
28 |   x %% num == 0
29 | }
30 | 
31 | # find the y-max value for ylim in barplot()
32 | find_ymax <- function(x, num) {
33 |   if (is_multiple(x, num)) {
34 |     return(max(x))
35 |   } else {
36 |     return(num * ((x %/% num) + 1))
37 |   }
38 | }
39 | 


--------------------------------------------------------------------------------
/apps/ch18-coin-tossing/README.md:
--------------------------------------------------------------------------------
 1 | # Tossing Coins
 2 | 
 3 | This is a Shiny app that generates a probability histogram when tossing
 4 | a coin a specified number of times.
 5 | 
 6 | 
 7 | ## Motivation
 8 | 
 9 | The goal is to provide a visual display similar to the probability histograms
10 | in chapter 18 of "Statistics".
11 | 
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
13 | 
14 | 
15 | ## Data
16 | 
17 | The data computes the probabilities when tossing a coin a specified number of times.
18 | The input parameters are the number of tosses, and the chance of heads.
19 | 
20 | 
21 | ## Plot
22 | 
23 | The produced plot is a probability histogram.
24 | 
25 | 
26 | ## How to run it?
27 | 
28 | ```R
29 | library(shiny)
30 | 
31 | # Easiest way is to use runGitHub
32 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-coin-tossing")
33 | ```
34 | 
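The quantities behind the probability histogram can be sketched without Shiny. The example below is a standalone illustration with illustrative variable names; it computes the binomial probabilities, converts the number of heads to standard units, and adds up the chance within 3 standard errors of the expected value.

```R
# Hypothetical standalone sketch (not part of app.R)
tosses <- 100
chance <- 0.5

heads <- 0:tosses
probs <- dbinom(heads, size = tosses, prob = chance)

exp_value <- tosses * chance                          # 50
std_error <- sqrt(tosses * chance * (1 - chance))     # 5

# number of heads in standard units
std_units <- (heads - exp_value) / std_error

# chance of landing within 3 SEs of the expected value
within3 <- abs(std_units) <= 3
sum(probs[within3])
```

Almost all of the probability lies within 3 standard errors, which is why restricting the plotted range to that window loses very little of the histogram.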


--------------------------------------------------------------------------------
/apps/ch18-coin-tossing/app.R:
--------------------------------------------------------------------------------
 1 | # Title: Probability histograms
 2 | # Description: Probability histograms for the number of heads 
 3 | #              in "n" tosses of a coin
 4 | # Chapter 18: Normal Approx, p 316
 5 | # Author: Gaston Sanchez
 6 | 
 7 | library(shiny)
 8 | 
 9 | # Define UI for application that draws a histogram
10 | ui <- fluidPage(
11 |   
12 |   # Give the page a title
13 |   titlePanel("Tossing Coins"),
14 |   
15 |   # Generate a row with a sidebar
16 |   sidebarLayout(      
17 |     
18 |     # Define the sidebar with one input
19 |     sidebarPanel(
20 |       sliderInput("tosses", label = "Number of tosses:",  
21 |                    min = 1, max = 500, value = 100, step = 1),
22 |       sliderInput("chance", label = "Chance of heads", 
23 |                   min = 0, max = 1, value = 0.5, step= 0.05),
24 |       hr(),
25 |       helpText('Expected Value:'),
26 |       verbatimTextOutput("exp_value"),
27 |       helpText('Standard Error'),
28 |       verbatimTextOutput("std_error")
29 |     ),
30 |     
31 |     # Create a spot for the barplot
32 |     mainPanel(
33 |       plotOutput("chancePlot")  
34 |     )
35 |   )
36 | )
37 | 
38 | 
39 | # Define server logic required to draw a histogram
40 | server <- function(input, output) {
41 |   
42 |   # Expected Value
43 |   output$exp_value <- renderPrint({ 
44 |     input$tosses * input$chance
45 |   })
46 |   
47 |   # Standard Error
48 |   output$std_error <- renderPrint({ 
49 |     sqrt(input$tosses * input$chance * (1 - input$chance))
50 |   })
51 | 
52 |   # Fill in the spot we created for a plot
53 |   output$chancePlot <- renderPlot({
54 |     probs <- 100 * dbinom(0:input$tosses, 
55 |                           size = input$tosses, 
56 |                           prob = input$chance)
57 |     
58 |     exp_value <- input$tosses * input$chance
59 |     std_error <- sqrt(input$tosses * input$chance * (1 - input$chance))
60 |     
61 |     below3se <- (exp_value - 3 * std_error)
62 |     above3se <- (exp_value + 3 * std_error)
63 |     
64 |     from <- floor(below3se) + 1
65 |     to <- ceiling(above3se) + 1
66 |     
67 |     if (input$tosses >= 10 && from > 0 && to <= input$tosses + 1) {
68 |       xpos <- barplot(probs[from:to], plot = FALSE)
69 |       # Render probability histogram as a barplot
70 |       op = par(mar = c(6.5, 4.5, 4, 2))
71 |       barplot(probs[from:to], axes = FALSE, col = "gray70", 
72 |               names.arg = (from-1):(to-1), border = NA,
73 |               ylim = c(0, ceiling(max(probs))),
74 |               ylab = "probability (%)", 
75 |               main = sprintf("Probability Histogram\n %s Tosses", 
76 |                              input$tosses))
77 |       axis(side = 2, las = 1)
78 |       axis(side = 1, line = 3,
79 |            at = seq(xpos[1], xpos[length(xpos)], length.out = 7),
80 |            labels = seq(-3, 3, 1))
81 |       mtext("Standard Units", side = 1, line = 5.5)
82 |       par(op)
83 |     } else {
84 |       barplot(probs, axes = FALSE, col = "gray70", 
85 |               names.arg = 0:input$tosses, border = NA,
86 |               ylim = c(0, ceiling(max(probs))),
87 |               ylab = "probability (%)", 
88 |               main = sprintf("Probability Histogram\n %s Tosses", 
89 |                              input$tosses))
90 |       axis(side = 2, las = 1)
91 |     }
92 |   })
93 | }
94 | 
95 | # Run the application 
96 | shinyApp(ui = ui, server = server)
97 | 
98 | 


--------------------------------------------------------------------------------
/apps/ch18-roll-dice-product/README.md:
--------------------------------------------------------------------------------
 1 | # Rolling Dice: Product of Points
 2 | 
 3 | This is a Shiny app that generates empirical histograms when simulating 
 4 | rolling dice and finding the total product of spots.
 5 | 
 6 | 
 7 | ## Motivation
 8 | 
 9 | The goal is to provide a visual display similar to the empirical and 
10 | probability histograms shown on page 313 of "Statistics", chapter 18.
11 | 
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
13 | 
14 | 
15 | ## Data
16 | 
17 | The data simulates rolling a pair of dice by default (the user can choose between 
18 | one and 10 dice). The input parameters are the number of dice, the random seed, and 
19 | the number of repetitions.
20 | 
21 | 
22 | ## Plots
23 | 
24 | There are two tabs:
25 | 
26 | 1. An empirical histogram.
27 | 2. A probability histogram (probability distribution).
28 | 
29 | 
30 | ## How to run it?
31 | 
32 | ```R
33 | library(shiny)
34 | 
35 | # Easiest way is to use runGitHub
36 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-roll-dice-product")
37 | ```
38 | 


--------------------------------------------------------------------------------
/apps/ch18-roll-dice-product/app.R:
--------------------------------------------------------------------------------
 1 | # Title: Roll dice and multiply spots
 2 | # Description: Empirical vs Probability Histograms
 3 | # Chapter 18: Probability Histograms, page 313
 4 | # Author: Gaston Sanchez
 5 | 
 6 | library(shiny)
 7 | source("helpers.R")
 8 | 
 9 | # Define UI for application that draws a histogram
10 | ui <- fluidPage(
11 |   
12 |   # Give the page a title
13 |   titlePanel("Rolling Dice: Product"),
14 |   
15 |   # Generate a row with a sidebar
16 |   sidebarLayout(      
17 |     
18 |     # Define the sidebar with one input
19 |     sidebarPanel(
20 |       numericInput("dice", label = "Number of dice:", 2, 
21 |                    min = 1, max = 10, step = 1),
22 |       numericInput("seed", label = "Random Seed:", 12330, 
23 |                    min = 10000, max = 50000, step = 1),
24 |       sliderInput("reps", label = "Number of repetitions:", 
25 |                   min = 100, max = 10000, value = 100, step= 10)
26 |     ),
27 |     
28 |     # Create tabs for plots
29 |     mainPanel(
30 |       tabsetPanel(type = "tabs",
31 |                   tabPanel("Empirical", plotOutput("empiricalPlot")),
32 |                   tabPanel("Probability", plotOutput("probabilityPlot"))
33 |       )
34 |     )
35 |   )
36 | )
37 | 
38 | 
39 | # Define server logic required to draw a histogram
40 | server <- function(input, output) {
41 |   
42 |   # Empirical average of sum of draws
43 |   output$num_heads <- renderPrint({ 
44 |     set.seed(input$seed)
45 |     total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
46 |     # avg of sums
47 |     mean(total_points)
48 |   })
49 |   
50 |   # Empirical SD of sum of draws
51 |   output$prop_heads <- renderPrint({ 
52 |     set.seed(input$seed)
53 |     total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
54 |     # SD of sums (population SD: divides by n rather than n - 1)
55 |     sd(total_points) * sqrt((input$reps - 1)/input$reps)
56 |   })
57 |   
58 |   # Empirical Histogram
59 |   output$empiricalPlot <- renderPlot({
60 |     set.seed(input$seed)
61 |     total_points <- sapply(rep(input$dice, input$reps), prod_rolls)
62 |     # put in relative terms
63 |     prop_points <- 100 * table(total_points) / input$reps
64 |     ymax <- find_ymax(max(prop_points), 2)
65 |     # Render a barplot
66 |     # Frequencies of products
67 |     freq <- numeric((6^input$dice))
68 |     freq[1:(6^input$dice) %in% names(prop_points)] <- prop_points
69 |     names(freq) <- 1:(6^input$dice)
70 |     
71 |     # Render a barplot
72 |     barplot(freq, space = 0, las = 1, border = "gray40",
73 |             cex.names = 0.8, ylim = c(0, ymax),
74 |             main = sprintf("%s Repetitions", input$reps))
75 |   })
76 | 
77 |   # Probability Histogram
78 |   output$probabilityPlot <- renderPlot({
79 |     outcomes <- multiply_spots(input$dice)
80 |     freq <- rep(0, 6^input$dice)
81 |     freq[as.numeric(names(outcomes))] <- outcomes
82 |     # Render a barplot
83 |     barplot(100 * freq, names.arg = 1:6^input$dice,
84 |             las = 1, border = "gray40",
85 |             space = 0,
86 |             xlab = "Product of spots",
87 |             ylab = "Chance (%)",
88 |             main = "Probability Histogram")
89 |   })
90 |   
91 | }
92 | 
93 | # Run the application 
94 | shinyApp(ui = ui, server = server)
95 | 
96 | 


--------------------------------------------------------------------------------
/apps/ch18-roll-dice-product/helpers.R:
--------------------------------------------------------------------------------
 1 | # Helper functions to simulate rolling a die
 2 | # and adding-or-multiplying the number of spots
 3 | 
 4 | # roll one die
 5 | roll_die <- function(times = 1) {
 6 |   die <- 1:6
 7 |   sample(die, times, replace = TRUE)
 8 | }
 9 | 
10 | 
11 | # roll a pair of dice
12 | roll_pair <- function() {
13 |   roll_die(2)
14 | }
15 | 
16 | # sum of spots
17 | sum_rolls <- function(times = 1) {
18 |   sum(roll_die(times))
19 | }
20 | 
21 | # product of numbers
22 | prod_rolls <- function(times = 1) {
23 |   prod(roll_die(times))
24 | }
25 | 
26 | # check whether 'x' is multiple of 'num'
27 | is_multiple <- function(x, num) {
28 |   x %% num == 0
29 | }
30 | 
31 | # find the y-max value for ylim in barplot()
32 | find_ymax <- function(x, num) {
33 |   if (is_multiple(x, num)) {
34 |     return(max(x))
35 |   } else {
36 |     return(num * ((x %/% num) + 1))
37 |   }
38 | }
39 | 
40 | # reps <- 100
41 | # total_points <- sapply(rep(2, reps), sum_rolls)
42 | # prop_points <- 100 * table(total_points) / reps
43 | # barplot(prop_points, las = 1, 
44 | #         space = 0, ylim = c(0, 30),
45 | #         main = sprintf("%s Repetitions", reps))
46 | 
47 | 
48 | 
49 | 
50 | # function that multiplies spots to a given result
51 | multiply_rolls <- function(given) {
52 |   results <- rep(0, length(given) * 6)
53 |   aux <- 1
54 |   for (i in 1:length(given)) {
55 |     for (j in 1:6) {
56 |       results[aux] <- given[i] * j
57 |       aux <- aux + 1
58 |     }
59 |   }
60 |   results
61 | }
62 | 
63 | # function that computes theoretical probabilities
64 | # for the product of spots when rolling "k" dice
65 | multiply_spots <- function(num_dice) {
66 |   # just one die
67 |   if (num_dice == 1) {
68 |     outcomes <- table(1:6) / (6^num_dice)
69 |   } else {
70 |     # two or more dice
71 |     current <- 1:6
72 |     for (k in 2:num_dice) {
73 |       current <- multiply_rolls(current)
74 |     }
75 |     outcomes <- table(current) / (6^num_dice)
76 |   }
77 |   outcomes
78 | }
79 | 
80 | # multiply_spots(1)
81 | # multiply_spots(2)
82 | # multiply_spots(3)
83 | 
84 | 
85 | 
86 | 
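The theoretical probabilities produced by `multiply_spots()` can be cross-checked by brute-force enumeration. The sketch below is an alternative technique, not the helper's own method: it lists every outcome with `expand.grid()` and tabulates the products. `product_probs()` is a hypothetical name, and enumeration is only practical for a small number of dice.

```R
# Hypothetical cross-check: enumerate all 6^k outcomes and tabulate products
product_probs <- function(num_dice) {
  rolls <- expand.grid(rep(list(1:6), num_dice))   # one row per outcome
  products <- apply(rolls, 1, prod)
  table(products) / 6^num_dice
}

probs2 <- product_probs(2)

# a product of 6 arises from (1,6), (6,1), (2,3), (3,2): chance 4/36
probs2["6"]
```

With two dice only 18 distinct products are possible, which is why most bars between 1 and 36 in the probability histogram have zero height.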


--------------------------------------------------------------------------------
/apps/ch18-roll-dice-sum/README.md:
--------------------------------------------------------------------------------
 1 | # Rolling Dice: Sum of Points
 2 | 
 3 | This is a Shiny app that generates empirical histograms when simulating 
 4 | rolling dice and finding the total number of spots.
 5 | 
 6 | 
 7 | ## Motivation
 8 | 
 9 | The goal is to provide a visual display similar to the empirical and 
10 | probability histograms shown on page 311 of "Statistics", chapter 18.
11 | 
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
13 | 
14 | 
15 | ## Data
16 | 
17 | The data simulates rolling a pair of dice by default (the user can choose between 
18 | one and 10 dice). The input parameters are the number of dice, the random seed, and 
19 | the number of repetitions.
20 | 
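As a rough illustration of the simulation described above, the sketch below rolls a pair of dice many times and tabulates the totals as percentages. The helper name `roll_sum` and the values of `reps` and the seed are assumptions chosen for this example; the app's own helper functions live in `helpers.R`.

```R
# Hypothetical helper (not part of the app): total spots on `num_dice` dice
roll_sum <- function(num_dice) {
  sum(sample(1:6, num_dice, replace = TRUE))
}

set.seed(12330)   # fixing the seed makes the simulation reproducible
reps <- 1000      # number of repetitions (assumed value)

# one total per repetition, rolling a pair of dice each time
total_points <- replicate(reps, roll_sum(2))

# relative frequencies in percent, as plotted in an empirical histogram
prop_points <- 100 * table(total_points) / reps
prop_points
```

Changing the seed gives a different empirical histogram, while increasing `reps` should make it look more like the probability histogram.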
21 | 
22 | ## Plots
23 | 
24 | There are two tabs:
25 | 
26 | 1. An empirical histogram.
27 | 2. A probability histogram (probability distribution).
28 | 
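The probability histogram does not depend on simulation: it can be obtained by listing every equally likely outcome and counting the totals. The sketch below is one way to do that with `outer()`; the function name `exact_sum_dist` is an assumption for this example and is not the function used by the app.

```R
# Hypothetical function: exact distribution of the sum of `num_dice` dice
exact_sum_dist <- function(num_dice) {
  sums <- 1:6
  # add one more die at a time; seq_len(0) is empty, so one die needs no loop
  for (k in seq_len(num_dice - 1)) {
    sums <- as.vector(outer(sums, 1:6, "+"))
  }
  # each of the 6^num_dice outcomes is equally likely
  table(sums) / 6^num_dice
}

probs <- exact_sum_dist(2)
round(100 * probs, 2)   # chances in percent
```

The vector of outcomes has `6^num_dice` entries, so this brute-force listing becomes slow and memory hungry as the number of dice grows.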
29 | 
30 | ## How to run it?
31 | 
32 | ```R
33 | library(shiny)
34 | 
35 | # Easiest way is to use runGitHub
36 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-roll-dice-sum")
37 | ```
38 | 


--------------------------------------------------------------------------------
/apps/ch18-roll-dice-sum/app.R:
--------------------------------------------------------------------------------
 1 | # Title: Roll dice and add spots
 2 | # Description: Empirical vs Probability Histograms
 3 | # Chapter 18: Probability Histograms, page 311
 4 | # Author: Gaston Sanchez
 5 | 
 6 | library(shiny)
 7 | source("helpers.R")
 8 | 
 9 | # Define UI for application that draws a histogram
10 | ui <- fluidPage(
11 |   
12 |   # Give the page a title
13 |   titlePanel("Rolling Dice: Sum"),
14 |   
15 |   # Generate a row with a sidebar
16 |   sidebarLayout(      
17 |     
18 |     # Define the sidebar with one input
19 |     sidebarPanel(
20 |       numericInput("dice", label = "Number of dice:", 2, 
21 |                    min = 1, max = 10, step = 1),
22 |       numericInput("seed", label = "Random Seed:", 12330, 
23 |                    min = 10000, max = 50000, step = 1),
24 |       sliderInput("reps", label = "Number of repetitions:", 
25 |                   min = 100, max = 10000, value = 100, step= 10)
26 |       #hr(),
27 |       #helpText('Average of sums:'),
28 |       #verbatimTextOutput("num_heads"),
29 |       #helpText('SD of sums:'),
30 |       #verbatimTextOutput("prop_heads")
31 |     ),
32 |     
33 |     # Create tabs for plots
34 |     mainPanel(
35 |       tabsetPanel(type = "tabs",
36 |                   tabPanel("Empirical", plotOutput("empiricalPlot")),
37 |                   tabPanel("Probability", plotOutput("probabilityPlot"))
38 |       )
39 |     )
40 |   )
41 | )
42 | 
43 | 
44 | # Define server logic required to draw a histogram
45 | server <- function(input, output) {
46 |   
47 |   # Empirical average of sum of draws
48 |   output$num_heads <- renderPrint({ 
49 |     set.seed(input$seed)
50 |     total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
51 |     # avg of sums
52 |     mean(total_points)
53 |   })
54 |   
55 |   # Empirical SD of sum of draws
56 |   output$prop_heads <- renderPrint({ 
57 |     set.seed(input$seed)
58 |     total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
59 |     # SD of sums (population SD, dividing by n instead of n - 1)
60 |     sd(total_points) * sqrt((input$reps - 1)/input$reps)
61 |   })
62 |   
63 |   # Empirical Histogram
64 |   output$empiricalPlot <- renderPlot({
65 |     set.seed(input$seed)
66 |     total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
67 |     # put in relative terms
68 |     prop_points <- 100 * table(total_points) / input$reps
69 |     ymax <- find_ymax(max(prop_points), 2)
70 |     # Render a barplot
71 |     barplot(prop_points, las = 1, border = "gray40",
72 |             space = 0, ylim = c(0, ymax),
73 |             xlab = "Number of spots",
74 |             ylab = "Relative Frequency",
75 |             main = sprintf("%s Repetitions", input$reps))
76 |   })
77 | 
78 |   # Probability Histogram
79 |   output$probabilityPlot <- renderPlot({
80 |     outcomes <- sum_spots(input$dice)
81 |     # Render a barplot
82 |     barplot(100 * outcomes, 
83 |             las = 1, border = "gray40",
84 |             space = 0,
85 |             xlab = "Number of spots",
86 |             ylab = "Chance (%)",
87 |             main = "Probability Histogram")
88 |   })
89 |   
90 | }
91 | 
92 | # Run the application 
93 | shinyApp(ui = ui, server = server)
94 | 
95 | 


--------------------------------------------------------------------------------
/apps/ch18-roll-dice-sum/helpers.R:
--------------------------------------------------------------------------------
 1 | # Helper functions to simulate rolling a die
 2 | # and adding-or-multiplying the number of spots
 3 | 
 4 | # roll one die
 5 | roll_die <- function(times = 1) {
 6 |   die <- 1:6
 7 |   sample(die, times, replace = TRUE)
 8 | }
 9 | 
10 | 
11 | # roll a pair of dice
12 | roll_pair <- function() {
13 |   roll_die(2)
14 | }
15 | 
16 | # sum of spots
17 | sum_rolls <- function(times = 1) {
18 |   sum(roll_die(times))
19 | }
20 | 
21 | # product of numbers
22 | prod_rolls <- function(times = 1) {
23 |   prod(roll_die(times))
24 | }
25 | 
26 | # check whether 'x' is multiple of 'num'
27 | is_multiple <- function(x, num) {
28 |   x %% num == 0
29 | }
30 | 
31 | # find the y-max value for ylim in barplot()
32 | find_ymax <- function(x, num) {
33 |   if (is_multiple(x, num)) {
34 |     return(max(x))
35 |   } else {
36 |     return(num * ((x %/% num) + 1))
37 |   }
38 | }
39 | 
40 | # reps <- 100
41 | # total_points <- sapply(rep(2, reps), sum_rolls)
42 | # prop_points <- 100 * table(total_points) / reps
43 | # barplot(prop_points, las = 1, 
44 | #         space = 0, ylim = c(0, 30),
45 | #         main = sprintf("%s Repetitions", reps))
46 | 
47 | 
48 | 
49 | 
50 | # function that adds spots to a given result
51 | add_rolls <- function(given) {
52 |   results <- rep(0, length(given) * 6)
53 |   aux <- 1
54 |   for (i in 1:length(given)) {
55 |     for (j in 1:6) {
56 |       results[aux] <- given[i] + j
57 |       aux <- aux + 1
58 |     }
59 |   }
60 |   results
61 | }
62 | 
63 | # function that computes theoretical probabilities
64 | # for the addition of spots when rolling "k" dice
65 | sum_spots <- function(num_dice) {
66 |   # just one die
67 |   if (num_dice == 1) {
68 |     outcomes <- table(1:6) / (6^num_dice)
69 |   } else {
70 |     # two or more dice
71 |     current <- 1:6
72 |     for (k in 2:num_dice) {
73 |       current <- add_rolls(current)
74 |     }
75 |     outcomes <- table(current) / (6^num_dice)
76 |   }
77 |   outcomes
78 | }
79 | 
80 | # sum_spots(1)
81 | # sum_spots(2)
82 | # sum_spots(3)
83 | 
84 | 
85 | 
86 | 


--------------------------------------------------------------------------------
/apps/ch20-sampling-men/README.md:
--------------------------------------------------------------------------------
 1 | # Sampling Men
 2 | 
 3 | This is a Shiny app that illustrates the concept of chance errors in sampling.
 4 | 
 5 | 
 6 | ## Motivation
 7 | 
 8 | The goal is to provide a visual display for the Introduction example in 
 9 | __Statistics, Chapter 20: Chance Errors in Sampling__
10 | 
11 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
12 | 
13 | 
14 | ## Data
15 | 
16 | The data consists of a box model with 6672 tickets: 3091 __1's__, and 3581 __0's__. 
17 | The 1's tickets represent men, while the 0's represent women.
18 | The app simulates taking samples from the box. There are two parameters: the sample size and the number of samples (i.e. # of repetitions).
19 | 
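The box model above can be sketched in a few lines of R. The sample size, number of repetitions, and seed below are assumed values for illustration. Draws are made without replacement, which is the default behavior of `sample()`.

```R
# box with 3091 tickets marked 1 (men) and 3581 tickets marked 0 (women)
tickets <- rep(c(1, 0), c(3091, 3581))

set.seed(12345)
size <- 100   # sample size (assumed value)
reps <- 200   # number of samples (assumed value)

# number of men in each sample, drawing without replacement
num_men <- replicate(reps, sum(sample(tickets, size = size)))

# percentage of men in each sample
perc_men <- 100 * num_men / size

mean(perc_men)          # average of the sample percentages
100 * mean(tickets)     # percentage of men in the box, for comparison
```

The sample percentages vary around the percentage of men in the box; that variation is the chance error.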
20 | 
21 | ## Plots
22 | 
23 | There are two plots: 
24 | 
25 | 1. The first tab shows a histogram with the number of men in the samples.
26 | 2. The second tab shows a histogram with the percentage of men in the samples.
27 | 
28 | 
29 | ## How to run it?
30 | 
31 | There are many ways to download the app and run it:
32 | 
33 | ```R
34 | library(shiny)
35 | 
36 | # Easiest way is to use runGitHub
37 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch20-sampling-men")
38 | ```
39 | 


--------------------------------------------------------------------------------
/apps/ch20-sampling-men/app.R:
--------------------------------------------------------------------------------
  1 | # Title: Sampling Men
  2 | # Description: Chance errors in sampling (box model of men and women)
  3 | # Chapter 20: Chance Errors in Sampling, page 359
  4 | # Author: Gaston Sanchez
  5 | 
  6 | library(shiny)
  7 | 
  8 | # Define UI for application that draws a histogram
  9 | ui <- fluidPage(
 10 |   
 11 |   # Give the page a title
 12 |   titlePanel("Sampling Men (p 359)"),
 13 |   
 14 |   # Generate a row with a sidebar
 15 |   sidebarLayout(      
 16 |     
 17 |     # Define the sidebar with one input
 18 |     sidebarPanel(
 19 |       fluidRow(
 20 |         column(5, 
 21 |                numericInput("tickets1", "men [1]", 3091,
 22 |                             min = 1, max = 4000, step = 1)),
 23 |         column(5,
 24 |                numericInput("tickets0", "women [0]", 3581,
 25 |                             min = 1, max = 4000, step = 1))
 26 |       ),
 27 | #      helpText('Avg box,  SD box'),
 28 |       verbatimTextOutput("avg_sd_box"),
 29 |       numericInput("size", label = "Sample Size (# draws):", value = 100,
 30 |                    min = 10, max = 1500, step = 1),
 31 |       sliderInput("reps", label = "Number of repetitions:", 
 32 |                   min = 50, max = 2000, value = 100, step = 50),
 33 |       numericInput("seed", label = "Random Seed:", 12345, 
 34 |                    min = 10000, max = 50000, step = 1),
 35 |       hr(),
 36 |       helpText('Number average'),
 37 |       verbatimTextOutput("num_avg"),
 38 |       helpText('Percent average'),
 39 |       verbatimTextOutput("perc_avg")
 40 |     ),
 41 |     
 42 |     # Create tabs for plots
 43 |     mainPanel(
 44 |       tabsetPanel(type = "tabs",
 45 |                   tabPanel("Number", plotOutput("numberPlot")),
 46 |                   tabPanel("Percentage", plotOutput("percentPlot"))
 47 |       )
 48 |     )
 49 |   )
 50 | )
 51 | 
 52 | 
 53 | # Define server logic required to draw a histogram
 54 | server <- function(input, output) {
 55 |   
 56 |   # Average and SD of the box
 57 |   output$avg_sd_box <- renderPrint({
 58 |     total <- input$tickets1 + input$tickets0
 59 |     avg_box <- input$tickets1 / total
 60 |     sd_box <- sqrt((input$tickets1/total) * (input$tickets0/total))
 61 |     cat(sprintf('Avg = %0.3f,  SD = %0.3f', avg_box, sd_box))
 62 |   })
 63 | 
 64 |   num_men <- reactive({
 65 |     tickets <- rep(c(1, 0), c(input$tickets1, input$tickets0))
 66 |     
 67 |     set.seed(input$seed)
 68 |     size <- input$size
 69 |     samples <- 1:input$reps
 70 |     for (i in 1:input$reps) {
 71 |       samples[i] <- sum(sample(tickets, size = size))
 72 |     }
 73 |     samples
 74 |   })
 75 |   
 76 |   # Number of men
 77 |   output$num_avg <- renderPrint({ 
 78 |     round(mean(num_men()), 2)
 79 |   })
 80 |   
 81 |   # Percentage of men
 82 |   output$perc_avg <- renderPrint({ 
 83 |     round(100 * mean(num_men() / input$size), 2)
 84 |   })
 85 |   
 86 |   # Plot with number of men in samples
 87 |   output$numberPlot <- renderPlot({
 88 |     # Render a barplot
 89 |     barplot(table(num_men()), 
 90 |             space = 0, las = 1,
 91 |             xlab = 'Number of men',
 92 |             ylab = '',
 93 |             main = 'Sample Men')
 94 |   })
 95 |   
 96 |   # Plot with percentage of men in samples
 97 |   output$percentPlot <- renderPlot({
 98 |     # Render a barplot
 99 |     percentage_men <- round(100 * num_men() / input$size)
100 |     barplot(table(percentage_men) / length(num_men()), 
101 |             space = 0, las = 1,
102 |             xlab = 'Percentage of men',
103 |             ylab = 'Proportion',
104 |             main = 'Sample Men')
105 |   })
106 |   
107 | }
108 | 
109 | # Run the application 
110 | shinyApp(ui = ui, server = server)
111 | 
112 | 


--------------------------------------------------------------------------------
/apps/ch21-accuracy-percentages/README.md:
--------------------------------------------------------------------------------
 1 | # Ch21 - Percent Estimation
 2 | 
 3 | This is a Shiny app that illustrates the concept of accuracy of percentages.
 4 | In other words, it shows confidence intervals when estimating a percentage.
 5 | 
 6 | 
 7 | ## Motivation
 8 | 
 9 | The goal is to provide a visual display for the various examples in 
10 | __Statistics, Chapter 21: Accuracy of Percentages__
11 | 
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). 
13 | Fourth Edition. Norton & Company.
14 | 
15 | 
16 | ## Data
17 | 
18 | - The data consists of a box model with two types of tickets: 0's and 1's.
19 | - The user can specify the number of both types of tickets (# of 1's, # of 0's).
20 | - The app simulates drawing tickets with replacement from the box. 
21 | - There are three arguments:
22 |     + the number of draws (i.e. sample size)
23 |     + the number of samples (i.e. # of repetitions)
24 |     + the confidence level
25 | 
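To make the roles of these arguments concrete, here is a minimal sketch of one confidence interval for a percentage. It assumes a box with five 1's and five 0's, 25 draws, and a 68% level, and it uses the SD of the box itself (not an estimate from the sample) to compute the standard error. All variable names are chosen for this example.

```R
n1 <- 5; n0 <- 5                      # tickets in the box (assumed values)
box <- c(rep(1, n1), rep(0, n0))
draws <- 25                           # sample size
level <- 68                           # confidence level in percent

p <- mean(box)                        # fraction of 1's in the box
sd_box <- sqrt(p * (1 - p))           # SD of a 0-1 box
se_perc <- 100 * sd_box / sqrt(draws) # SE for the sample percentage

# multiplier from the normal curve for a central area of `level` percent
z <- qnorm((1 + level / 100) / 2)

set.seed(12345)
sample_perc <- 100 * mean(sample(box, draws, replace = TRUE))
ci <- c(lower = sample_perc - z * se_perc,
        upper = sample_perc + z * se_perc)
ci
```

With these values the SE for the percentage is 10 percentage points, so a 68% interval extends about one SE on each side of the sample percentage.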
26 | 
27 | ## Plots
28 | 
29 | There are three plots: 
30 | 
31 | 1. The first tab shows a histogram for the sum of draws.
32 | 2. The second tab shows a histogram for the percentage of tickets 1's.
33 | 3. The third tab shows a chart with the percentage of the box (i.e. population percentage),
34 | and the confidence intervals of the drawn samples.
35 | 
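The idea behind the third plot can be sketched as a coverage check: build one interval per sample and count how many contain the percentage of the box. The values below are assumptions for illustration, and the code works with proportions instead of percentages.

```R
set.seed(12345)
box <- c(rep(1, 5), rep(0, 5))   # assumed box: five 1's and five 0's
draws <- 25                      # sample size
reps <- 50                       # number of samples
level <- 68                      # confidence level in percent

p <- mean(box)                        # proportion of 1's in the box
se <- sqrt(p * (1 - p) / draws)       # SE of the sample proportion
z <- qnorm((1 + level / 100) / 2)     # normal-curve multiplier

# one sample proportion per repetition, drawing with replacement
est <- replicate(reps, mean(sample(box, draws, replace = TRUE)))

# TRUE when the interval around a sample proportion contains p
covers <- (est - z * se <= p) & (p <= est + z * se)
mean(covers)   # fraction of intervals that cover the box proportion
```

Over many repetitions the fraction of covering intervals should be roughly the chosen confidence level, although sample proportions are discrete, so it will rarely match exactly.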
36 | 
37 | ## How to run it?
38 | 
39 | ```R
40 | library(shiny)
41 | 
42 | # Easiest way is to use runGitHub
43 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch21-accuracy-percentages")
44 | ```
45 | 


--------------------------------------------------------------------------------
/apps/ch21-accuracy-percentages/app.R:
--------------------------------------------------------------------------------
  1 | # Box with two types of tickets [# 1's, # 0's]
  2 | # Drawing tickets from the box
  3 | # Chapter 21: Accuracy of Percentages
  4 | 
  5 | library(shiny)
  6 | 
  7 | source('helpers.R')
  8 | 
  9 | # Define the overall UI
 10 | ui <- fluidPage(
 11 |   
 12 |   # Give the page a title
 13 |   titlePanel("Accuracy of Percentages"),
 14 |   
 15 |   # Generate a row with a sidebar
 16 |   sidebarLayout(      
 17 |     
 18 |     # Define the sidebar with one input
 19 |     sidebarPanel(
 20 |       fluidRow(
 21 |         column(5, 
 22 |                numericInput("tickets1", "# Tickets 1", 5,
 23 |                             min = 1, max = 100, step = 1)),
 24 |         column(5,
 25 |                numericInput("tickets0", "# Tickets 0", 5,
 26 |                             min = 1, max = 200, step = 1))
 27 |       ),
 28 |       helpText('Avg of box, and SD of box'),
 29 |       verbatimTextOutput("avg_sd_box"),
 30 |       hr(),
 31 |       sliderInput("draws", label = "Sample size (# draws):", value = 25,
 32 |                    min = 5, max = 500, step = 1),
 33 |       numericInput("reps", label = "Number of samples (# reps):", 
 34 |                   min = 10, max = 1000, value = 50, step = 10),
 35 |       checkboxInput('param', value = TRUE, label = strong('Show parameter')),
 36 |       sliderInput("confidence", label = "Confidence level (%):", value = 68,
 37 |                   min = 1, max = 99, step = 1),
 38 |       numericInput("seed", label = "Random Seed:", 12345, 
 39 |                    min = 10000, max = 50000, step = 1)
 40 |     ),
 41 |     
 42 |     # Create a spot for the barplot
 43 |     mainPanel(
 44 |       tabsetPanel(type = "tabs",
 45 |                   tabPanel("Sum", plotOutput("sumPlot")),
 46 |                   tabPanel("Percentage", plotOutput("percentPlot")),
 47 |                   tabPanel("Estimates", plotOutput("intervalPlot"))
 48 |       )
 49 |     )
 50 |   )
 51 | )
 52 | 
 53 | 
 54 | 
 55 | # Define server logic required to draw a histogram
 56 | server <- function(input, output) {
 57 |   tickets <- reactive({
 58 |     tickets <- c(rep(1, input$tickets1), rep(0, input$tickets0))
 59 |   })
 60 |   
 61 |   sum_draws <- reactive({
 62 |     set.seed(input$seed)
 63 |     samples <- 1:input$reps
 64 |     for (i in 1:input$reps) {
 65 |       samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE))
 66 |     }
 67 |     samples
 68 |   })
 69 |   
 70 |   avg_box <- reactive({
 71 |     mean(tickets())
 72 |   })
 73 |   
 74 |   sd_box <- reactive({
 75 |     total <- input$tickets1 + input$tickets0
 76 |     sqrt((input$tickets1 / total) * (input$tickets0 / total))
 77 |   })
 78 |   
 79 |   # Average and SD of box
 80 |   output$avg_sd_box <- renderPrint({ 
 81 |     cat(avg_box(), ",  ", sd_box(), sep = '')
 82 |   })
 83 |   
 84 |   # Plot with sum of draws
 85 |   output$sumPlot <- renderPlot({
 86 |     # Render a barplot
 87 |     barplot(table(sum_draws()), 
 88 |             space = 0, las = 1,
 89 |             xlab = 'Sum',
 90 |             ylab = '',
 91 |             main = sprintf('Sum of Box for %s Draws', input$draws))
 92 |   })
 93 |   
 94 |   # Plot with percentage of draws
 95 |   output$percentPlot <- renderPlot({
 96 |     # Render a barplot
 97 |     avg_draws <- round(100 * sum_draws() / input$draws)
 98 |     barplot(table(avg_draws), 
 99 |             space = 0, las = 1,
100 |             xlab = 'Percentage',
101 |             ylab = '',
102 |             main = "Percentage of 1's")
103 |   })
104 |   
105 |   # Plot with confidence intervals
106 |   output$intervalPlot <- renderPlot({
107 |     avg_box <- mean(tickets())
108 |     n <- length(tickets())
109 |     sd_box <- sqrt((n-1)/n) * sd(tickets())
110 |     se_sum <- sqrt(input$draws) * sd_box
111 |     se_perc <- se_sum / input$draws
112 |     
113 |     # Render plot
114 |     samples <- sum_draws() / input$draws
115 |     
116 |     #a <- samples - se_perc
117 |     #b <- samples + se_perc
118 |     
119 |     a <- samples - ci_factor(input$confidence) * se_perc
120 |     b <- samples + ci_factor(input$confidence) * se_perc
121 |     covers <- (a <= avg_box & avg_box <= b)
122 |     ci_cols <- rep('#ff000088', input$reps)
123 |     ci_cols[covers] <- '#0000ff88'
124 |     
125 |     #xlim <- c(min(samples) - ci_factor(input$confidence) * se_perc, 
126 |     #          max(samples) + ci_factor(input$confidence) * se_perc)
127 |     xlim <- c(min(samples) - 3 * se_perc, 
128 |               max(samples) + 3 * se_perc)
129 |     plot(samples, 1:length(samples), axes = FALSE,
130 |          col = '#444444', pch = 21, cex = 0.5,
131 |          xlim = xlim, 
132 |          ylab = 'Number of samples',
133 |          xlab = "Confidence Intervals",
134 |          main = "Percentage of 1's")
135 |     axis(side = 1, at = seq(0, 1, 0.1))
136 |     axis(side = 2, las = 1)
137 |     if (input$param) {
138 |       # display line for parameter
139 |       abline(v = avg_box, col = '#0000FFdd', lwd = 2.5)
140 |     }
141 |     segments(x0 = a,
142 |              x1 = b,
143 |              y0 = 1:length(samples),
144 |              y1 = 1:length(samples),
145 |              col = ci_cols)
146 |   })
147 |   
148 | }
149 | 
150 | 
151 | # Run the application 
152 | shinyApp(ui = ui, server = server)
153 | 
154 | 


--------------------------------------------------------------------------------
/apps/ch21-accuracy-percentages/helpers.R:
--------------------------------------------------------------------------------
 1 | # function to compute SE factor 
 2 | # for a confidence level
 3 | ci_factor <- function(level = 95) {
 4 |   area <- level + ((100 - level) / 2)
 5 |   qnorm(area/100)
 6 | }
 7 | 
 8 | # tests
 9 | 
10 | # 90% confidence level 
11 | # ci_factor(90)
12 | 
13 | # 95% confidence level 
14 | # ci_factor(95)
15 | 
16 | # 99% confidence level 
17 | # ci_factor(99)
18 | 


--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/README.md:
--------------------------------------------------------------------------------
 1 | # Ch23 - Accuracy of Averages
 2 | 
 3 | This is a Shiny app that illustrates the concept of accuracy of averages.
 4 | 
 5 | 
 6 | ## Motivation
 7 | 
 8 | The goal is to provide a visual display for the Introduction example in 
 9 | __Statistics, Chapter 23: Accuracy of Averages__
10 | 
11 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). 
12 | Fourth Edition. Norton & Company.
13 | 
14 | 
15 | ## Data
16 | 
17 | The data consists of a box model with default tickets: 1, 2, 3, 4, 5, 6, 7.
18 | However, the numbers in the box can be changed by the user.
19 | The app simulates taking random samples from the box. 
20 | There are two parameters, one is the number of draws (i.e. sample size), 
21 | and the other is the number samples (i.e. # of repetitions).
22 | 
23 | 
24 | ## Plots
25 | 
26 | There are three plots: 
27 | 
28 | 1. The first tab shows a histogram for the sum of draws.
29 | 2. The second tab shows a histogram for the average of draws.
30 | 3. The third tab shows a chart with the average of the box (i.e. population avg),
31 | and the confidence intervals of the drawn samples (i.e. sample averages)
32 | 
33 | 
34 | ## How to run it?
35 | 
36 | ```R
37 | library(shiny)
38 | 
39 | # Easiest way is to use runGitHub
40 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch23-accuracy-averages")
41 | ```
42 | 


--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/app.R:
--------------------------------------------------------------------------------
  1 | # Box with numbered tickets (default: 1, 2, 3, 4, 5, 6, 7)
  2 | # Drawing tickets from the box
  3 | # Chapter 23: Accuracy of Averages
  4 | 
  5 | library(shiny)
  6 | 
  7 | source('helpers.R')
  8 | 
  9 | # Define the overall UI
 10 | ui <- fluidPage(
 11 |   
 12 |   # Give the page a title
 13 |   titlePanel("Accuracy of Averages"),
 14 |   
 15 |   # Generate a row with a sidebar
 16 |   sidebarLayout(      
 17 |     
 18 |     # Define the sidebar with one input
 19 |     sidebarPanel(
 20 |       textInput("tickets", label = "Numbers in box:", 
 21 |                 value = '1, 2, 3, 4, 5, 6, 7'),
 22 |       helpText('Avg of box, and SD of box'),
 23 |       verbatimTextOutput("avg_sd_box"),
 24 |       hr(),
 25 |       sliderInput("draws", label = "Sample size (# draws):", value = 25,
 26 |                    min = 5, max = 500, step = 1),
 27 |       sliderInput("reps", label = "Number of samples (# reps):", 
 28 |                   min = 10, max = 1000, value = 50, step = 10),
 29 |       checkboxInput('param', value = TRUE, label = strong('Show parameter')),
 30 |       sliderInput("confidence", label = "Confidence level (%):", value = 68,
 31 |                   min = 1, max = 99, step = 1),
 32 |       numericInput("seed", label = "Random Seed:", 12345, 
 33 |                    min = 10000, max = 50000, step = 1)
 34 |     ),
 35 |     
 36 |     # Create a spot for the barplot
 37 |     mainPanel(
 38 |       tabsetPanel(type = "tabs",
 39 |                   tabPanel("Sum", plotOutput("sumPlot")),
 40 |                   tabPanel("Average", plotOutput("averagePlot")),
 41 |                   tabPanel("Estimates", plotOutput("intervalPlot"))
 42 |       )
 43 |     )
 44 |   )
 45 | )
 46 | 
 47 | 
 48 | 
 49 | # Define server logic required to draw a histogram
 50 | server <- function(input, output) {
 51 |   tickets <- reactive({
 52 |     tickets <- gsub(' ', '', input$tickets)
 53 |     tickets <- unlist(strsplit(tickets, ','))
 54 |     as.numeric(tickets)
 55 |   })
 56 |   
 57 |   sum_draws <- reactive({
 58 |     set.seed(input$seed)
 59 |     samples <- 1:input$reps
 60 |     for (i in 1:input$reps) {
 61 |       samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE))
 62 |     }
 63 |     samples
 64 |   })
 65 |   
 66 |   avg_box <- reactive({
 67 |     mean(tickets())
 68 |   })
 69 |   
 70 |   sd_box <- reactive({
 71 |     n <- length(tickets())
 72 |     sqrt((n-1)/n) * sd(tickets())
 73 |   })
 74 |   
 75 |   # Average and SD of box
 76 |   output$avg_sd_box <- renderPrint({ 
 77 |     cat(avg_box(), ",  ", sd_box(), sep = '')
 78 |   })
 79 |   
 80 |   # Plot with sum of draws
 81 |   output$sumPlot <- renderPlot({
 82 |     # Render a barplot
 83 |     barplot(table(sum_draws()), 
 84 |             space = 0, las = 1,
 85 |             xlab = 'Sum',
 86 |             ylab = '',
 87 |             main = sprintf('Sum of Box for %s Draws', input$draws))
 88 |   })
 89 |   
 90 |   # Plot with average of draws
 91 |   output$averagePlot <- renderPlot({
 92 |     # Render a barplot
 93 |     avg_draws <- round(sum_draws() / input$draws, 2)
 94 |     barplot(table(avg_draws), 
 95 |             space = 0, las = 1,
 96 |             xlab = 'Average',
 97 |             ylab = '',
 98 |             main = "Average")
 99 |   })
100 |   
101 |   # Plot with confidence intervals
102 |   output$intervalPlot <- renderPlot({
103 |     avg_box <- mean(tickets())
104 |     n <- length(tickets())
105 |     sd_box <- sqrt((n-1)/n) * sd(tickets())
106 |     se_sum <- sqrt(input$draws) * sd_box
107 |     se_perc <- se_sum / input$draws
108 |     
109 |     # Render plot
110 |     samples <- sum_draws() / input$draws
111 |     
112 |     #a <- samples - se_perc
113 |     #b <- samples + se_perc
114 |     
115 |     a <- samples - ci_factor(input$confidence) * se_perc
116 |     b <- samples + ci_factor(input$confidence) * se_perc
117 |     covers <- (a <= avg_box & avg_box <= b)
118 |     ci_cols <- rep('#ff000088', input$reps)
119 |     ci_cols[covers] <- '#0000ff88'
120 |     
121 |     #xlim <- c(min(samples) - ci_factor(input$confidence) * se_perc, 
122 |     #          max(samples) + ci_factor(input$confidence) * se_perc)
123 |     xlim <- c(min(samples) - 3 * se_perc, 
124 |               max(samples) + 3 * se_perc)
125 |     plot(samples, 1:length(samples), axes = FALSE,
126 |          col = '#444444', pch = 21, cex = 0.5,
127 |          xlim = xlim, 
128 |          ylab = 'Number of samples',
129 |          xlab = "Confidence Intervals",
130 |          main = "Average")
131 |     axis(side = 1)
132 |     axis(side = 2, las = 1)
133 |     if (input$param) {
134 |       # display line for parameter
135 |       abline(v = avg_box, col = '#0000FFdd', lwd = 2.5)
136 |     }
137 |     segments(x0 = a,
138 |              x1 = b,
139 |              y0 = 1:length(samples),
140 |              y1 = 1:length(samples),
141 |              col = ci_cols)
142 |   })
143 |   
144 | }
145 | 
146 | 
147 | # Run the application 
148 | shinyApp(ui = ui, server = server)
149 | 
150 | 


--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/helpers.R:
--------------------------------------------------------------------------------
 1 | # function to compute SE factor 
 2 | # for a confidence level
 3 | ci_factor <- function(level = 95) {
 4 |   area <- level + ((100 - level) / 2)
 5 |   qnorm(area/100)
 6 | }
 7 | 
 8 | # tests
 9 | 
10 | # 90% confidence level 
11 | # ci_factor(90)
12 | 
13 | # 95% confidence level 
14 | # ci_factor(95)
15 | 
16 | # 99% confidence level 
17 | # ci_factor(99)
18 | 


--------------------------------------------------------------------------------
/data/stock-earnings-prices.csv:
--------------------------------------------------------------------------------
 1 | "industry","earnings","price"
 2 | "auto",3.3,2.9
 3 | "banks",8.6,6.5
 4 | "chemicals",6.6,3.1
 5 | "computers",10.2,5.3
 6 | "drugs",11.3,10.0
 7 | "electrical equipment",8.5,8.2
 8 | "food",7.6,6.5
 9 | "household products",9.7,10.1
10 | "machinery",5.1,4.7
11 | "oil domestic",7.4,7.3
12 | "oil international",7.7,7.7
13 | "oil equipment",10.1,10.8
14 | "railroad",6.6,6.6
15 | "retail food",6.9,6.9
16 | "department stores",10.1,9.5
17 | "soft drinks",12.7,12.0
18 | "steel",-1.0,-1.6
19 | "tobacco",12.3,11.7
20 | "utilities electric",2.8,1.4
21 | "utilities gas",5.2,6.2
22 | 


--------------------------------------------------------------------------------
/data/vegetables-smoking.csv:
--------------------------------------------------------------------------------
 1 | state,vegetables,smoking
 2 | Alabama,20.1,18.8
 3 | Alaska,24.8,18.8
 4 | Arizona,23.7,13.7
 5 | Arkansas,21,18.1
 6 | California,28.9,9.8
 7 | Colorado,24.5,13.5
 8 | Connecticut,27.4,12.4
 9 | Delaware,21.3,15.5
10 | Florida,26.2,15.2
11 | Georgia,23.2,16.4
12 | Hawaii,24.5,12.1
13 | Idaho,23.2,13.3
14 | Illinois,24,14.2
15 | Indiana,22,20.8
16 | Iowa,19.5,16.1
17 | Kansas,19.9,13.6
18 | Kentucky,16.8,23.5
19 | Louisiana,20.2,16.4
20 | Maine,28.7,15.9
21 | Maryland,28.7,13.4
22 | Massachusetts,28.6,13.5
23 | Michigan,22.8,16.7
24 | Minnesota,24.5,14.9
25 | Mississippi,16.5,18.6
26 | Missouri,22.6,18.5
27 | Montana,24.7,14.5
28 | Nebraska,20.2,16.1
29 | Nevada,22.5,16.6
30 | NewHampshire,29.1,15.4
31 | NewJersey,25.9,12.8
32 | NewMexico,21.5,14.6
33 | NewYork,26,14.6
34 | NorthCarolina,22.5,17.1
35 | NorthDakota,21.8,15
36 | Ohio,22.6,17.6
37 | Oklahoma,15.7,19
38 | Oregon,25.9,13.4
39 | Pennsylvania,23.9,17.9
40 | RhodeIsland,26.8,15.3
41 | SouthCarolina,21.2,17
42 | SouthDakota,20.5,13.8
43 | Tennessee,26.5,20.4
44 | Texas,22.6,13.2
45 | Utah,22.1,8.5
46 | Vermont,30.8,14.4
47 | Virginia,26.2,15.3
48 | Washington,25.2,12.5
49 | WestVirginia,20,21.3
50 | Wisconsin,22.2,15.9
51 | Wyoming,21.8,16.3
52 | 


--------------------------------------------------------------------------------
/hw/README.md:
--------------------------------------------------------------------------------
  1 | ## Homework Assignments
  2 | 
  3 | - HW assignments are due on Thursdays (before midnight).
  4 | - Further instructions will be posted on bCourses (see "Assignments" section).
  5 | - Submit your homework electronically via bCourses as a Word, text, PDF, or HTML file. 
  6 | - Please do NOT submit any other file format (e.g. `.pages`, `.Rmd`, `.R`) since it won't be rendered on bCourses.
  7 | - Please become familiar with the HW policy described in the syllabus.
  8 | 
  9 | 
 10 | Tentative Calendar, Spring 2017
 11 | 
 12 | 
 13 | <hr>
 14 | 
 15 | <table>
 16 |   <thead>
 17 |     <tr>
 18 |       <th align="left">HW</th>
 19 |       <th align="left">Due</th>
 20 |       <th align="left">Topic</th>
 21 |     </tr>
 22 |   </thead>
 23 |   <tbody>
 24 |     <tr>
 25 |       <td>1</td>
 26 |       <td>Jan 26</td>
 27 |       <td>Ch-3: A3,7, C2, R4,8, extra questions</td>
 28 |     </tr>
 29 |     <tr>
 30 |       <td>2</td>
 31 |       <td>Feb 02</td>
 32 |       <td>Ch-4: B5, D8, E4, R6,9, extra questions</td>
 33 |     </tr>
 34 |     <tr>
 35 |       <td>3</td>
 36 |       <td>Feb 09</td>
 37 |       <td>
 38 |         Ch-5: C1, D1, E3, Rev7,10<br>
 39 |         Ch-8: B6,8,9, R9, extra questions</td>
 40 |     </tr>
 41 |     <tr>
 42 |       <td>4</td>
 43 |       <td>Feb 16</td>
 44 |       <td>
 45 |         Ch-9: A10, B2, E3, R4,8<br>
 46 |         Ch-10: A2,4, C4, extra questions</td>
 47 |     </tr>
 48 |     <tr>
 49 |       <td>5</td>
 50 |       <td>Feb 23</td>
 51 |       <td>Ch-10: C2, R3,4<br>
 52 |         Ch-11: B1,2, D2, E1, R4,7<br>
 53 |         Ch-12: R3,5</td>
 54 |     </tr>
 55 |     <tr>
 56 |       <td>6</td>
 57 |       <td>Mar 02</td>
 58 |       <td>Ch-13: R2,4,7,8,9
 59 |         </td>
 60 |     </tr>
 61 |     <tr>
 62 |       <td>7</td>
 63 |       <td>Mar 09</td>
 64 |       <td>
 65 |         Ch-14: R1,3,5,6,9<br>
 66 |         and Binomial Probability</td>
 67 |     </tr>
 68 |     <tr>
 69 |       <td>8</td>
 70 |       <td>Mar 16</td>
 71 |       <td>Ch-16: B2,6, R1,4,9<br>
 72 |         Ch-17: A1, B2, C1, E1, R2</td>
 73 |     </tr>
 74 |     <tr>
 75 |       <td>9</td>
 76 |       <td>Mar 23</td>
 77 |       <td>Ch-18: B3, B5, C5, R2<br>
 78 |         Ch-19: R5,7<br>
 79 |         Ch-20: A4, B3, Rev3,4,6</td>
 80 |     </tr>
 81 |     <tr>
 82 |       <td>10</td>
 83 |       <td>Apr 13</td>
 84 |       <td>Ch-21: A7,8, B4, C6,7, R2,7<br>
 85 |         Ch-23: A2,5, C2, R4,10,12</td>
 86 |     </tr>
 87 |     <tr>
 88 |       <td>11</td>
 89 |       <td>Apr 20</td>
 90 |       <td>Ch-26: B5, C1, F1, F7, R2,5,7,8,9</td>
 91 |     </tr>
 92 |     <tr>
 93 |       <td>12</td>
 94 |       <td>Apr 27</td>
 95 |       <td>Ch-26: F4, R1; Ch-27: R1, R10<br>
 96 |       Ch-29: R4,7,9,11</td>
 97 |     </tr>
 98 |   </tbody>
 99 |  </table>
100 | 


--------------------------------------------------------------------------------
/hw/hw01-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw01-questions.pdf


--------------------------------------------------------------------------------
/hw/hw02-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw02-questions.pdf


--------------------------------------------------------------------------------
/hw/hw03-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw03-questions.pdf


--------------------------------------------------------------------------------
/hw/hw04-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw04-questions.pdf


--------------------------------------------------------------------------------
/hw/hw05-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw05-questions.pdf


--------------------------------------------------------------------------------
/hw/hw06-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw06-questions.pdf


--------------------------------------------------------------------------------
/hw/hw07-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw07-questions.pdf


--------------------------------------------------------------------------------
/hw/hw08-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw08-questions.pdf


--------------------------------------------------------------------------------
/hw/hw09-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw09-questions.pdf


--------------------------------------------------------------------------------
/hw/hw10-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw10-questions.pdf


--------------------------------------------------------------------------------
/hw/hw11-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw11-questions.pdf


--------------------------------------------------------------------------------
/hw/hw12-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw12-questions.pdf


--------------------------------------------------------------------------------
/labs/README.md:
--------------------------------------------------------------------------------
  1 | ## Lab Discussions
  2 | 
  3 | Tentative Calendar, Spring 2017
  4 | 
  5 | <hr>
  6 | 
  7 | <table>
  8 |   <thead>
  9 |     <tr>
 10 |       <th align="left">Week</th>
 11 |       <th align="left">Lab</th>
 12 |       <th align="left">Topic</th>
 13 |     </tr>
 14 |   </thead>
 15 |   <tbody>
 16 |     <tr>
 17 |       <td>1</td>
 18 |       <td>
 19 |         Jan 23-24<br>
 20 |         Jan 25-26</td>
 21 |       <td>
 22 |         Ch-3: A4,5,6, B2, C4<br>
 23 |         Ch-4: A4,5,6,9, B3,4
 24 |       </td>
 25 |     </tr>
 26 |     <tr>
 27 |       <td>2</td>
 28 |       <td>
 29 |         Jan 30-31<br>
 30 |         Feb 01-02</td>
 31 |       <td>
 32 |         Ch-4: C4,5, D1,2,6, E1,5,6,7<br>
 33 |         Ch-5: A1,2, B1,2,5, F1, R11
 34 |       </td>
 35 |     </tr>
 36 |     <tr>
 37 |       <td>3</td>
 38 |       <td>
 39 |         Feb 06-07<br>
 40 |         Feb 08-09</td>
 41 |       <td>
 42 |         Ch-8: A5,6, B4,7, D2,3<br>
 43 |         Ch-9: A2-5,9, B1, C4, D1, E4
 44 |       </td>
 45 |     </tr>
 46 |     <tr>
 47 |       <td>4</td>
 48 |       <td>
 49 |         Feb 13-14<br>
 50 |         Feb 15-16</td>
 51 |       <td>
 52 |         Ch-10: A1, B1,4, C1,5, D1, E1,2<br>
 53 |         Ch-11: A4,6,8
 54 |       </td>
 55 |     </tr>
 56 |     <tr>
 57 |       <td>5</td>
 58 |       <td>
 59 |         Feb 21<br>
 60 |         Feb 22-23<br>
 61 |         Feb 24</td>
 62 |       <td>
 63 |         Ch-11: B1,2,3, D4,5,6,7, E2,3<br>
 64 |         Ch-12: A2, B3,4,5<br>
 65 |         <b>Test 1</b>
 66 |       </td>
 67 |     </tr>
 68 |     <tr>
 69 |       <td>6</td>
 70 |       <td>
 71 |         Feb 28<br>
 72 |         Mar 02</td>
 73 |       <td>
 74 |         Ch-13: A1, B1,2, C1, D2,4,5,6<br>
 75 |         Ch-14: B1,3, C3, D4</td>
 76 |     </tr>
 77 |     <tr>
 78 |       <td>7</td>
 79 |       <td>
 80 |         Mar 07<br>
 81 |         Mar 09</td>
 82 |       <td>
 83 |         Ch-15: A3-6, R1,5,6<br>
 84 |         Ch-16: A3,4, B1,3, C1, R6</td>
 85 |     </tr>
 86 |     <tr>
 87 |       <td>8</td>
 88 |       <td>
 89 |         Mar 14<br>
 90 |         Mar 16</td>
 91 |       <td>
 92 |         Ch-17: B1,3,5,6<br>
 93 |         Ch-17: D4, R6,11</td>
 94 |     </tr>
 95 |     <tr>
 96 |       <td>9</td>
 97 |       <td>
 98 |         Mar 21<br>
 99 |         Mar 23</td>
100 |       <td>
101 |         Ch-18: B1, C2,7,8, R3,12,13,14<br>
102 |         Ch-19: A5,6,8,12, R6,12</td>
103 |     </tr>
104 |     <tr>
105 |       <td>10</td>
106 |       <td>
107 |         Mar 28<br>
108 |         Mar 30</td>
109 |       <td>
110 |         <em>Spring Break</em><br>
111 |         <em>Spring Break</em>
112 |       </td>
113 |     </tr>
114 |     <tr>
115 |       <td>11</td>
116 |       <td>
117 |         Apr 04<br>
118 |         Apr 06<br>
119 |         Apr 07</td>
120 |       <td>
121 |         Ch-20: A1,2, B1,2<br>
122 |         Ch-20: C1,3,5<br>
123 |         <b>Test 2</b></td>
124 |     </tr>
125 |     <tr>
126 |       <td>12</td>
127 |       <td>
128 |         Apr 11<br>
129 |         Apr 13</td>
130 |       <td>
131 |         Ch-21: A4,5,6, B3, C4,5, D2, E1,2<br>
132 |         Ch-23: A1, B1, C3, D1,2,3,4, R3,8</td>
133 |     </tr>
134 |     <tr>
135 |       <td>13</td>
136 |       <td>
137 |         Apr 18<br>
138 |         Apr 20</td>
139 |       <td>
140 |         Ch-26: A4,5, B2, C4,5, D3<br>
141 |         Extra exercises
142 |         </td>
143 |     </tr>
144 |     <tr>
145 |       <td>14</td>
146 |       <td>
147 |         Apr 25<br>
148 |         Apr 27</td>
149 |       <td>
150 |         Ch-27: A4,5, B2,3,5, D5,6<br>
151 |         Ch-29: A2, B2,3,4,5,6,7, D2, E2</td>
152 |     </tr>
153 |   </tbody>
154 |  </table>
155 | 
156 | 


--------------------------------------------------------------------------------
/lectures/README.md:
--------------------------------------------------------------------------------
  1 | ## Lectures
  2 | 
  3 | Tentative Calendar, Spring 2017.
  4 | 
  5 | Material based on __Statistics__ (4th edition) by Freedman, Pisani and Purves. 
  6 | 
  7 | 
  8 | | Week | Date   | Monday                      | Wednesday                | Friday                   |
  9 | |------|--------|-----------------------------|--------------------------|--------------------------|
 10 | |  0   | Jan-16 |                             | Data and variables       | Intro to R & RStudio     |
 11 | |  1   | Jan-23 | Ch 3: Histograms            | Ch 4: Average            | Ch 4: Spread             |
 12 | |  2   | Jan-30 | Ch 5: Normal curve          | Ch 5: Normal Curve       | Ch 8: Correlation        |
 13 | |  3   | Feb-06 | Ch 9: More Correlation      | History of Regression    | Ch 10: Regression        |
 14 | |  4   | Feb-13 | Ch 11: RMS Error            | Ch 12: Regression line   | Regression in R          |
 15 | |  5   | Feb-20 | _Holiday_                   | Review                   | __MIDTERM 1__            |
 16 | |  6   | Feb-27 | Ch 13: Probability          | Ch 14: More Probability  | Ch 15: Binomial prob.    |
 17 | |  7   | Mar-06 | Ch 16: Law of Averages      | Ch 16: Box Models        | Ch 17: Expected Value    |
 18 | |  8   | Mar-13 | Ch 17: Standard Error       | Ch 18: Normal Approx     | Ch 18: Normal Approx     |
 19 | |  9   | Mar-20 | Ch 19: Sampling             | Ch 19: Sampling          | Ch 20: Chance Errors     |
 20 | | 10   | Mar-27 | _Spring Break_              | _Spring Break_           | _Spring Break_           |
 21 | | 11   | Apr-03 | Review                      | Review                   | __MIDTERM 2__            |
 22 | | 12   | Apr-10 | Ch 21: Accuracy Percentages | Ch 21: Conf. Intervals   | Ch 23: Accuracy Averages |
 23 | | 13   | Apr-17 | Ch 26: Significance Tests   | Ch 26: z-test            | Ch 26: t-test            |
 24 | | 14   | Apr-24 | Ch 27: Two-sample z-test    | Ch 27: Two-sample z-test | Ch 29: More about tests  |
 25 | | 15   | May-01 | _RRR_                       | _RRR_                    | _RRR_                    |
 26 | 
 27 | 
 28 | - May-09: __Final Stat 131A__, 11:30-2:30pm in Birge 50
 29 | - May-10: __Final Stat 20__, 3:00-6:00pm in VLSB 2050
 30 | 
 31 | -----
 32 | 
 33 | ## Slides and Scripts
 34 | 
 35 | - Jan 18-20: 
 36 | 	+ [Data and Variables](https://docs.google.com/presentation/d/1k0Ti3489qKExV-X9VzgOq0rCRk0EcjsEB800TDyvfG0/edit?usp=sharing)
 37 | 	+ In-class: [Getting started with R and RStudio](../scripts/01-R-introduction.pdf)
 38 | 	+ [Intro to R and RStudio](https://docs.google.com/presentation/d/1jtPoAMnT2-56REz-pFZQWSSSzFVHXOI069vrQCA0r6k/edit?usp=sharing) auxiliary slides
 39 | 	+ Practice: script about [data and variables](../scripts/02-data-variables.pdf)
 40 | - Jan 23-27:
 41 | 	+ In-class: [Histograms](https://docs.google.com/presentation/d/1D_QNv8HPBRQGqy3ofiJDuLgOpB-awMwwpMchX9n0My4/edit?usp=sharing) slides
 42 | 	+ App: [ch03-histograms](../apps/ch03-histograms)
 43 | 	+ Practice: script about [Histograms in R](../scripts/03-histograms.pdf)
 44 | 	+ In-class: [Measures of Center (Average and Median)](https://docs.google.com/presentation/d/15jjBpSkQmYs99S8A2yvGGR4lwusUcJgBXZYU88158pE/edit?usp=sharing)
 45 | 	+ Practice: script about [Average and median in R](../scripts/04-measures-center.pdf)
 46 | 	+ In-class: [Measures of Spread (RMS, Standard Deviation)](https://docs.google.com/presentation/d/1olNOkShLZTBwEywn1AsuX92PvimntXoKMn7eRDh5MRE/edit?usp=sharing)
 47 | 	+ Practice: script about [measures of spread in R](../scripts/05-measures-spread.pdf)
 48 | - Jan 30-Feb 03:
 49 | 	+ In-class: [Normal Curve](https://docs.google.com/presentation/d/1_6ZEhuTCDvxesw6H99nJxnJz7shMIU9Hzq4GzWzw0dE/edit?usp=sharing) slides
 50 | 	+ Practice: script about [normal curve in R](../scripts/06-normal-curve.pdf)
 51 | 	+ In-class: [Scatter Diagrams and Correlation](https://docs.google.com/presentation/d/1qLtoiX8CrpHL70lZ8LBQN0F-xHuwEnhpVNZalaBnSM8/edit?usp=sharing) slides
 52 | 	+ App: [ch08-corr-coeff-diagrams](../apps/ch08-corr-coeff-diagrams)
 53 | 	+ Practice: script about [scatter diagrams in R](../scripts/07-scatter-diagrams.pdf)
 54 | - Feb 06-10:
 55 | 	+ In-class: [More about Correlation](https://docs.google.com/presentation/d/1TNmvkcGnhIpZ3N-XLEJwuOcG9tDd6KbdIDzU4K6wivE/edit?usp=sharing) slides
 56 | 	+ In-class: [A bit of history about origins of regression](https://docs.google.com/presentation/d/1VBdCiJn_QmfeTsCzP29RlL4ldjripPdrSXkUSYfq0Rc/edit?usp=sharing) auxiliary slides
 57 | 	+ In-class: [Intro to Regression Method](https://docs.google.com/presentation/d/10eQJ3DxVVuC00mQ5aEBNb0nWZh8oX-vJ5mCJRQH39VA/edit?usp=sharing) slides
 58 | 	+ App: [ch10-heights-data](../apps/ch10-heights-data)
 59 | 	+ Practice: script about [Regression Line with R](../scripts/09-regression-line.pdf)
 60 | - Feb 13-17
 61 | 	+ In-class: [R.M.S. Error for Regression](https://docs.google.com/presentation/d/1KSws7X-9jr1YWtJwPUmdnooodMqBMzRLjDWhsgq04Iw/edit?usp=sharing) slides
 62 | 	+ App: [ch11-regression-residuals](../apps/ch11-regression-residuals)
 63 | 	+ Practice: script about [Predictions and Errors in Regression with R](../scripts/10-prediction-and-errors-in-regression.pdf)
 64 | 	+ In-class: [Regression Line](https://docs.google.com/presentation/d/1bEV8MWCZ6xE2zm5egZXq5wcXOGOnHDJiJvj2tTGMhyI/edit?usp=sharing) slides
 65 | 	+ App: [ch11-regression-strips](../apps/ch11-regression-strips)
 66 | - Feb 20-24
 67 | 	+ Regression Line
 68 | 	+ __Midterm 1__ Friday Feb-24
 69 | - Feb 27-Mar 03
 70 | 	+ In-class: [Probability Rules (part 1)](https://docs.google.com/presentation/d/1cgU096Vr5Ep30rXoQ68940YbbCM7wvpznsC623Zx5N0/edit?usp=sharing)
 71 | 	+ In-class: [Probability Rules (part 2)](https://docs.google.com/presentation/d/1C-bEAHd3naLPxk_WDSrMuWHd9kMdVVo7vh2x9lWaFvc/edit?usp=sharing)
 72 | 	+ In-class: [Binomial Formula](https://docs.google.com/presentation/d/1M6Xk1xwAmdewO1K5lVIAOXz45LcIfvrZOgzQs9EXc1c/edit?usp=sharing)
 73 | 	+ Practice: script about [binomial probability in R](../scripts/11-binomial-formula.pdf)
 74 | - Mar 06-10
 75 | 	+ In-class: [Law of Averages](https://docs.google.com/presentation/d/1WDS0RyPXBjo0kgYSC5AIR33Vr78lKbOURXqJ2TMXvtI/edit?usp=sharing)
 76 | 	+ Practice: script about simulating basic [chance process with R](../scripts/12-chance-processes.pdf)
 77 | 	+ App: [ch16-chance-error](../apps/ch16-chance-error)
 78 | 	+ In-class: [Expected Value and Standard Error](https://docs.google.com/presentation/d/1QCSwf7zN80253dLYUAkZ3C4h01M6rLTFJ33h1tFD9To/edit?usp=sharing)
 79 | 	+ App: [ch17-demere-games](../apps/ch17-demere-games)
 80 | 	+ App: [ch17-expected-value-std-error](../apps/ch17-expected-value-std-error)
 81 | - Mar 13-17
 82 | 	+ In-class: [Probability Histograms and Normal Approximation](https://docs.google.com/presentation/d/1AZ61AYdl1mmT3Uy1XebT8qpTbbR7uqiP0y_n740Vp8E/edit?usp=sharing)
 83 | 	+ App: [ch18-roll-dice-sum](../apps/ch18-roll-dice-sum)
 84 | 	+ App: [ch18-roll-dice-product](../apps/ch18-roll-dice-product)
 85 | 	+ App: [ch18-coin-tossing](../apps/ch18-coin-tossing)
 86 | - Mar 20-24
 87 | 	+ In-class: [Sample Surveys](https://docs.google.com/presentation/d/1n-zZKPrpCoNqhf1hnDlUNVx-XL_qxdZgwngeKWMULiM/edit?usp=sharing)
 88 | 	+ In-class: [Sample Designs](https://docs.google.com/presentation/d/1KWmjAxrSNM7hRjWPh9veTLl8_FLneJozhA-6OUSLqK8/edit?usp=sharing)
 89 | 	+ In-class: [Chance Errors in Sampling](https://docs.google.com/presentation/d/1jRFpoepvu7RWwl6fsxPD7wFkdhZk83dlLwzb9SNMXSE/edit?usp=sharing)
 90 | 	+ App: [ch20-sampling-men](../apps/ch20-sampling-men)
 91 | - Apr 03-07
 92 | 	+ Review
 93 | 	+ __Midterm 2__ Friday Apr-07
 94 | - Apr 10-14
 95 | 	+ In-class: [Accuracy of Percentages](https://docs.google.com/presentation/d/1Ia5dA9BuEHUTX0dxLRJ9RervShAHmtqk8Si8hXPak-0/edit?usp=sharing)
 96 | 	+ App: [ch21-accuracy-percentages](../apps/ch21-accuracy-percentages)
 97 | 	+ In-class: [Accuracy of Averages](https://docs.google.com/presentation/d/1FnUXMu_5qYST5Stou895O_vUjdAULxeyxhvrGImAVEA/edit?usp=sharing)
 98 | 	+ App: [ch23-accuracy-averages](../apps/ch23-accuracy-averages)
 99 | - Apr 17-21
100 | 	+ In-class: [Hypothesis Tests](https://docs.google.com/presentation/d/1FQN-qh-plq87aB1d2vOUoi3YVLl6LE28uYUhXS5RFcI/edit?usp=sharing)
101 | 	+ In-class: [One sample z-test](https://docs.google.com/presentation/d/1HhVMfQ0n8iebx527qscSFtj3wHQAqFk2xSSEnitu91g/edit?usp=sharing)
102 | 	+ In-class: [One sample t-test](https://docs.google.com/presentation/d/1GTWOiwk4Gkeh_nXnKKK47hCcE1sWMGT-s9Q313VTUFM/edit?usp=sharing)
103 | - Apr 24-28
104 | 	+ In-class: [Two sample z-test](https://docs.google.com/presentation/d/19PpdMovtJSbydDAc1Mv1wh3Mu5YWT0dFCh5aPcdE6dU/edit?usp=sharing)
105 | - May 01-05
106 | 	+ RRR Week
107 | - May 07-12
108 | 	+ __Final: Stat 131A__ Tue May-09
109 | 	+ __Final: Stat 20__ Wed May-10
110 | 


--------------------------------------------------------------------------------
/other/Karl-Pearson-and-the-origins-of-modern-statistics.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/Karl-Pearson-and-the-origins-of-modern-statistics.pdf


--------------------------------------------------------------------------------
/other/Quetelet-and-the-emergence-of-the-behavioral-sciences.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/Quetelet-and-the-emergence-of-the-behavioral-sciences.pdf


--------------------------------------------------------------------------------
/other/The-strange-science-of-Francis-Galton.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/The-strange-science-of-Francis-Galton.pdf


--------------------------------------------------------------------------------
/other/formula-sheet-final.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-final.pdf


--------------------------------------------------------------------------------
/other/formula-sheet-midterm1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-midterm1.pdf


--------------------------------------------------------------------------------
/other/formula-sheet-midterm2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-midterm2.pdf


--------------------------------------------------------------------------------
/other/standard-normal-table.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/standard-normal-table.pdf


--------------------------------------------------------------------------------
/other/t-table.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/t-table.pdf


--------------------------------------------------------------------------------
/other/z-table.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/z-table.pdf


--------------------------------------------------------------------------------
/scripts/01-R-introduction.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Getting started with R"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | header-includes: \usepackage{float}
  6 | output: html_document
  7 | urlcolor: blue
  8 | ---
  9 | 
 10 | > ### Learning Objectives
 11 | >
 12 | > - Complete installation of R and RStudio
 13 | > - Get started with R as a scientific calculator
 14 | > - First steps using RStudio
 15 | > - Getting help in R
 16 | > - Installing packages
 17 | > - Using R script files
 18 | > - Using Rmd files
 19 | > - Get to know markdown syntax
 20 | 
 21 | ------
 22 | 
 23 | ## R and RStudio
 24 | 
 25 | - Install __R__
 26 | 	- R for Mac: [https://cran.r-project.org/bin/macosx/](https://cran.r-project.org/bin/macosx/)
 27 | 	- R for Windows: [https://cran.r-project.org/bin/windows/base/](https://cran.r-project.org/bin/windows/base/)
 28 | - Install __RStudio__
 29 | 	- RStudio download (desktop version): [https://www.rstudio.com/products/rstudio/download/](https://www.rstudio.com/products/rstudio/download/)
 30 | 
 31 | 
 32 | ### Difference between R-GUI and RStudio
 33 | 
 34 | The default installation of R comes with R-GUI, which is a simple graphical
 35 | user interface. In contrast, RStudio is an _Integrated Development Environment_ 
 36 | (IDE). This means that RStudio is much more than a simple GUI, providing a nice 
 37 | working environment and development framework. In this course, you will use R
 38 | mainly for doing computations and plots, not really for programming purposes. 
 39 | And you are going to interact with R via RStudio, using the so-called __Rmd__ 
 40 | files.
 41 | 
 42 | -----
 43 | 
 44 | ## R as a scientific calculator
 45 | 
 46 | Open RStudio and locate the _console_ (or prompt) pane.
 47 | Let's start typing basic things in the console, using R as a scientific calculator:
 48 | 
 49 | ```r
 50 | # addition
 51 | 1 + 1
 52 | 2 + 3
 53 | 
 54 | # subtraction
 55 | 4 - 2
 56 | 5 - 7
 57 | 
 58 | # multiplication
 59 | 10 * 0
 60 | 7 * 7
 61 | 
 62 | # division
 63 | 9 / 3
 64 | 1 / 2
 65 | 
 66 | # power
 67 | 2 ^ 2
 68 | 3 ^ 3
 69 | ```
 70 | 
 71 | 
 72 | ### Functions
 73 | 
 74 | R has many functions. To use a function, type its name followed by parentheses.
 75 | Inside the parentheses you pass an input. Most functions will produce some
 76 | type of output:
 77 | 
 78 | ```r
 79 | # absolute value
 80 | abs(10)
 81 | abs(-4)
 82 | 
 83 | # square root
 84 | sqrt(9)
 85 | 
 86 | # natural logarithm
 87 | log(2)
 88 | ```
 89 | 
 90 | 
 91 | ### Comments in R
 92 | 
 93 | All programming languages use a set of characters to indicate that a
 94 | specific part or lines of code are __comments__, that is, things that are
 95 | not to be executed. R uses the hash or pound symbol `#` to specify comments.
 96 | Any code to the right of `#` will not be executed by R.
 97 | 
 98 | ```r
 99 | # this is a comment
100 | # this is another comment
101 | 2 * 9
102 | 
103 | 4 + 5  # you can place comments like this
104 | ```
105 | 
106 | 
107 | ### Variables and Assignment
108 | 
109 | R is more powerful than a calculator: it can do far more than most
110 | scientific calculators. One of the things you will be
111 | doing a lot in R is creating variables or objects to store values.
112 | 
113 | For instance, you can create a variable `x` and give it the value of 1.
114 | This is done using what is known as the __assignment operator__ `<-`,
115 | also known in R as the _arrow_ operator:
116 | 
117 | ```r
118 | x <- 1
119 | x
120 | ```
121 | 
122 | This is a way to tell R: "create an object `x` and store in it the number 1".
123 | Alternatively, you can use the equals sign `=` as an assignment operator:
124 | 
125 | ```r
126 | y = 2
127 | y
128 | ```
129 | 
130 | With variables, you can perform the usual algebraic operations (addition, subtraction, multiplication, division, power, etc.):
131 | 
132 | ```r
133 | x + y
134 | x - y
135 | x * y
136 | x / y
137 | x ^ y
138 | ```
139 | 
140 | 
141 | ### Case Sensitive
142 | 
143 | R is case sensitive. This means that `abs()` is not the same
144 | as `Abs()` or `ABS()`. Only the function `abs()` is the valid one.
145 | 
146 | ```r
147 | # case sensitive
148 | x = 1
149 | X = 2
150 | x + x
151 | x + X
152 | X + X
153 | ```
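
The paragraph above talks about function names, while the example uses variables. As a small sketch covering the function case (not part of the original handout), you can check that R finds `abs()` but not `Abs()`. The variable name `msg` is only used for this illustration.

```r
# exists() reports whether R can find an object with exactly that name
exists("abs")   # lowercase: the built-in absolute value function
exists("Abs")   # capitalized: no such object in a fresh R session

# calling the misspelled name raises an error;
# tryCatch() captures the error message instead of stopping
msg <- tryCatch(Abs(-4), error = function(e) conditionMessage(e))
msg
```

The captured message should mention that the function `Abs` could not be found.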
154 | 
155 | 
156 | ### Some Examples
157 | 
158 | Here are some examples that illustrate how to use R to define
159 | variables and perform basic calculations:
160 | 
161 | ```r
162 | # convert Fahrenheit degrees to Celsius degrees
163 | fahrenheit = 50
164 | celsius = (fahrenheit - 32) * (5/9)
165 | celsius
166 | 
167 | 
168 | # compute the area of a rectangle
169 | rec_length = 10
170 | rec_height = 5
171 | rec_area = rec_length * rec_height
172 | rec_area
173 | 
174 | 
175 | # degrees to radians
176 | deg = 90
177 | rad = (deg * pi) / 180
178 | rad
179 | ```
180 | 
181 | -----
182 | 
183 | ## More about RStudio
184 | 
185 | You will be working with RStudio a lot, and you will have time to learn
186 | many of the bells and whistles RStudio provides. Think about RStudio as
187 | your "workbench". Keep in mind that RStudio is NOT R. RStudio is an environment
188 | that makes it easier to work with R, while taking care of the little tasks that 
189 | can be a hassle.
190 | 
191 | 
192 | ### A quick tour of RStudio
193 | 
194 | - Understand the __pane layout__ (i.e. windows) of RStudio
195 | 	- Source
196 | 	- Console
197 | 	- Environment, History, etc
198 | 	- Files, Plots, Packages, Help, Viewer
199 | - Customize the appearance of the source pane in RStudio
200 |   - font
201 |   - size
202 |   - background
203 | 
204 | 
205 | ### Using an R script file
206 | 
207 | Most of the time you won't be working directly on the console.
208 | Instead, you will be typing your commands in some _source_ file.
209 | The most basic type of source file is known as an _R script file_.
210 | Open a new script file in the _source_ pane, and rewrite the
211 | previous commands.
212 | 
213 | You can copy the commands in your source file and paste them in the
214 | console. But that's not very efficient. Find out how to run (execute)
215 | the commands (in your source file) and pass them to the console pane.
216 | 
217 | 
218 | ### Getting help
219 | 
220 | Because we work with functions all the time, it's important to know certain
221 | details about how to use them: what inputs are required and what output is
222 | returned.
223 | 
224 | There are several ways to get help.
225 | 
226 | If you know the name of the function you want to learn more about,
227 | you can use the function `help()` and pass it the name of the function you
228 | are looking for:
229 | 
230 | ```r
231 | # documentation about the 'abs' function
232 | help(abs)
233 | 
234 | # documentation about the 'mean' function
235 | help(mean)
236 | ```
237 | 
238 | Alternatively, you can use a shortcut using the question mark `?` followed
239 | by the name of the function:
240 | 
241 | ```r
242 | # documentation about the 'abs' function
243 | ?abs
244 | 
245 | # documentation about the 'mean' function
246 | ?mean
247 | ```
248 | 
249 | - How to read the manual documentation:
250 | 	- Title
251 | 	- Description
252 | 	- Usage of function
253 | 	- Arguments
254 | 	- Details
255 | 	- See Also
256 | 	- Examples!!!
257 | 
258 | `help()` only works if you know the name of the function you are looking for.
259 | Sometimes, however, you don't know the name but you may know some keywords.
260 | To look for functions associated with a keyword, use `help.search()` or
261 | simply `??`:
262 | 
263 | ```r
264 | # search for 'absolute'
265 | help.search("absolute")
266 | 
267 | # alternatively you can also search like this:
268 | ??absolute
269 | ```
270 | 
271 | Notice the use of quotes surrounding the input name inside `help.search()`
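
Two other base R functions can complement the help pages described above. This is a brief sketch that goes slightly beyond the original handout: `args()` displays the arguments a function accepts, and `example()` runs the code from the Examples section of a help page.

```r
# show the arguments that the 'round' function accepts
args(round)

# run the code in the Examples section of the help page for 'mean'
example(mean)
```

Running the examples is often the quickest way to see what a function does before reading the full documentation.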
272 | 
273 | 
274 | ### Installing Packages
275 | 
276 | R comes with a large set of functions and packages. A package is a collection
277 | of functions that have been designed for a specific purpose. One of the great
278 | advantages of R is that many analysts, scientists, programmers, and users
279 | can create their own packages and make them available for everybody to use.
280 | R packages can be shared in different ways. The most common way to share a
281 | package is to submit it to what is known as __CRAN__, the
282 | _Comprehensive R Archive Network_.
283 | 
284 | You can install a package using the `install.packages()` function.
285 | Just give it the name of a package, surrounded by quotes, and R will look for
286 | it in CRAN, and if it finds it, R will download it to your computer.
287 | 
288 | ```r
289 | # installing
290 | install.packages("knitr")
291 | ```
292 | 
293 | You can also install a bunch of packages at once:
294 | 
295 | ```r
296 | install.packages(c("readr", "ggplot2"))
297 | ```
298 | 
299 | The installation of a package needs to be done only once. 
300 | After a package has been installed, you can start using its functions
301 | by _loading_ the package with the function `library()`:
302 | 
303 | ```r
304 | library(knitr)
305 | ```
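
Since a package only needs to be installed once, one common pattern is to install it only when it is missing. The sketch below is one possible way to do this; `is_installed()` is a hypothetical helper defined here for illustration, and `"stats"` is used because it ships with R, so the example runs without downloading anything.

```r
# hypothetical helper: TRUE if the package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg <- "stats"  # ships with R; replace with the package you need

if (!is_installed(pkg)) {
  install.packages(pkg)  # only reached when the package is missing
}

# load the package whose name is stored in a variable
library(pkg, character.only = TRUE)
```

Note the argument `character.only = TRUE`, which tells `library()` that `pkg` holds the package name as a string.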
306 | 
307 | 
308 | ### Your turn
309 | 
310 | - Install packages `"stringr"`, `"RColorBrewer"`
311 | - Calculate: $3x^2 + 4x + 8$ when $x = 2$
312 | - Look for the manual (i.e. help) documentation of the function `exp`
313 | - Find out how to look for information about binary operators
314 | like `+` or `^`
315 | - There are several tabs in the pane `Files, Plots, Packages, Help, Viewer`.
316 | Find out what the __Files__ tab is good for.
317 | 
318 | -----
319 | 
320 | ## Introduction to Rmd files
321 | 
322 | Besides using R script files to write source code, you will be using another
323 | type of source file known as an _R markdown_ file, simply called an `Rmd` file.
324 | These files use a special syntax called
325 | [markdown](https://en.wikipedia.org/wiki/Markdown).
326 | 
327 | 
328 | ### Get to know the `Rmd` files
329 | 
330 | In the menu bar of RStudio, click on __File__, then __New File__,
331 | and choose __R Markdown__. Select the default option "Document" (HTML output),
332 | and click __Ok__.
333 | 
334 | __Rmd__ files are a special type of file, referred to as a _dynamic document_,
335 | that allows you to combine narrative (text) with R code. It is extremely
336 | important that you quickly become familiar with this resource. One reason is
337 | that you can use Rmd files to write your homework assignments and convert them 
338 | to HTML, Word, or PDF files.
339 | 
340 | Locate the button __Knit__ (the one with a knitting icon) and click on it
341 | so you can see how `Rmd` files are rendered and displayed as HTML documents.
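
The same rendering can also be triggered from R code. The following is a minimal sketch, assuming the `rmarkdown` package and pandoc are available (RStudio normally provides both). The file contents and the names `rmd_path`, `fence`, and `html_path` are made up for this illustration.

```r
# three backticks, built with strrep() so they can be written inside this example
fence <- strrep("`", 3)

# write a tiny R Markdown document to a temporary file
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"My first Rmd\"",
  "output: html_document",
  "---",
  "",
  "Some narrative text, followed by a code chunk:",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
), rmd_path)

# render only if the rmarkdown package and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  html_path  # location of the generated HTML file
}
```

The temporary file shows the three ingredients of an `Rmd` file: a YAML header, narrative text, and a code chunk.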
342 | 
343 | 
344 | ### Yet Another Syntax to Learn
345 | 
346 | R markdown (`Rmd`) files use [markdown](https://daringfireball.net/projects/markdown/)
347 | as the main syntax to write content that is not R code. Markdown is a very
348 | lightweight type of markup language, and it is relatively easy to learn.
349 | 
350 | 
351 | ### Your turn
352 | 
353 | If you are new to Markdown, please take a look at the following tutorials:
354 | 
355 | - [www.markdown-tutorial.com](http://www.markdown-tutorial.com)
356 | - [www.markdowntutorial.com](http://www.markdowntutorial.com)
357 | 
358 | -----
359 | 
360 | ### Rmd basics
361 | 
362 | - YAML header:
363 | 	- title
364 | 	- author
365 | 	- date
366 | 	- output: `html_document`, `word_document`, `pdf_document`
367 | - Code Chunks:
368 | 	- syntax
369 | 	- chunk options
370 | 	- graphics
371 | - Math notation:
372 | 	- inline `$z^2 = x^2 + y^2$`
373 | 	- paragraph `$$z^2 = x^2 + y^2$$`
374 | 
375 | Example of inline equation: $z^2 = x^2 + y^2$
376 | 
377 | Example of equation in its own paragraph:
378 | $$
379 | z^2 = x^2 + y^2
380 | $$
381 | 
382 | RStudio has a basic tutorial about R Markdown files:
383 | [Rstudio markdown tutorial](http://rmarkdown.rstudio.com/)
384 | 
385 | Rmd files are able to render math symbols and expressions written using LaTeX
386 | notation. There are dozens of online resources to learn about math notation and 
387 | equations in LaTeX. Here's some documentation from [www.sharelatex.com/learn](https://www.sharelatex.com/learn/):
388 | 
389 | - [Mathematical expressions](https://www.sharelatex.com/learn/Mathematical_expressions)
390 | - [Subscripts and superscripts](https://www.sharelatex.com/learn/Subscripts_and_superscripts)
391 | - [Brackets and Parentheses](https://www.sharelatex.com/learn/Brackets_and_Parentheses)
392 | - [Fractions and Binomials](https://www.sharelatex.com/learn/Fractions_and_Binomials)
393 | - [Integrals, sums and limits](https://www.sharelatex.com/learn/Integrals,_sums_and_limits)
394 | - [List of Greek letters and math symbols](https://www.sharelatex.com/learn/List_of_Greek_letters_and_math_symbols)
395 | - [Operators](https://www.sharelatex.com/learn/Operators)
396 | 


--------------------------------------------------------------------------------
/scripts/01-R-introduction.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/01-R-introduction.pdf


--------------------------------------------------------------------------------
/scripts/02-data-variables.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Data and Variables in R"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | header-includes: \usepackage{float}
  6 | output: html_document
  7 | urlcolor: blue
  8 | ---
  9 | 
 10 | > ### Learning Objectives
 11 | >
 12 | > - Basics of vectors
 13 | > - Variables (as vectors and factors)
 14 | > - Quantitative variables as numeric vectors
 15 | > - Qualitative variables (as factors)
 16 | > - Manipulating vectors
 17 | 
 18 | 
 19 | ```{r setup, include=FALSE}
 20 | knitr::opts_chunk$set(echo = TRUE)
 21 | ```
 22 | 
 23 | ## NBA Data
 24 | 
 25 | In this Rmd script we'll consider some NBA data from the website 
 26 | _Basketball Reference_. More specifically, let's look at the Western Conference 
 27 | Standings (season 2015-2016) shown in the following screenshot:
 28 | 
 29 | ```{r out.width='60%', echo = FALSE, fig.align='center'}
 30 | knitr::include_graphics('images/western-conference-standings-2016.png')
 31 | ```
 32 | 
 33 | source: [http://www.basketball-reference.com/leagues/NBA_2016.html#all_confs_standings_E](http://www.basketball-reference.com/leagues/NBA_2016.html#all_confs_standings_E)
 34 | 
 35 | The above table contains 15 rows with 8 columns. The first column contains the
 36 | names of the teams in the Western Conference, and the rest of the columns are:
 37 | 
 38 | - _W_: wins
 39 | - _L_: losses
 40 | - _W/L%_: win-loss percentage
 41 | - _GB_: games behind (the top team)
 42 | - _PS/G_: points per game
 43 | - _PA/G_: opponent points per game
 44 | - _SRS_: simple rating system
 45 | 
 46 | From the statistical standpoint, we say that the table has 8 variables measured
 47 | (or observed) on 15 individuals. In this case the "individuals" are the basketball 
 48 | teams.
 49 | 
 50 | 
 51 | ## Basics of vectors
 52 | 
 53 | In order to use R as the computational tool in this course, you need to learn
 54 | how to input data. Before describing how to read in tables in R (we'll cover 
 55 | that later), we must talk about vectors.
 56 | 
 57 | R vectors are the most basic structure to store data in R. Virtually all other
 58 | data structures in R are based on or derived from vectors. Using a vector is also
 59 | the most basic way to manually input data.
 60 | 
 61 | You can create vectors in several ways. The most common option is with the
 62 | function `c()` (combine). Simply pass a series of values separated by commas.
 63 | Here is how to create a vector `wins` with the first five values from the column
 64 | _W_ of the conference standings table:
 65 | 
 66 | ```{r}
 67 | wins = c(73, 67, 55, 53, 44)
 68 | ```
 69 | 
 70 | Likewise, we can create a vector `losses` like this:
 71 | 
 72 | ```{r}
 73 | losses = c(9, 15, 27, 29, 38)
 74 | ```
 75 | 
 76 | Having the vectors `wins` and `losses`, we can use them to create another
 77 | vector `win_loss_perc` for the column _W/L%_ (win-loss percentage):
 78 | 
 79 | ```{r}
 80 | win_loss_perc = wins / (wins + losses)
 81 | win_loss_perc
 82 | ```
 83 | 
 84 | You can think of vectors as variables. The previous vectors `wins`, `losses`,
 85 | and `win_loss_perc` are what are known as __quantitative__ variables. This
 86 | means that each value in those variables (the numbers) reflects a quantity.
 87 | 
 88 | Not all variables are quantitative. For instance, the first column of the table
 89 | does not contain numbers but names. The name of a basketball team is referred
 90 | to as a __qualitative__ variable.
 91 | 
 92 | In R you can create a vector of names using a character vector. Again, we use 
 93 | the `c()` function and we pass it names surrounded by either single or double 
 94 | quotes. Here's how to create a vector `teams` with the names of the first five 
 95 | teams in the standings table:
 96 | 
 97 | ```{r}
 98 | teams = c('GSW', 'SAS', 'OCT', 'LAC', 'PTB')
 99 | ```
100 | 
101 | The vector `teams` is referred to in R as a __character vector__ because it 
102 | is formed by characters.
103 | 
104 | 
105 | ## Manipulating Vectors: Subsetting
106 | 
107 | In addition to creating variables, you should also learn how to do some basic
108 | manipulation of vectors. The most common type of manipulation is called 
109 | _subsetting_, which refers to extracting elements of a vector (or another R object). 
110 | To do so, you use what is known as __bracket notation__. This implies using 
111 | (square) brackets `[ ]` to get access to the elements of a vector. Inside the
112 | brackets you can specify one or more numeric values that correspond to the
113 | position(s) of the vector element(s):
114 | 
115 | ```r
116 | # first element of 'wins'
117 | wins[1]
118 | 
119 | # third element of 'losses'
120 | losses[3]
121 | 
122 | # last element of teams
123 | teams[5]
124 | ```
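
The brackets also accept several positions at once. Here is a short sketch that re-creates the vectors from above so that it runs on its own. The use of negative positions to drop elements is an addition not covered in the text above:

```r
wins = c(73, 67, 55, 53, 44)
teams = c('GSW', 'SAS', 'OCT', 'LAC', 'PTB')

# first and third elements of 'wins'
wins[c(1, 3)]

# second to fourth elements of 'teams'
teams[2:4]

# all elements of 'wins' except the first one
wins[-1]
```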
125 | 
126 | Some common functions that you can use on vectors are:
127 | 
128 | - `length()` gives the number of values
129 | - `sort()` sorts the values in increasing or decreasing order
130 | - `rev()` reverses the values
131 | 
132 | ```r
133 | length(teams)
134 | teams[length(teams)]
135 | sort(wins, decreasing = TRUE)
136 | rev(wins)
137 | ```
138 | 
139 | 
140 | 
141 | ### Subsetting with Logical Indices
142 | 
143 | In addition to using numbers inside the brackets, you can also do 
144 | _logical subsetting_. This type of subsetting involves using a __logical__ 
145 | vector inside the brackets. A logical vector is a particular type of vector 
146 | that takes the special values `TRUE` and `FALSE`, as well as `NA` 
147 | (Not Available).
148 | 
149 | This type of subsetting is very powerful because it allows you to 
150 | extract elements based on some logical condition. 
151 | Here are some examples of logical subsetting:
152 | 
153 | ```r
154 | # wins of Golden State Warriors
155 | wins[teams == 'GSW']
156 | 
157 | # teams with wins > 40
158 | teams[wins > 40]
159 | 
160 | # name of teams with losses between 10 and 29
161 | teams[losses >= 10 & losses <= 29]
162 | ```
163 | 
164 | 
165 | ## Factors and Qualitative Variables
166 | 
167 | As mentioned before, vectors are the most essential type of data structure
168 | in R. Related to vectors, there is another important data structure in R called
169 | __factor__. Factors are data structures exclusively designed to handle 
170 | qualitative or categorical data.
171 | 
172 | The term _factor_, as used in R for handling categorical variables, comes from 
173 | the terminology used in _Analysis of Variance_, commonly referred to as ANOVA. 
174 | In this statistical method, a categorical variable is commonly referred to as 
175 | _factor_ and its categories are known as _levels_.
176 | 
177 | To create a factor you use the function of the same name, `factor()`, which 
178 | takes a vector as input. The vector can be either numeric, character or logical. 
179 | 
180 | ```{r}
181 | # numeric vector
182 | num_vector <- c(1, 2, 3, 1, 2, 3, 2)
183 | 
184 | # creating a factor from num_vector
185 | first_factor <- factor(num_vector)
186 | 
187 | first_factor
188 | ```
189 | 
190 | You can take the `teams` vector and convert it to a factor:
191 | 
192 | ```{r}
193 | teams = factor(teams)
194 | teams
195 | ```
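
To connect factors with the ANOVA terminology mentioned above, the following sketch inspects the levels of the two factors. It re-creates both objects so that it runs on its own, and uses the base R functions `levels()`, `nlevels()`, and `table()`, which are not introduced elsewhere in this script:

```r
first_factor = factor(c(1, 2, 3, 1, 2, 3, 2))
teams = factor(c('GSW', 'SAS', 'OCT', 'LAC', 'PTB'))

# categories (levels) of each factor
levels(first_factor)
levels(teams)

# number of levels
nlevels(teams)

# how many times each level occurs
table(first_factor)
```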
196 | 
197 | 
198 | 
199 | ## Sequences
200 | 
201 | It is very common to generate sequences of numbers. For that R provides: 
202 | 
203 | - the colon operator `":"`
204 | - sequence function `seq()`
205 | 
206 | ```r
207 | # colon operator
208 | 1:5
209 | 1:10
210 | -3:7
211 | 10:1
212 | ```
213 | 
214 | ```r
215 | # sequence function
216 | seq(from = 1, to = 10)
217 | seq(from = 1, to = 10, by = 1)
218 | seq(from = 1, to = 10, by = 2)
219 | seq(from = -5, to = 5, by = 1)
220 | ```
221 | 
222 | 
223 | ### Repeated Vectors
224 | 
225 | There is a function `rep()`. It takes a vector as the main input, and then it
226 | optionally takes various arguments: `times`, `length.out`, and `each`.
227 | 
228 | ```{r}
229 | rep(1, times = 5)        # repeat 1 five times
230 | rep(c(1, 2), times = 3)  # repeat 1 2 three times
231 | rep(c(1, 2), each = 2)
232 | rep(c(1, 2), length.out = 5)
233 | ```
234 | 
235 | Here is a more complex example that combines `times` and `each`:
236 | 
237 | ```r
238 | rep(c(3, 2, 1), times = 3, each = 2)
239 | ```
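
When both arguments are given, `each` is applied first and the result is then repeated `times` times. A small sketch to check this (the names `a` and `b` are only for illustration):

```r
# 'each' first (3 3 2 2 1 1), then the whole result repeated 3 times
a = rep(c(3, 2, 1), times = 3, each = 2)
b = rep(rep(c(3, 2, 1), each = 2), times = 3)

a
identical(a, b)
length(a)
```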
240 | 
241 | 
242 | ## From vectors to data frames
243 | 
244 | Now that we've seen how to create some vectors and do some basic manipulation,
245 | we can describe how to combine them in a table in R. The standard tabular 
246 | structure in R is a __data frame__. To manually create a data frame you use 
247 | the function `data.frame()` and you pass it one or more vectors. Here's how
248 | to create a small data frame `dat` with the vectors `teams`, `wins`, `losses`, 
249 | and `win_loss_perc`:
250 | 
251 | ```{r}
252 | dat = data.frame(
253 |   Teams = teams,
254 |   Wins = wins,
255 |   Losses = losses,
256 |   WLperc = win_loss_perc
257 | )
258 | 
259 | dat
260 | ```
261 | 
262 | Manipulating data frames is more complex than manipulating vectors. However,
263 | manipulating the column of a data frame is essentially the same as manipulating
264 | a vector. 
265 | 
266 | There are a couple of ways to "select" a column of a data frame. One option
267 | consists of using the dollar `$` operator. This involves typing the name of 
268 | the data frame, followed by the `$`, followed by the name of the column.
269 | For instance, to extract the values in column `Teams` simply type:
270 | 
271 | ```{r}
272 | dat$Teams
273 | ```
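
Two other common options for selecting a column are double brackets and row-column bracket notation. The sketch below uses a reduced, two-column version of `dat` so that it runs on its own:

```r
dat = data.frame(
  Teams = c('GSW', 'SAS', 'OCT', 'LAC', 'PTB'),
  Wins = c(73, 67, 55, 53, 44)
)

# double brackets with the column name
dat[["Wins"]]

# row-column bracket notation, leaving the row position empty
dat[ , "Wins"]

# both return the same vector as the dollar operator
identical(dat$Wins, dat[["Wins"]])
```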
274 | 
275 | Moreover, you can use bracket notation on the extracted column like with any
276 | type of vector:
277 | 
278 | ```{r}
279 | dat$Wins[1]
280 | dat$Wins[5]
281 | ```
282 | 
283 | Likewise, you can do logical subsetting:
284 | 
285 | ```r
286 | # wins of Golden State Warriors
287 | dat$Wins[dat$Teams == 'GSW']
288 | 
289 | # teams with wins > 40
290 | dat$Teams[dat$Wins > 40]
291 | 
292 | # name of teams with losses between 10 and 29
293 | dat$Teams[dat$Losses >= 10 & dat$Losses <= 29]
294 | ```
295 | 
296 | 
297 | ## Your Turn
298 | 
299 | Refer to the table of Western Conference Standings shown at the beginning of
300 | this document. Your mission consists of creating a data frame `standings`.
301 | In order to create such a data frame, you will have to first create the following
302 | eight vectors:
303 | 
304 | - `teams`
305 | - `wins`
306 | - `losses`
307 | - `win_loss_perc`
308 | - `games_behind`
309 | - `points_scored`
310 | - `points_against`
311 | - `rating`
312 | 
313 | You can create the vector `games_behind` by taking the won games of Golden
314 | State Warriors and subtracting the wins of the rest of the teams, that is:
315 | 
316 | ```r
317 | wins[1] - wins
318 | ```
319 | 
320 | Once you have the previous listed vectors, use the function `data.frame()`
321 | to build `standings`.
322 | 
323 | Select the _Points Scored_ from the table `standings` and sort it both in 
324 | increasing as well as decreasing order.
325 | 
326 | 


--------------------------------------------------------------------------------
/scripts/02-data-variables.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/02-data-variables.pdf


--------------------------------------------------------------------------------
/scripts/03-histograms.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/03-histograms.pdf


--------------------------------------------------------------------------------
/scripts/04-measures-center.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Measures of Center"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | output: html_document
  6 | urlcolor: blue
  7 | ---
  8 | 
  9 | > ### Learning Objectives
 10 | >
 11 | > - Compute the average 
 12 | > - Become familiar with the function `mean()`
 13 | > - Interpret the average as the balancing point
 14 | 
 15 | ```{r setup, include=FALSE}
 16 | knitr::opts_chunk$set(echo = TRUE)
 17 | ```
 18 | 
 19 | As we mentioned in the previous script, the first part of the course has to do 
 20 | with __Descriptive Statistics__. The main idea is to make a "large" or 
 21 | "complicated" dataset more compact and easier to understand by using three 
 22 | major tools:
 23 | 
 24 | - summary and frequency tables
 25 | - charts and graphics
 26 | - key numeric summaries
 27 | 
 28 | In this script we will focus on various numeric summaries that are typically 
 29 | used to condense information of quantitative variables.
 30 | 
 31 | One common way to classify numeric summaries is into 1) measures of center, and
 32 | 2) measures of spread or variability.
 33 | The idea of both types of measures is to obtain one or more numeric values that
 34 | reflect a "central" value, and the amount of "spread".
 35 | 
 36 | - Measures of Center
 37 |     + average or mean
 38 |     + median
 39 | - Measures of Spread
 40 |     + range
 41 |     + interquartile range
 42 |     + standard deviation (and variance)
 43 | 
 44 | 
 45 | 
 46 | ## The Average
 47 | 
 48 | Perhaps the most common type of measure of center is the average or mean.
 49 | Consider a list of numbers formed by: 0, 1, 2, 3, 5, and 7. The average is
 50 | calculated as the sum of all values divided by the number of values:
 51 | 
 52 | $$
 53 | average = \frac{0 + 1 + 2 + 3 + 5 + 7}{6} = 3
 54 | $$
 55 | 
 56 | You can use R to compute the previous average:
 57 | 
 58 | ```{r}
 59 | (0 + 1 + 2 + 3 + 5 + 7) / 6
 60 | ```
 61 | 
 62 | Algebraically, we typically denote a set of values by $x_1, x_2, \dots, x_n$, 
 63 | in which the index $n$ represents the total number of values. Using this 
 64 | notation, the formula of the average is expressed as:
 65 | 
 66 | $$
 67 | average = \frac{x_1 + x_2 + \dots + x_n}{n}
 68 | $$
 69 | 
 70 | Using summation notation, the average can be compactly expressed as:
 71 | 
 72 | $$
 73 | average = \sum_{i=1}^{n} \frac{x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
 74 | $$
 75 | 
 76 | Summation notation uses the uppercase Greek letter $\Sigma$ (sigma) as an 
 77 | abbreviation for the phrase "the sum of". So, in place of 
 78 | $x_1 + x_2 + \dots + x_n$, we can write $\sum_{i=1}^{n} x_i$, read as "the sum 
 79 | of the observations of the variable $x$."
 80 | 
 81 | In R, you can create a vector `x` to store the previous numbers:
 82 | 
 83 | ```{r}
 84 | x = c(0, 1, 2, 3, 5, 7)
 85 | ```
 86 | 
 87 | Then, you can use the function `sum()` to add all the values in `x`, and 
 88 | compute the average as:
 89 | 
 90 | ```{r}
 91 | sum(x) / length(x)
 92 | ```
 93 | 
 94 | An alternative way to compute the average in R is using the `mean()` function:
 95 | 
 96 | ```{r}
 97 | mean(x)
 98 | ```
 99 | 
100 | 
101 | ### The Average is the balancing point
102 | 
103 | Usually, the average of a set of $n$ values $x_1, x_2, \dots, x_n$ is expressed as 
104 | $\bar{x}$ (pronounced _x-bar_). 
105 | 
106 | To understand how the average is a type of central or mid-value, we need to 
107 | talk about __deviations__. A deviation is the difference between an observed 
108 | value $x_i$ and another value of reference $ref$, that is, $(x_i - ref)$.
109 | 
110 | Taking the average value $\bar{x}$ as a reference value, we can calculate the
111 | deviations of all observations from the average: $(x_i - \bar{x})$
112 | 
113 | Given a reference value $ref$, we can also compute the sum of all deviations
114 | around such value:
115 | 
116 | $$
117 | \sum_{i=1}^{n} (x_i - ref)
118 | $$
119 | 
120 | It turns out that the average is the ONLY reference value such that the sum
121 | of deviations around it becomes zero:
122 | 
123 | $$
124 | \sum_{i=1}^{n} (x_i - \bar{x}) = 0
125 | $$
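
Why is the average the only such value? A short derivation, using only the definitions above, splits the sum of deviations around an arbitrary reference value $ref$:

$$
\sum_{i=1}^{n} (x_i - ref) = \sum_{i=1}^{n} x_i - n \cdot ref = n \bar{x} - n \cdot ref = n (\bar{x} - ref)
$$

Since $n > 0$, the right-hand side is zero exactly when $ref = \bar{x}$.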
126 | 
127 | Let's verify that in R:
128 | 
129 | ```{r}
130 | avg = mean(x)
131 | deviations = x - avg
132 | deviations
133 | ```
134 | 
135 | The sum of the deviations around the mean should be zero:
136 | 
137 | ```{r}
138 | sum(deviations)
139 | ```
140 | 
141 | This is the reason why we say that the average is one type of center or mid-value. 
142 | In simpler terms, you can think of the average as the balance point of a 
143 | distribution. The average is that point that cancels out the sum of deviations
144 | around it.
145 | 
146 | 
147 | 
148 | ### Your turn
149 | 
150 | We know that the average of `x` is `r mean(x)`. What happens to this average if:
151 | 
152 | - you add a constant $b$ to all values in `x`?
153 | - you multiply the values in `x` by a constant $a$?
154 | 
155 | For instance, let's add 2 to all values in `x`:
156 | 
157 | ```{r}
158 | mean(x + 2)
159 | ```
160 | 
161 | Now, let's multiply by 2 all values in `x`:
162 | 
163 | ```{r}
164 | mean(x * 2)
165 | ```
166 | 
167 | Spend some time in R to examine what happens to the average of $x + k$ and 
168 | $k \times x$ with several choices of $k$, e.g. -2, 5, 100. 
169 | 
170 | Now, let's see what happens to the average when you multiply all values in 
171 | `x` by some constant $a$, and then add a constant $b$:
172 | 
173 | ```{r}
174 | mean(x)
175 | a = 2
176 | b = 3
177 | mean(a*x + b)
178 | ```
179 | 
180 | Again, spend some time in R trying different values for `a` and `b`. 
181 | What's your conclusion?
182 | 
183 | 
184 | ## The Median
185 | 
186 | Another common type of measure of center is the __median__. The median is the
187 | literal middle value of an ordered distribution. By _middle value_ we mean that
188 | half of observations are below the median, and the other half of observations
189 | are above it.
190 | 
191 | The easiest way to calculate the median in R is with the function of the same
192 | name, `median()`. Consider again the numbers in the vector `x`. The median of
193 | this set of values is:
194 | 
195 | ```{r}
196 | x = c(0, 1, 2, 3, 5, 7)
197 | 
198 | median(x)
199 | ```
200 | 
201 | 
202 | The median depends on the number of values. If you have a variable with an 
203 | even number of values, then the median is the average of the two middle-values.
204 | If you have a variable with an odd number of values, then the median is the
205 | middle-value.
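
The two cases can be sketched by hand in R and compared with `median()`. The vectors `odd` and `even` are made up for illustration:

```r
# odd number of values: the median is the middle value
odd = sort(c(7, 0, 3, 1, 5))
odd[(length(odd) + 1) / 2]
median(odd)

# even number of values: the median is the average of the two middle values
even = sort(c(0, 1, 2, 3, 5, 7))
n = length(even)
(even[n / 2] + even[n / 2 + 1]) / 2
median(even)
```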
206 | 
207 | 
208 | 
209 | ## More numeric summaries
210 | 
211 | Another interesting function in R that you can use to obtain descriptive
212 | information about a variable is `summary()`. When you use this function on a
213 | numeric vector (i.e. quantitative variable), the returned output includes:
214 | 
215 | - `Min.`: minimum
216 | - `1st Qu.`: first quartile
217 | - `Median`: median
218 | - `Mean`: average
219 | - `3rd Qu.`: third quartile
220 | - `Max.`: maximum
221 | 
222 | ```{r}
223 | x = c(0, 1, 2, 3, 5, 7)
224 | 
225 | summary(x)
226 | ```
227 | 
228 | 
229 | ## Average -vs- Median
230 | 
231 | Consider a new vector `x` that contains 25 numbers: five 1's, five 2's, five 3's,
232 | five 4's, and five 5's:
233 | 
234 | ```{r}
235 | x = rep(1:5, each = 5)
236 | ```
237 | 
238 | As you can tell, all values in `x` occur with the same frequency. And if you
239 | get a histogram, R will plot all bars with the same height:
240 | 
241 | ```{r out.width='50%', fig.align='center'}
242 | hist(x, breaks = c(0, 1, 2, 3, 4, 5), las = 1, col = 'gray80')
243 | ```
244 | 
245 | In this data, the average and the median are the same. In fact, this happens
246 | whenever you have a perfectly symmetric distribution:
247 | 
248 | ```{r}
249 | mean(x)
250 | median(x)
251 | ```
252 | 
253 | 
254 | Now let's add one more observation to `x` with a value of 10, and obtain the
255 | average and the median:
256 | 
257 | ```{r}
258 | y = c(x, 10)
259 | mean(y)
260 | median(y)
261 | ```
262 | 
263 | Note that the average increased from `r mean(x)` to `r round(mean(y), 2)`, 
264 | while the median remained unchanged.
265 | 
266 | Let's make it more extreme and instead of adding a value of 10 let's add a
267 | value of 100 to `x`. The average and the median are:
268 | 
269 | ```{r}
270 | z = c(x, 100)
271 | mean(z)
272 | median(z)
273 | ```
274 | 
275 | You can look at the distributions of `x`, `y`, and `z` using the default plots 
276 | produced by `hist()`:
277 | 
278 | ```{r out.width='90%', fig.height=3, echo = FALSE, fig.align='center'}
279 | op = par(mfrow = c(1, 3))
280 | hist(x, las = 1, col = 'gray80')
281 | hist(y, las = 1, col = 'gray80')
282 | hist(z, las = 1, col = 'gray80')
283 | par(op)
284 | ```
285 | 
286 | This is a toy example that illustrates one difference between the median and the
287 | average. The median is resistant (or robust) to extreme values, whereas the
288 | average is not: unusually small or large values pull the average toward them.
289 | 
290 | 
291 | ### Example
292 | 
293 | Here's one more example that shows you how to use R to solve a typical textbook 
294 | exercise. The average and median of the first 99 values of a data set of 198 
295 | values are both equal to 120. If the average and median of the final 99 values 
296 | are both equal to 100, what can you say about the average of the entire data set? 
297 | What can you say about the median?
298 | 
299 | You can solve this type of question analytically, or you can use R. Here's how. 
300 | One simple data set that is consistent with the problem has 198 values formed by two
301 | sets of numbers: the first 99 values are all equal to 120, and the final 99 values
302 | are all equal to 100. You can create two R vectors to build the two sets of
303 | 99 values. This is achieved with the function `rep()` that allows
304 | you to __repeat__ one or more numeric values a given number of times:
305 | 
306 | ```{r}
307 | # first 99 values equal to 120
308 | first_values = rep(120, times = 99)
309 | 
310 | # final 99 values equal to 100
311 | final_values = rep(100, times = 99)
312 | 
313 | # all values
314 | all_values = c(first_values, final_values)
315 | ```
316 | 
317 | Having defined `first_values` and `final_values`, we build the entire list of 
318 | 198 values by combining them in the vector `all_values`. The next step involves
319 | finding the average and the median:
320 | 
321 | ```{r}
322 | # average
323 | mean(all_values)
324 | 
325 | # median
326 | median(all_values)
327 | ```
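
The same average can be obtained analytically. Each group of 99 values has a known average, so its total is 99 times that average, and the overall average only depends on these two totals:

$$
average = \frac{99 \times 120 + 99 \times 100}{198} = \frac{11880 + 9900}{198} = \frac{21780}{198} = 110
$$

This calculation holds for any data set that matches the description in the exercise. The median, by contrast, depends on how the individual values are arranged, so the value computed with `median(all_values)` applies to this particular data set.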
328 | 
329 | 
330 | 
331 | 


--------------------------------------------------------------------------------
/scripts/04-measures-center.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/04-measures-center.pdf


--------------------------------------------------------------------------------
/scripts/05-measures-spread.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Measures of Spread"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | output: html_document
  6 | urlcolor: blue
  7 | ---
  8 | 
  9 | > ### Learning Objectives
 10 | >
 11 | > - Becoming familiar with various measures of spread
 12 | > - Intro to the functions `range()`, `IQR()`, and `sd()`
 13 | > - Understand the concept of r.m.s. size of a list of numbers
 14 | > - Be aware of the difference between SD and SD+
 15 | 
 16 | 
 17 | ```{r setup, include=FALSE}
 18 | knitr::opts_chunk$set(echo = TRUE)
 19 | ```
 20 | 
 21 | ## Introduction
 22 | 
 23 | Quantitative variables can be summarized using two groups of measures:
 24 | 1) center, and 2) spread. Just like there are various measures of center 
 25 | (e.g. average, median, mode), we also have several measures of spread or 
 26 | variability:
 27 | 
 28 | - range
 29 | - interquartile range
 30 | - standard deviation (and variance)
 31 | 
 32 | 
 33 | ## Range
 34 | 
 35 | The most basic type of measure of spread is the __range__. The range is obtained
 36 | as the difference between the maximum value and the minimum value.
 37 | 
 38 | For example, let's consider the values 0, 5, -8, 7, and -3 used in the textbook 
 39 | (page 66). To find the range, you need to determine the largest and smallest 
 40 | values, which in this case are 7 and -8, respectively. And then obtain the
 41 | difference: 
 42 | 
 43 | $$
 44 | range = 7 - (-8) = 15
 45 | $$
 46 | 
 47 | For illustration purposes, let's implement this minimalist example in R. First
 48 | we create a vector `x` with the five values. You can use the functions `max()`
 49 | and `min()` to get the largest and smallest values in `x`:
 50 | 
 51 | ```{r}
 52 | x = c(0, 5, -8, 7, -3)
 53 | maximum = max(x)
 54 | minimum = min(x)
 55 | 
 56 | # range
 57 | maximum - minimum
 58 | ```
 59 | 
 60 | Actually, there is a `range()` function in R, which gives you the minimum and
 61 | maximum values (but not the subtraction):
 62 | 
 63 | ```{r}
 64 | # range: min value, and max value
 65 | range(x)
 66 | ```
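
If you want the range as a single number, one option is to pass the output of `range()` to the base R function `diff()`, which subtracts the first value from the second:

```r
x = c(0, 5, -8, 7, -3)

# range() returns c(minimum, maximum); diff() gives maximum - minimum
diff(range(x))
```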
 67 | 
 68 | The range is one type of measure of variability. It tells you the _length_
 69 | of the scatter in the data. The issue with the range is that extreme values
 70 | may have a considerable effect on it. For example, if you add a value of 20 to 
 71 | `x` the new range becomes:
 72 | 
 73 | ```{r}
 74 | y = c(x, 20)
 75 | 
 76 | # range
 77 | max(y) - min(y)
 78 | ```
 79 | 
 80 | The presence of outliers will affect the magnitude of the range.
 81 | 
 82 | 
 83 | ## Interquartile Range (IQR)
 84 | 
 85 | To overcome the limitations of the range we can use a different type of range 
 86 | called the __interquartile range__ or _IQR_. This is a range based not on the 
 87 | minimum and maximum values but on the first and third quartiles.
 88 | 
 89 | One way to compute quartiles in R is with the function `quantile()`. There are
 90 | slightly different formulas to compute quartiles. To find the quartiles---as 
 91 | discussed in most introductory statistics books---you need to use the argument 
 92 | `type = 2` inside the `quantile()` function:
 93 | 
 94 | ```{r}
 95 | x = c(0, 5, -8, 7, -3)
 96 | 
 97 | # 1st quartile
 98 | Q1 = quantile(x, probs = 0.25, type = 2)
 99 | 
100 | # 3rd quartile
101 | Q3 = quantile(x, probs = 0.75, type = 2)
102 | 
103 | # IQR
104 | Q3 - Q1
105 | ```
106 | 
107 | You can also use the dedicated function `IQR()` to compute the interquartile
108 | range:
109 | 
110 | ```{r}
111 | IQR(x, type = 2)
112 | ```
113 | 
114 | Compared to the classic range, the IQR is more resistant to outliers because
115 | it does not consider the entire set of values, just those between the first 
116 | and third quartile. If we add an extreme negative value of -50, and an extreme
117 | positive value of 40 to `x`, the IQR changes much less than the range does:
118 | 
119 | ```{r}
120 | y = c(x, -50, 40)
121 | 
122 | # IQR
123 | IQR(y, type = 2)
124 | ```
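
To see how each measure responds to the two extreme values, the following sketch computes the range and the IQR before and after adding them. It re-creates `x` and `y` so that it runs on its own:

```r
x = c(0, 5, -8, 7, -3)
y = c(x, -50, 40)

# range before and after adding the two extreme values
diff(range(x))
diff(range(y))

# IQR before and after adding the two extreme values
IQR(x, type = 2)
IQR(y, type = 2)
```

The IQR does change, because adding values shifts the positions of the quartiles, but the change is small compared with the jump in the range.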
125 | 
126 | 
127 | ## The Root Mean Square (RMS)
128 | 
129 | Another measure of spread is the Standard Deviation (SD). However, in order to
130 | talk about the SD, I will follow the same approach as the FPP book and first talk about the __Root Mean Square__ or RMS.
131 | 
132 | The values in our toy example are 0, 5, -8, 7, and -3. To find a central value
133 | we can use either the average or the median:
134 | 
135 | ```{r results='hide'}
136 | x = c(0, 5, -8, 7, -3)
137 | 
138 | mean(x)
139 | median(x)
140 | ```
141 | 
142 | What about a measure of _size_? In other words, how would you find a measure
143 | of how small or how big the values in `x` are? Is it possible to obtain a 
144 | quantity that tells you something about the representative _magnitude_ of values 
145 | in `x`?
146 | 
147 | To answer this question about a typical magnitude of values we need to ignore
148 | the signs. One way to do this is by looking at the absolute values, and then 
149 | compute the average:
150 | 
151 | ```{r}
152 | abs(x)
153 | 
154 | mean(abs(x))
155 | ```
156 | 
157 | For convenience reasons (e.g. algebraic manipulation and nice mathematical properties), statisticians prefer to square the values instead of using the 
158 | absolute values. And then compute the average of such squares:
159 | 
160 | ```{r}
161 | # square value
162 | x^2
163 | 
164 | # average of square values
165 | sum(x^2) / length(x)
166 | ```
167 | 
168 | The issue with using square values is that now you end up working with 
169 | square units, and with a larger number that has little to do with a typical
170 | magnitude of the original values. To tackle this problem, we take the square
171 | root:
172 | 
173 | ```{r}
174 | # root-mean-square (r.m.s)
175 | sqrt(sum(x^2) / length(x))
176 | 
177 | # equivalent to
178 | sqrt(mean(x^2))
179 | ```
180 | 
181 | The value `r round(sqrt(mean(x^2)), 2)` is referred to as the _r.m.s. size_ of 
182 | the numbers in `x`. The RMS size provides a numeric summary for the magnitude
183 | of the data. It is not really the average magnitude, but you can think of it
184 | as such.
185 | 
186 | 
187 | ## Standard Deviation (SD)
188 | 
189 | Now that we have introduced the concept of r.m.s. size of a list of numbers,
190 | we can talk about a third measure of spread known as the 
191 | __Standard Deviation__ (SD). Simply put, the Standard Deviation is a measure 
192 | of spread that quantifies the amount of variation around the average.
193 | 
194 | A keyword is the term __deviation__. In the previous script tutorial---about measures of center---I introduced the concept of _deviations_.
195 | If we denote a set of $n$ values with $x_1, x_2, \dots, x_n$, and a reference
196 | value by $ref$, a deviation is the difference between an observed 
197 | value $x_i$ and the value of reference $ref$, that is, $(x_i - ref)$.
198 | 
199 | A special type of deviation is when the reference value becomes the average.
200 | If $avg$ represents the average of $x_1, x_2, \dots, x_n$, we can calculate the
201 | deviations of all observations from the average: $(x_i - avg)$.
202 | 
203 | The Standard Deviation is based on these deviations. To be more precise, it 
204 | is based on the R.M.S. size of deviations from the average:
205 | 
206 | $$
207 | SD = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - avg)^2 }
208 | $$
209 | 
210 | The SD says how far away numbers $x_1, x_2, \dots, x_n$ are from their average.
211 | In this sense, you can think of the SD as the typical magnitude of scatter 
212 | around the average.
213 | 
214 | 
215 | ### The `sd()` function
216 | 
217 | All statistical packages come with a function that allows you to calculate 
218 | the Standard Deviation. In R, there is the function `sd()`. However, the way
219 | `sd()` works is by using a slightly different formula:
220 | 
221 | $$
222 | SD^{+} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - avg)^2 }
223 | $$
224 | 
225 | Note that $SD^{+}$ divides by $n-1$ instead of $n$. When the number of values
226 | $n$ is big, $\sqrt{n-1}$ is very close to $\sqrt{n}$. However, for relatively
227 | small values of $n$, the difference between $\sqrt{n-1}$ and $\sqrt{n}$ can
228 | be considerable.
229 | 
230 | If you want to use `sd()` to obtain $SD$, you need to multiply the output by a 
231 | correction factor of $\sqrt{\frac{n-1}{n}}$:
232 | 
233 | ```{r}
234 | x = c(0, 5, -8, 7, -3)
235 | n = length(x)
236 | 
237 | # SD
238 | sqrt((n-1)/n) * sd(x)
239 | 
240 | # SD+
241 | sd(x)
242 | ```
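
As a quick sanity check, the sketch below compares the corrected `sd()` output against the SD computed directly from its definition. The names `vals`, `n_vals`, `sd_manual`, and `sd_corrected` are hypothetical, chosen only so that `x` and `n` are left untouched.

```{r}
# toy values (same numbers as above), stored under a new name
vals = c(0, 5, -8, 7, -3)
n_vals = length(vals)

# SD computed directly from the definition (divide by n)
sd_manual = sqrt(mean((vals - mean(vals))^2))

# SD obtained from sd() times the correction factor
sd_corrected = sqrt((n_vals - 1) / n_vals) * sd(vals)

# TRUE if both approaches agree
all.equal(sd_manual, sd_corrected)
```

If the two values agree, the correction factor is doing its job: it undoes the division by $n-1$ and replaces it with a division by $n$.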
243 | 
244 | 
245 | 


--------------------------------------------------------------------------------
/scripts/05-measures-spread.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/05-measures-spread.pdf


--------------------------------------------------------------------------------
/scripts/06-normal-curve.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "The Normal Curve"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | output: html_document
  6 | urlcolor: blue
  7 | ---
  8 | 
  9 | > ### Learning Objectives
 10 | >
 11 | > - Becoming familiar with the normal curve
 12 | > - Intro to the functions `dnorm()`, `pnorm()`, and `qnorm()`
 13 | > - How to find areas under the normal curve using R
 14 | > - Converting values to standard units
 15 | 
 16 | 
 17 | ```{r setup, include=FALSE}
 18 | knitr::opts_chunk$set(echo = TRUE)
 19 | ```
 20 | 
 21 | 
 22 | ## Introduction
 23 | 
 24 | Let's look at the distributions of some variables in the data of NBA players: 
 25 | 
 26 | ```{r}
 27 | # assembling the URL of the CSV file
 28 | repo = 'https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/'
 29 | datafile = 'master/data/nba_players.csv'
 30 | url = paste0(repo, datafile)
 31 | # read in data set
 32 | nba = read.csv(url)
 33 | ```
 34 | 
 35 | More specifically, let's take a peek at the histograms of the variables `height`,
 36 | `weight`, `age`, and `points2_percent`:
 37 | 
 38 | ```{r echo = FALSE, out.width='95%', fig.align='center'}
 39 | variables = c('height', 'weight', 'age', 'points2_percent')
 40 | op = par(mfrow = c(2, 2))
 41 | for (i in variables) {
 42 |   hist(nba[ ,i], xlab = i,
 43 |        col = 'gray80', las = 1,
 44 |        main = paste('Histogram of', i))
 45 | }
 46 | par(op)
 47 | ```
 48 | 
 49 | - `height` seems to have a slightly left skewed distribution.
 50 | - `weight` looks roughly symmetric.
 51 | - `age` has a right skewed distribution.
 52 | - `points2_percent` appears to be fairly symmetric.
 53 | 
 54 | These distributions are examples of some of the possible patterns that 
 55 | you will find when describing data in real life. If you are lucky, you may
 56 | even get to see a perfectly symmetric distribution one day.
 57 | 
 58 | Among the wide range of distribution shapes that we encounter when looking 
 59 | at data, one special pattern has received most of the attention: the so-called 
 60 | _symmetric bell-shaped_ or mound-shaped distribution, like that of 
 61 | `points2_percent` and `weight`. It is true that these two histograms are far 
 62 | from perfect symmetry, but we can put them within the _fairly_ bell-shaped 
 63 | category.
 64 | 
 65 | 
 66 | ## Normal Curve
 67 | 
 68 | It turns out that there is one mathematical function that fits (density) 
 69 | histograms having a symmetric bell-shaped pattern: the famous __normal curve__
 70 | given by the following equation:
 71 | 
 72 | $$
 73 | y = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} x^2}
 74 | $$
 75 | 
 76 | This equation, also known as the Laplace-Gaussian curve, was first discovered by
 77 | Abraham de Moivre (circa 1720) while working on the first problems about 
 78 | probability. However, his work around the normal equation went unnoticed for
 79 | many years. By the time historians realized he had been the first person to 
 80 | come up with the normal equation, most people had attributed authorship to
 81 | either French scholar Pierre-Simon Laplace or German mathematician
 82 | Carl Friedrich Gauss.
 83 | 
 84 | In the past, before the 1880s, the curve was referred to as the _Error curve_,
 85 | because of its application around the errors from measurements in astronomy. 
 86 | The name _normal_ appeared around the late 1870s and early 1880s, when British
 87 | biometricians like Francis Galton, and later on his disciple Karl Pearson, 
 88 | together with Ronald Fisher, popularized the word _normal_. Galton never 
 89 | explained why he used the term "normal" although it seems that he was implying 
 90 | the sense of conforming to a norm (i.e. a standard, model, pattern, type).
 91 | 
 92 | 
 93 | ### Plotting the Normal Curve in R
 94 | 
 95 | You can use R to obtain a graph of the normal curve. One approach is to generate
 96 | values for the x-axis, and then use the equation of the normal curve to obtain
 97 | values for the y-axis:
 98 | 
 99 | ```{r out.width='60%', fig.align='center'}
100 | x = seq(from = -3, to = 3, by = 0.01)
101 | y = (1/sqrt(2 * pi)) * exp(-(x^2)/2)
102 | 
103 | plot(x, y, type = "l", lwd = 3, col = "blue")
104 | ```
105 | 
106 | First we generate a vector `x` with some values for the x-axis ranging from 
107 | -3 to 3. Then we use `x` to find the heights of the `y` variable. Finally, 
108 | we use the values in `x` and `y` as coordinates of the `plot()`. The argument 
109 | `type = 'l'` is used to graph a line instead of dots. The argument `lwd` allows 
110 | you to define the width of the line. And `col` lets you define a color.
111 | 
112 | 
113 | ## Normal Distribution Functions
114 | 
115 | Instead of working with the equation `y = (1/sqrt(2 * pi)) * exp(-(x^2)/2)`, 
116 | you can use a family of four R functions dedicated to the normal curve: 
117 | 
118 | - `dnorm()` density function
119 | - `pnorm()` distribution function
120 | - `qnorm()` quantile function
121 | - `rnorm()` random number generator function
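
Of these four functions, `rnorm()` is the only one not covered in the sections below, so here is a minimal sketch of it. The seed value, the sample size of 1000, and the name `draws` are arbitrary choices made for illustration.

```{r}
# fix the seed so the random numbers are reproducible
set.seed(123)

# 1000 random values generated from the normal curve
draws = rnorm(1000)

# density histogram of the random values
hist(draws, probability = TRUE, col = 'gray80', las = 1,
     main = 'Histogram of rnorm() values', xlab = 'standard units')
```

With a sample of this size, the histogram should look roughly symmetric and bell-shaped, although it will not match the curve exactly.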
122 | 
123 | 
124 | ### Heights of the curve with `dnorm()`
125 | 
126 | The function `dnorm()` is the __density__ function. This is actually the function
127 | that lets you find the height of the curve (i.e. $y$ values). Instead of 
128 | manually coding the normal equation, you can use `dnorm()` and get the 
129 | previously obtained graph like this:
130 | 
131 | ```{r out.width='60%', fig.align='center'}
132 | x = seq(from = -3, to = 3, by = 0.01)
133 | y = dnorm(x)
134 | 
135 | plot(x, y, type = "l", lwd = 3, col = "blue")
136 | ```
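
To convince yourself that `dnorm()` returns the same heights as the normal equation, you can compare both computations on a small grid of values. This is only a sketch; the names `grid`, `by_hand`, and `by_dnorm` are hypothetical and chosen so that `x` and `y` are not overwritten.

```{r}
# a few values on the x-axis
grid = seq(from = -3, to = 3, by = 0.5)

# heights from the normal equation
by_hand = (1 / sqrt(2 * pi)) * exp(-(grid^2) / 2)

# heights from dnorm()
by_dnorm = dnorm(grid)

# TRUE if both sets of heights match
all.equal(by_hand, by_dnorm)
```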
137 | 
138 | 
139 | ### Areas under the curve with `pnorm()`
140 | 
141 | The function `pnorm()` is the distribution function. By default, `pnorm()`
142 | returns the area under the curve to the __left__ of a specified `x` value. For
143 | instance, the area to the left of 0 is 0.5 or 50%:
144 | 
145 | ```{r}
146 | pnorm(0)
147 | ```
148 | 
149 | Try `pnorm()` with these values:
150 | 
151 | ```{r eval = FALSE}
152 | pnorm(-2)
153 | pnorm(-1)
154 | pnorm(1)
155 | pnorm(2)
156 | ```
157 | 
158 | You can also use `pnorm()` to find areas under the normal curve to the __right__
159 | of a specific `x` value. This is done by using the argument `lower.tail = FALSE`:
160 | 
161 | ```{r}
162 | # area to the right of 1
163 | pnorm(1, lower.tail = FALSE)
164 | ```
165 | 
166 | Try finding the areas to the right of:
167 | 
168 | ```{r eval = FALSE}
169 | pnorm(-2.5, lower.tail = FALSE)
170 | pnorm(-2, lower.tail = FALSE)
171 | pnorm(0.5, lower.tail = FALSE)
172 | pnorm(1.5, lower.tail = FALSE)
173 | ```
174 | 
175 | 
176 | Sometimes you need to find the area between two $z$ values. For instance, consider
177 | the area between -1 and 1 (which is about 68%). Finding this type of area involves
178 | subtracting the smaller area to the left of -1 from the larger area to the 
179 | left of 1:
180 | 
181 | ```{r}
182 | # area between -1 and 1
183 | pnorm(1) - pnorm(-1)
184 | ```
185 | 
186 | What about the area between -2 and 2? 
187 | 
188 | ```{r}
189 | # area between -2 and 2
190 | pnorm(2) - pnorm(-2)
191 | ```
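
Because `pnorm()` is vectorized, the same subtraction can be done for several intervals at once. The sketch below, which uses the hypothetical names `k` and `areas`, computes the areas within 1, 2, and 3 standard units of zero, expressed as percentages.

```{r}
# number of standard units on each side of 0
k = 1:3

# areas between -k and k
areas = pnorm(k) - pnorm(-k)

# as percentages, rounded to one decimal
round(100 * areas, 1)
```

These are the three areas behind the 68-95-99.7 rule.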
192 | 
193 | 
194 | 
195 | ### Z values of a given area with `qnorm()`
196 | 
197 | The function `qnorm()` is the quantile function. You can think of this function
198 | as the inverse of `pnorm()`. That is, for a given area under the curve, 
199 | `qnorm()` returns the corresponding `z` value (i.e. value on the x-axis):
200 | 
201 | ```{r}
202 | # z-value such that the area to its left is 0.5
203 | qnorm(0.5)
204 | 
205 | # z-value such that the area to its left is 0.3
206 | qnorm(0.3)
207 | ```
208 | 
209 | Likewise, you can use the argument `lower.tail = FALSE` to find values given
210 | a right-tail area:
211 | 
212 | ```{r}
213 | # z-value such that the area to its right is 0.5
214 | qnorm(0.5, lower.tail = FALSE)
215 | 
216 | # z-value such that the area to its right is 0.3
217 | qnorm(0.3, lower.tail = FALSE)
218 | ```
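
A short sketch can make the inverse relationship between `qnorm()` and `pnorm()` concrete. The area of 0.3 and the names `area_left` and `z_left` are chosen only for illustration.

```{r}
# an area to the left
area_left = 0.3

# z-value with that area to its left
z_left = qnorm(area_left)

# going back: area to the left of that z-value
pnorm(z_left)

# by symmetry, the z-value with area 0.3 to its right is -z_left
all.equal(qnorm(area_left, lower.tail = FALSE), -z_left)
```

Passing the output of `qnorm()` to `pnorm()` should give back the area you started with.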
219 | 
220 | 
221 | 
222 | ## Standard Units
223 | 
224 | In real life, most variables will be measured on some scale: `height` measured 
225 | in inches, `weight` measured in pounds, `age` measured in years, 
226 | `points2_percent` measured as a percentage. To be able to use the normal curve
227 | as an approximation for symmetric bell-shaped distributions, you will need
228 | to convert the original units into __standard units__ (SU).
229 | 
230 | Recall that the conversion formula from $x$ to standard units is:
231 | 
232 | $$
233 | SU = \frac{x - avg}{SD}
234 | $$
235 | 
236 | Let's see how you could convert `weight` values to SU using R. First we need
237 | to find the average and standard deviation of `weight`:
238 | 
239 | ```{r}
240 | # average weight
241 | avg_weight = mean(nba$weight)
242 | avg_weight
243 | 
244 | # SD weight
245 | # (remember to use correction factor)
246 | n = nrow(nba)
247 | sd_weight = sqrt((n-1)/n) * sd(nba$weight)
248 | sd_weight
249 | ```
250 | 
251 | To convert the weights of the players to standard units, subtract the average 
252 | and then divide by the SD:
253 | 
254 | ```{r}
255 | # weight in SU
256 | su_weight = (nba$weight - avg_weight) / sd_weight
257 | 
258 | # weights in SU of first 5 players
259 | su_weight[1:5]
260 | ```
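
Values in standard units always have an average of 0 and an SD of 1. The sketch below checks this on a small made-up vector; `toy_weight` and the other `toy_` names are hypothetical and are not part of the NBA data.

```{r}
# made-up weights (in pounds), used only for illustration
toy_weight = c(180, 195, 210, 225, 240, 260)
toy_n = length(toy_weight)

# average and SD (with the correction factor)
toy_avg = mean(toy_weight)
toy_sd = sqrt((toy_n - 1) / toy_n) * sd(toy_weight)

# convert to standard units
toy_su = (toy_weight - toy_avg) / toy_sd

# average of the values in SU (zero, up to rounding error)
mean(toy_su)

# SD of the values in SU
sqrt(mean((toy_su - mean(toy_su))^2))
```

The same check should work on `su_weight`, since it only relies on the conversion formula.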
261 | 
262 | 
263 | What does the histogram for `su_weight` look like?
264 | 
265 | ```{r out.width='60%', fig.align='center'}
266 | # density histogram
267 | hist(su_weight, las = 1, col = 'gray80', probability = TRUE,
268 |      ylim = c(0, 0.5), xlim = c(-3.5, 3.5),
269 |      main = 'Histogram of Weight in SU', xlab = 'standard units')
270 | ```
271 | 
272 | 
273 | An alternative picture of the distribution of `su_weight` can be obtained by
274 | plotting a kernel density curve:
275 | 
276 | ```{r out.width='60%', fig.align='center'}
277 | dens_weight = density(su_weight)
278 | plot(dens_weight, axes = FALSE, ylim = c(0, 0.5), xlim = c(-3.5, 3.5),
279 |      main = 'Density Curve', xlab = 'standard units', lwd = 2, col = 'blue')
280 | # x-axis
281 | axis(side = 1)
282 | # y-axis
283 | axis(side = 2, las = 1)
284 | ```
285 | 
286 | Looking at both the histogram and the kernel density curve, the shape of the
287 | distribution is symmetric but it does not have a peak around zero.
288 | You can say that it has more of a plateau or flat peak.
289 | 
290 | 
291 | ## Using Normal Approximation
292 | 
293 | Although `weight` does not have the central peak, we can try to see how well 
294 | the normal curve approximates its distribution. From the attributes of the 
295 | normal curve, we know that 50% of players should have a weight below 
296 | `avg_weight`. We can directly check the proportion of players below 
297 | `avg_weight`:
298 | 
299 | ```{r}
300 | # proportion of players below average weight
301 | sum(nba$weight <= avg_weight) / n
302 | ```
303 | 
304 | This is consistent with `weight` (and `su_weight`) having a symmetric shape.
305 | 
306 | From the empirical 68-95-99.7 rule, we know that about 68% of players should 
307 | have weights between `r round(avg_weight, 2)` plus-minus
308 | `r round(sd_weight, 2)`, that is, between `r round(avg_weight - sd_weight, 2)`
309 | and `r round(avg_weight + sd_weight, 2)`:
310 |  
311 | ```{r}
312 | weight_minus = avg_weight - sd_weight
313 | weight_plus = avg_weight + sd_weight
314 | 
315 | # proportion of players within 1 SD from average weight
316 | sum(nba$weight <= weight_plus & nba$weight >= weight_minus) / n
317 | ```
318 | 
319 | As you can tell, the proportion of players within 1 SD of the average is not 68% 
320 | but 65%. However, the difference of 3 percentage points is not that big.
321 | 
322 | 
323 | ### Assuming Normality ...
324 | 
325 | Let's pretend that `weight` does have a symmetric bell-shaped distribution,
326 | and that we are interested in finding the proportion of players with weights
327 | below 200 pounds.
328 | 
329 | You can use `pnorm()` to find such proportion, and without having to convert
330 | to standard units. All you need to do is specify the `mean` and `sd` arguments
331 | with the corresponding average and SD values, respectively:
332 | 
333 | ```{r}
334 | # proportion of players with weight below 200 pounds
335 | pnorm(200, mean = avg_weight, sd = sd_weight)
336 | ```
337 | 
338 | To find the proportion of players with weights above 230 pounds, include the
339 | argument `lower.tail = FALSE`:
340 | 
341 | ```{r}
342 | # proportion of players with weight above 230 pounds
343 | pnorm(230, mean = avg_weight, sd = sd_weight, lower.tail = FALSE)
344 | ```
345 | 
346 | You can also use `qnorm()` to find what would be the corresponding weight
347 | such that 60% of players are below it:
348 | 
349 | ```{r}
350 | qnorm(0.6, mean = avg_weight, sd = sd_weight)
351 | ```
352 | 
353 | Or the weight such that 35% of players are above it:
354 | 
355 | ```{r}
356 | qnorm(0.35, mean = avg_weight, sd = sd_weight, lower.tail = FALSE)
357 | ```
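
Specifying `mean` and `sd` is just a shortcut for converting to standard units yourself. The sketch below illustrates this with made-up values: `demo_avg = 220` and `demo_sd = 25` are hypothetical numbers, not the actual average and SD of the NBA weights.

```{r}
# made-up average and SD (in pounds), for illustration only
demo_avg = 220
demo_sd = 25

# option 1: pass the average and SD to pnorm()
p_direct = pnorm(200, mean = demo_avg, sd = demo_sd)

# option 2: convert 200 pounds to standard units first
z_200 = (200 - demo_avg) / demo_sd
p_su = pnorm(z_200)

# TRUE if both options give the same proportion
all.equal(p_direct, p_su)

# qnorm() goes the other way: from standard units back to pounds
all.equal(qnorm(0.6, mean = demo_avg, sd = demo_sd),
          demo_avg + qnorm(0.6) * demo_sd)
```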
358 | 


--------------------------------------------------------------------------------
/scripts/06-normal-curve.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/06-normal-curve.pdf


--------------------------------------------------------------------------------
/scripts/07-scatter-diagrams.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Scatter Diagrams"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | output: html_document
  6 | urlcolor: blue
  7 | ---
  8 | 
  9 | > ### Learning Objectives
 10 | >
 11 | > - How to use `plot()` to create scatter diagrams
 12 | > - Adding points with `points()`
 13 | > - Adding lines with `abline()`
 14 | > - How to use `ggplot()` to create scatter diagrams
 15 | 
 16 | 
 17 | ```{r setup, include=FALSE}
 18 | knitr::opts_chunk$set(echo = TRUE)
 19 | ```
 20 | 
 21 | ## Introduction
 22 | 
 23 | The easiest way to plot scatter diagrams in R is with the `plot()` function.
 24 | I should say that `plot()` produces different kinds of plots depending on the
 25 | type of input(s) that you pass to it.
 26 | 
 27 | If you pass two numeric variables (i.e. two R vectors) `x` and `y`, `plot()`
 28 | will produce a scatter diagram. For example, consider the `height` and `weight` 
 29 | variables of the following toy data table:
 30 | 
 31 | ```{r echo = FALSE}
 32 | library(xtable)
 33 | dat = data.frame(
 34 |   name = c('Luke', 'Leia', 'Obi-Wan', 'Yoda', 'Chewbacca'),
 35 |   sex = c('male', 'female', 'male', 'male', 'male'),
 36 |   height = c(172, 150, 182, 66, 228),
 37 |   weight = c(77, 49, 44, 78, 112)
 38 | )
 39 | ```
 40 | 
 41 | ```{r, echo=FALSE, results='asis', message=FALSE}
 42 | xtb <- xtable(dat, digits = 2)
 43 | print(xtb, comment = FALSE, type = 'latex',
 44 |       include.rownames = FALSE)
 45 | ```
 46 | 
 47 | To make a scatter diagram with `height` and `weight`, you can create two 
 48 | vectors and pass them to `plot()`:
 49 | 
 50 | ```{r out.width='50%', fig.align='center', fig.width=3, fig.height=3.5}
 51 | height = c(172, 150, 182, 66, 228)  # in centimeters
 52 | weight = c(77, 49, 44, 78, 112)     # in kilograms
 53 | 
 54 | # default scatter diagram
 55 | plot(height, weight)
 56 | ```
 57 | 
 58 | If you pass a factor to `plot()`, it will produce a bar chart:
 59 | 
 60 | ```{r out.width='50%', fig.align='center', fig.width=2.5, fig.height=3.5}
 61 | # qualitative variable (as an R factor)
 62 | sex = factor(c('male', 'female', 'male', 'male', 'male'))
 63 | 
 64 | # default bar chart
 65 | plot(sex)
 66 | ```
 67 | 
 68 | 
 69 | Note that `plot()` displays a very simple, and kind of ugly, scatter diagram. 
 70 | This is not an accident. In fact, the basic plots in R follow a "quick and dirty" 
 71 | approach. They are not publication quality, but that is OK. The default display
 72 | of `plot()` was not designed to produce pretty graphics, but rather to produce
 73 | visualizations that quickly allow you to explore the data, identify patterns, 
 74 | help you ask new research questions, and then move on with more visualizations
 75 | or to the next analytical stages.
 76 | 
 77 | Although `plot()` produces a basic graph, you can use several arguments, or
 78 | graphical parameters, to obtain a nicer chart. To find more information about 
 79 | the available graphical parameters for `plot()`, take a look at the documentation
 80 | provided by `help(plot)`.
 81 | 
 82 | The following code uses various graphical parameters to display a more visually
 83 | appealing scatter diagram:
 84 | 
 85 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5}
 86 | # nicer scatter diagram
 87 | plot(height, weight, 
 88 |      las = 1,         # orientation of y-axis tick marks 
 89 |      pch = 19,        # filled dots
 90 |      col = '#598CDD', # color of dots
 91 |      xlab = 'Height (cm)',   # x-axis label
 92 |      ylab = 'Weight (kg)',   # y-axis label
 93 |      main = 'Height -vs- Weight scatter diagram')
 94 | ```
 95 | 
 96 | 
 97 | ## Adding points and lines
 98 | 
 99 | Often, you may want to add more points and/or line(s) to a given plot. When
100 | you use `plot()`, you add points with `points()`, and lines with `abline()`.
101 | 
102 | For example, say you want to add the point of averages. First, get the
103 | averages:
104 | 
105 | ```{r}
106 | avg_height = mean(height)
107 | avg_weight = mean(weight)
108 | ```
109 | 
110 | Once you have the coordinates of the point of averages, you can `plot()` again 
111 | the scatter diagram, adding the point of averages with `points()`:
112 | 
113 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5}
114 | # scatter diagram
115 | plot(height, weight, 
116 |      las = 1,         # orientation of y-axis tick marks 
117 |      pch = 19,        # filled dots
118 |      col = '#598CDD', # color of dots
119 |      xlab = 'Height (cm)',   # x-axis label
120 |      ylab = 'Weight (kg)',   # y-axis label
121 |      main = 'Height -vs- Weight scatter diagram')
122 | # point of averages
123 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato")
124 | ```
125 | 
126 | 
127 | Another common task involves adding one or more lines to a scatter diagram
128 | produced by `plot()`. One option to achieve this task is via the `abline()`
129 | function. Here's an example showing the previous scatter diagram, with two
130 | guide lines corresponding to the point of averages:
131 | 
132 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5}
133 | # scatter diagram
134 | plot(height, weight, 
135 |      las = 1,         # orientation of y-axis tick marks 
136 |      pch = 19,        # filled dots
137 |      col = '#598CDD', # color of dots
138 |      xlab = 'Height (cm)',   # x-axis label
139 |      ylab = 'Weight (kg)',   # y-axis label
140 |      main = 'Height -vs- Weight scatter diagram')
141 | # guide lines for point of avgs
142 | abline(h = avg_weight, v = avg_height, col = "tomato")
143 | # point of averages
144 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato")
145 | ```
146 | 
147 | The argument `h` is used to specify the y-value for _horizontal_ lines; 
148 | the argument `v` is used to specify the x-value for _vertical_ lines.
149 | 
150 | If what you want is to specify a line with intercept `a` and slope `b`, then
151 | specify these arguments inside `abline()`:
152 | 
153 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5}
154 | # scatter diagram
155 | plot(height, weight, 
156 |      las = 1,         # orientation of y-axis tick marks 
157 |      pch = 19,        # filled dots
158 |      col = '#598CDD', # color of dots
159 |      xlab = 'Height (cm)',   # x-axis label
160 |      ylab = 'Weight (kg)',   # y-axis label
161 |      main = 'Height -vs- Weight scatter diagram')
162 | # guide lines for point of avgs
163 | abline(h = avg_weight, v = avg_height, col = "tomato")
164 | # line with intercep and slope
165 | abline(a = 40, b = 0.3, col = "gray50", lty = 2, lwd = 2)
166 | # point of averages
167 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato")
168 | ```
169 | 
170 | 
171 | 
172 | ## Scatter diagrams with `ggplot2`
173 | 
174 | Another approach to create scatter diagrams in R is to use functions from the 
175 | package `"ggplot2"`. This package provides a different philosophy to define
176 | graphs, and it also produces plots with visual attributes carefully chosen
177 | to provide prettier plots.
178 | 
179 | You should have the package `"ggplot2"` already installed, since you were 
180 | supposed to use it for HW02. Assuming that this is the case, you need to load 
181 | `"ggplot2"` with the function `library()` in order to start using its functions:
182 | 
183 | ```{r warning=FALSE, message=FALSE}
184 | # load ggplot2
185 | library(ggplot2)
186 | ```
187 | 
188 | One of the major differences between basic plots---like those produced by `plot()`---and graphics with `ggplot()`, is that the latter requires the data 
189 | to be in the form of a data frame:
190 | 
191 | ```{r}
192 | dat = data.frame(
193 |   name = c('Luke', 'Leia', 'Obi-Wan', 'Yoda', 'Chewbacca'),
194 |   sex = c('male', 'female', 'male', 'male', 'male'),
195 |   height = c(172, 150, 182, 66, 228),
196 |   weight = c(77, 49, 44, 78, 112)
197 | )
198 | ```
199 | 
200 | To create a scatter diagram with `"ggplot2"`, type the following commands:
201 | 
202 | ```{r out.width='40%', fig.align='center', fig.width=3, fig.height=3}
203 | ggplot(data = dat, aes(x = height, y = weight)) +
204 |   geom_point()
205 | ```
206 | 
207 | - The main input of `ggplot()` is `data` which takes the name of the data 
208 | frame containing the variables.
209 | - The `aes()` function---inside `ggplot()`---allows you to specify which 
210 | variables will be used for the `x` and `y` positions.
211 | - The `+` operator is used to add a _layer_, in this case, the layer corresponds
212 | to `geom_point()`.
213 | - The function `geom_point()` specifies the type of geometric object to be 
214 | displayed: points (since we want a scatter diagram with dots). 
215 | 
216 | As you can tell, the default chart produced by `ggplot()` is nicer than the
217 | one produced with `plot()`. You can customize the previous graph to add more
218 | details:
219 | 
220 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4}
221 | ggplot(data = dat, aes(x = height, y = weight)) +
222 |   geom_point(size = 3) +
223 |   theme_bw() +
224 |   ggtitle("Height -vs- Weight scatter diagram")
225 | ```
226 | 
227 | Here's another example of a scatter diagram that includes labels for each dot:
228 | 
229 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4}
230 | ggplot(data = dat, aes(x = height, y = weight)) +
231 |   geom_point(size = 3) +
232 |   geom_text(aes(label = name), hjust=0, vjust=0) +
233 |   xlim(0, 300) +
234 |   theme_bw() +
235 |   ggtitle("Height -vs- Weight scatter diagram")
236 | ```
237 | 
238 | Adding specific points with `ggplot()` is a bit trickier. This is because
239 | you need to provide data to `ggplot()` in the form of a data frame. In order 
240 | to plot the point of averages with `ggplot()`, we need to create a data frame
241 | for such a point:
242 | 
243 | ```{r}
244 | # data frame for the point of averages
245 | avgs = data.frame(height = avg_height, weight = avg_weight)
246 | avgs
247 | ```
248 | 
249 | One way to add the point of averages is to use `geom_point()` twice: once for 
250 | the heights and weights of the individuals, and a second time for the 
251 | point of averages:
252 | 
253 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4}
254 | ggplot(data = dat, aes(x = height, y = weight)) +
255 |   geom_point(size = 3) +
256 |   geom_point(data = avgs, aes(x = height, y = weight), 
257 |              col = "tomato", size = 4) +
258 |   geom_text(aes(label = name), hjust=0, vjust=0) +
259 |   xlim(0, 300) +
260 |   theme_bw() +
261 |   ggtitle("Height -vs- Weight scatter diagram")
262 | ```
263 | 
264 | Finally, here's how to add guide lines for the point of averages:
265 | 
266 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4}
267 | ggplot(data = dat, aes(x = height, y = weight)) +
268 |   geom_point(size = 3) +
269 |   geom_point(data = avgs, aes(x = height, y = weight), 
270 |              col = "tomato", size = 4) +
271 |   geom_vline(xintercept = avg_height, col = 'tomato') +
272 |   geom_hline(yintercept = avg_weight, col = 'tomato') +
273 |   geom_text(aes(label = name), hjust=0, vjust=0) +
274 |   xlim(0, 300) +
275 |   theme_bw() +
276 |   ggtitle("Height -vs- Weight scatter diagram")
277 | ```
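
For completeness, the `"ggplot2"` counterpart of `abline(a = 40, b = 0.3)` is the layer `geom_abline()`, which takes `intercept` and `slope` arguments. The following is a minimal sketch; the data frame `toy` and the object `gg_line` are hypothetical names, and the intercept and slope are the same arbitrary values used with `abline()` earlier.

```{r}
library(ggplot2)

# same toy heights and weights, stored under a new name
toy = data.frame(
  height = c(172, 150, 182, 66, 228),
  weight = c(77, 49, 44, 78, 112)
)

# scatter diagram with a dashed line of intercept 40 and slope 0.3
gg_line = ggplot(data = toy, aes(x = height, y = weight)) +
  geom_point(size = 3) +
  geom_abline(intercept = 40, slope = 0.3,
              col = 'gray50', linetype = 'dashed') +
  theme_bw() +
  ggtitle("Height -vs- Weight scatter diagram")

# display the plot
gg_line
```

Unlike `abline()`, which draws on top of an existing plot, `geom_abline()` is added as one more layer with the `+` operator.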
278 | 


--------------------------------------------------------------------------------
/scripts/07-scatter-diagrams.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/07-scatter-diagrams.pdf


--------------------------------------------------------------------------------
/scripts/08-correlation.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Correlation Coefficient"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | output: html_document
  6 | urlcolor: blue
  7 | ---
  8 | 
  9 | > ### Learning Objectives
 10 | >
 11 | > - Using scatter diagrams to visualize association of two variables
 12 | > - Using R to "manually" compute the correlation coefficient
 13 | > - Getting to know the function `cor()`
 14 | > - Understanding how change of scales affect the correlation
 15 | 
 16 | 
 17 | ```{r setup, include=FALSE}
 18 | knitr::opts_chunk$set(echo = TRUE)
 19 | ```
 20 | 
 21 | ## Introduction
 22 | 
 23 | In the previous script we talked about how to plot scatter diagrams in R using
 24 | two different approaches: 1) the basic `plot()` function, and 2) the more 
 25 | advanced graphics package `"ggplot2"`.
 26 | Knowing how to create scatter diagrams will help us introduce the ideas that
 27 | have to do with the analysis of two quantitative variables.
 28 | 
 29 | Describing and summarizing a single (quantitative) variable is usually the 
 30 | first step of any data analysis. This should allow you to get to know the data
 31 | by looking at the distributions of the variables, and reducing the numerical
 32 | information in the data to a set of measures of center and spread.
 33 | 
 34 | After performing a univariate analysis, the next step will usually consist of 
 35 | exploring how two variables may be associated, determining the type of association 
 36 | and how strong it is (if any), and deciding how to summarize such association.
 37 | 
 38 | 
 39 | ## Anscombe Data Set
 40 | 
 41 | In this tutorial we are going to use a special data set known as the _Anscombe_
 42 | data or _Anscombe's Quartet_. This data was created by Francis Anscombe in 
 43 | the early 1970s to illustrate statistical similarities and differences between 
 44 | four pairs of $x-y$ values. This is one of the many data sets that come
 45 | in R, and it is available in the object `anscombe`:
 46 | 
 47 | ```{r}
 48 | # Anscombe's Quartet
 49 | anscombe
 50 | ```
 51 | 
 52 | The data frame `anscombe` contains 8 variables: 4 `x`'s and 4 `y`'s. The way you should handle these variables is: `x1` with `y1`, `x2` with `y2`, and so on.
 53 | 
 54 | 
 55 | ### Histograms
 56 | 
 57 | Let's begin a univariate analysis by looking at the histograms of the `x` 
 58 | variables:
 59 | 
 60 | ```{r xhistograms, eval = FALSE}
 61 | # histograms of x-variables in 2x2 layout
 62 | op = par(mfrow = c(2, 2))
 63 | hist(anscombe$x1, col = 'gray80', las = 1)
 64 | hist(anscombe$x2, col = 'gray80', las = 1)
 65 | hist(anscombe$x3, col = 'gray80', las = 1)
 66 | hist(anscombe$x4, col = 'gray80', las = 1)
 67 | par(op)
 68 | ```
 69 | 
 70 | ```{r xhistograms, echo = FALSE, fig.height=6}
 71 | ```
 72 | 
 73 | Note that `x1`, `x2`, and `x3` have the exact same histogram. If you look 
 74 | at the data frame, this is explained by the fact that these variables have the 
 75 | same values. In contrast, `x4` has almost all of its values equal to 8, 
 76 | except for one value of 19.
 77 | 
 78 | Now let's look at the histograms of the `y` variables:
 79 | 
 80 | ```{r yhistograms, eval = FALSE}
 81 | # histograms of y-variables in 2x2 layout
 82 | op = par(mfrow = c(2, 2))
 83 | hist(anscombe$y1, col = 'gray80', las = 1)
 84 | hist(anscombe$y2, col = 'gray80', las = 1)
 85 | hist(anscombe$y3, col = 'gray80', las = 1)
 86 | hist(anscombe$y4, col = 'gray80', las = 1)
 87 | par(op)
 88 | ```
 89 | 
 90 | ```{r yhistograms, echo = FALSE, fig.height=6}
 91 | ```
 92 | 
 93 | ### Measures of Center and Spread
 94 | 
 95 | To get various summary statistics, you can use the function `summary()`:
 96 | 
 97 | ```{r}
 98 | # basic summary of x-variables
 99 | summary(anscombe[ ,1:4])
100 | 
101 | # SD+ of x-variables
102 | apply(anscombe[, 1:4], MARGIN = 2, FUN = sd)
103 | ```
104 | 
105 | Again, note that the `summary()` output for `x1`, `x2`, and `x3` is the same. 
106 | As for the standard deviation ($SD^+$), all `x`-variables have identical values.
107 | To calculate all the standard deviations at once, we are using the function 
108 | `apply()`. This function allows you to _apply_ a function, e.g. `sd()`, to the 
109 | columns (`MARGIN = 2`) of the input data `anscombe[, 1:4]`.
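
Keep in mind that `sd()` returns $SD^+$. If you prefer the SD that divides by $n$, one option is to write a small helper function and pass it to `apply()`. The function `sd_n()` below is a hypothetical helper written for this sketch, not a built-in R function.

```{r}
# helper: SD that divides by n (instead of n-1)
sd_n = function(v) {
  m = length(v)
  sqrt((m - 1) / m) * sd(v)
}

# SD of the x-variables
apply(anscombe[, 1:4], MARGIN = 2, FUN = sd_n)

# sapply() gives the same result, treating the data frame as a list of columns
sapply(anscombe[, 1:4], sd_n)
```

Since each `x` variable has 11 values, the correction factor is the same for all four columns, so identical $SD^+$ values lead to identical $SD$ values.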
110 | 
111 | Now let's get the summary indicators and standard deviation for the `y` variables:
112 | 
113 | ```{r}
114 | # basic summary of y-variables
115 | summary(anscombe[ ,5:8])
116 | 
117 | # SD+ of y-variables
118 | apply(anscombe[, 5:8], MARGIN = 2, FUN = sd)
119 | ```
120 | 
121 | Can you notice anything special? Here's a hint: look at the averages and SDs.
122 | All four `y` variables have pretty much the same averages and SDs. But they 
123 | have different ranges, quartiles, and medians. And if you take a peek at their
124 | histograms, their distributions also have different shapes.
125 | 
126 | 
127 | ## Scatter Diagrams
128 | 
129 | The real interest in the Anscombe data set has to do with studying the
130 | association between each pair of $x-y$ values. The best way to start exploring
131 | pairwise associations is by looking at the scatter diagram of each
132 | pair of variables. How would you describe the shapes and patterns in each plot?
133 | 
134 | ```{r scatterplots, eval = FALSE}
135 | # scatter diagrams in 2x2 layout
136 | op = par(mfrow = c(2, 2), mar = c(4.5, 4, 1, 1))
137 | plot(anscombe$x1, anscombe$y1, pch = 20)
138 | plot(anscombe$x2, anscombe$y2, pch = 20)
139 | plot(anscombe$x3, anscombe$y3, pch = 20)
140 | plot(anscombe$x4, anscombe$y4, pch = 20)
141 | par(op)
142 | ```
143 | 
144 | ```{r scatterplots, fig.height=5.5, echo = FALSE}
145 | ```
146 | 
147 | - The first set `x1` and `y1` shows some degree of linear association. Although 
148 | the dots do not lie on a line, we can say that they follow a linear pattern.
149 | 
150 | - The second set clearly has a non-linear pattern; instead, the dots follow 
151 | some type of curve (perhaps quadratic) or a polynomial of degree greater than 1.
152 | 
153 | - The third set is almost perfectly linear except for the observation 
154 | corresponding to $x = 13$ which falls outside the pattern of the rest of $y$
155 | values.
156 | 
157 | - The fourth set is similar to the third one in the sense that there is one 
158 | observation (an outlier?) that does not follow the pattern of the other values.
159 | Most dots follow a vertical line at $x=8$ except for the dot at $x=19$.
160 | 
161 | 
162 | ## Correlation Coefficient
163 | 
164 | In addition to the visual inspection of the scatter diagrams, statisticians 
165 | use a summary measure to quantify the degree of _linear association_ between 
166 | two quantitative variables: the __coefficient of correlation__.
167 | 
168 | One way to obtain the correlation coefficient of two variables $x$ and $y$ is 
169 | as the average of the product of $x$ and $y$ in standard units.
170 | 
171 | Let's consider `x1` and `y1` from the `anscombe` data set, and use R to "manually" 
172 | calculate the correlation coefficient. This involves obtaining the average and 
173 | the standard deviation $SD$, and then converting values to standard units:
174 | 
175 | ```{r}
176 | # number of observations
177 | n = nrow(anscombe)
178 | 
179 | # x1 in SU
180 | x1_avg = mean(anscombe$x1)
181 | x1_sd = sqrt((n-1)/n) * sd(anscombe$x1)
182 | x1su = (anscombe$x1 - x1_avg) / x1_sd
183 | 
184 | # y1 in SU
185 | y1_avg = mean(anscombe$y1)
186 | y1_sd = sqrt((n-1)/n) * sd(anscombe$y1)
187 | y1su = (anscombe$y1 - y1_avg) / y1_sd
188 | 
189 | # correlation: average of products
190 | mean(x1su * y1su)
191 | ```
192 | 
193 | Here's some good news. You don't really need to "manually" calculate the 
194 | correlation coefficient. R actually has a function to compute the correlation
195 | of two variables: `cor()`
196 | 
197 | ```{r}
198 | # correlation coefficient
199 | cor(anscombe$x1, anscombe$y1)
200 | ```
201 | 
202 | Now let's get the correlation coefficients for all four pairs of variables:
203 | 
204 | ```{r}
205 | cor(anscombe$x1, anscombe$y1)
206 | cor(anscombe$x2, anscombe$y2)
207 | cor(anscombe$x3, anscombe$y3)
208 | cor(anscombe$x4, anscombe$y4)
209 | ```
210 | 
211 | Any surprises? As you can tell, all four pairs of $x,y$ variables have basically
212 | the same correlation of `r round(cor(anscombe$x1, anscombe$y1), 3)`. 
213 | But not all of them have scatter diagrams in which the points cluster around 
214 | a line.
215 | 
216 | The take home message is that the correlation coefficient can be misleading in 
217 | the presence of outliers or non-linear association. 
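
A small made-up example (not part of the Anscombe data) makes the point about non-linear association even more sharply. In this sketch, `y_sym` is completely determined by `x_sym`, yet the correlation coefficient is essentially zero because the pattern is a curve, not a line. The names `x_sym` and `y_sym` are hypothetical.

```r
# a perfect, but non-linear, relationship
x_sym = -3:3
y_sym = x_sym^2

# correlation is (essentially) zero
cor(x_sym, y_sym)
```

A correlation near zero therefore does not mean "no association"; it only means "no linear association".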
218 | 
219 | 
220 | ## Properties of the Correlation Coefficient
221 | 
222 | One of the properties of the correlation coefficient is that it is a symmetric 
223 | measure. By this we mean that the order of the variables is not important. 
224 | You can interchange between $x$ and $y$, and the correlation between them 
225 | is unchanged:
226 | 
227 | $$
228 | cor(x,y) = cor(y,x)
229 | $$
230 | 
231 | To illustrate this property, let's create two variables:
232 | 
233 | ```{r}
234 | # two variables
235 | x = c(1, 3, 4, 5, 7, 6)
236 | y = c(5, 9, 7, 8, 9, 10)
237 | ```
238 | 
239 | ```{r scatterdiags, eval = FALSE}
240 | op = par(mfrow = c(1,2))
241 | plot(x, y, pch = 20, col = "blue", las = 1, cex = 1.5)
242 | plot(y, x, pch = 20, col = "blue", las = 1, cex = 1.5)
243 | par(op)
244 | ```
245 | 
246 | ```{r scatterdiags, out.width='80%', fig.align='center', fig.width = 8, fig.height=4}
247 | ```
248 | 
249 | The scatter diagram changes depending on what variable is on each axis. 
250 | However, the correlation coefficient in both cases is the same:
251 | 
252 | ```{r}
253 | # symmetric
254 | cor(x, y)
255 | cor(y, x)
256 | ```
257 | 
258 | 
259 | ### Change of Scale
260 | 
261 | The other properties of the correlation coefficient have to do with what the
262 | FPP book calls _change of scale_. To be more precise, the changes of scale
263 | considered here are __linear__ (i.e. linear transformations).
264 | Typical operations that result in a linear change of scale are:
265 | 
266 | - Adding a scalar: $x + 3, y$
267 | - Multiplying by a positive scalar: $2x, y$
268 | - Multiplying by a negative scalar: $-2x, y$
269 | - Adding and multiplying: $2x + 3, y$
270 | 
271 | ```{r change-scale, eval = FALSE}
272 | # scatter diagrams in 2x2 layout
273 | op = par(mfrow = c(2, 2), mar = c(4.5, 4, 1, 1))
274 | plot(x + 3, y, pch = 20, col = "orange", las = 1, cex = 1.5)
275 | plot(2 * x, y, pch = 20, col = "green3", las = 1, cex = 1.5)
276 | plot((-2) * x, y, pch = 20, col = "violet", las = 1, cex = 1.5)
277 | plot(2 * x + 3, y, pch = 20, col = "red", las = 1, cex = 1.5)
278 | par(op)
279 | ```
280 | 
281 | ```{r change-scale, echo = FALSE, fig.height=6}
282 | ```
283 | 
284 | ```{r correlations, eval = FALSE}
285 | cor(x, y)
286 | cor(x + 3, y)
287 | cor(2 * x, y)
288 | cor(-2 * x, y)
289 | cor(2 * x + 3, y)
290 | ```
291 | 
292 | ```{r correlations, echo = FALSE}
293 | ```
294 | 
295 | What can you conclude from the change of scales? In which cases is the 
296 | correlation coefficient affected by such changes?
297 | 


--------------------------------------------------------------------------------
/scripts/08-correlation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/08-correlation.pdf


--------------------------------------------------------------------------------
/scripts/09-regression-line.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/09-regression-line.pdf


--------------------------------------------------------------------------------
/scripts/10-prediction-and-errors-in-regression.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Predictions and Errors in Regression"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | output: html_document
  6 | fontsize: 11pt
  7 | urlcolor: blue
  8 | ---
  9 | 
 10 | > ### Learning Objectives
 11 | >
 12 | > - Calculating predicted values with the regression method
 13 | > - Looking at the regression residuals
 14 | > - Calculating r.m.s. error for regression
 15 | 
 16 | 
 17 | ```{r setup, include=FALSE}
 18 | knitr::opts_chunk$set(echo = TRUE)
 19 | ```
 20 | 
 21 | 
 22 | ## Introduction
 23 | 
 24 | In the previous script, you learned about the function `lm()` to obtain a simple linear regression model. Specifically, we looked at the regression `coefficients`: the intercept and the slope. You also learned how to plot a scatter diagram with the regression line, via the `abline()` function, as well as how to "manually" calculate the intercept and slope with the formulas:
 25 | 
 26 | $$
 27 | slope = r \times \frac{SD_y}{SD_x}
 28 | $$
 29 | 
 30 | In turn, Chapter 12 presents the formula of the intercept as:
 31 | 
 32 | $$
 33 | intercept = avg_y - slope \times avg_x
 34 | $$
 35 | 
 36 | 
 37 | ## Regression with Height Data Set
 38 | 
 39 | To continue our discussion, we'll keep using the data set in the CSV file `pearson.csv` (in the github repository):
 40 | 
 41 | ```{r}
 42 | # assembling the URL of the CSV file
 43 | # (otherwise it won't fit within the margins of this document)
 44 | repo = 'https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/'
 45 | datafile = 'master/data/pearson.csv'
 46 | url = paste0(repo, datafile)
 47 | 
 48 | # read in data set
 49 | dat = read.csv(url)
 50 | ```
 51 | 
 52 | The data frame `dat` contains `r nrow(dat)` rows, and `r ncol(dat)` columns:
 53 | 
 54 | - `Father`: height of the father (in inches)
 55 | - `Son`: height of the son (in inches)
 56 | 
 57 | Here's a reminder on how to use the function `lm()` to regress `Son` on `Father`:
 58 | 
 59 | ```{r}
 60 | # run regression analysis
 61 | reg = lm(Son ~ Father, data = dat)
 62 | reg
 63 | ```
 64 | 
 65 | You can compare the coefficients given by `lm()` with your own calculated 
 66 | $b_1$ and $b_0$ according to the previous formulas. First let's get the main 
 67 | ingredients:
 68 | 
 69 | ```{r}
 70 | # number of values (to be used for correcting SD+)
 71 | n = nrow(dat)
 72 | 
 73 | # averages
 74 | avg_x = mean(dat$Father)
 75 | avg_y = mean(dat$Son)
 76 | 
 77 | # SD (corrected SD+)
 78 | sd_x = sqrt((n-1)/n) * sd(dat$Father)
 79 | sd_y = sqrt((n-1)/n) * sd(dat$Son)
 80 | 
 81 | # correlation coefficient
 82 | r = cor(dat$Father, dat$Son)
 83 | ```
 84 | 
 85 | Now let's compute the slope and intercept, and compare them with 
 86 | `reg$coefficients`
 87 | 
 88 | ```{r}
 89 | # slope
 90 | b1 = r * (sd_y / sd_x)
 91 | b1
 92 | 
 93 | # intercept
 94 | b0 = avg_y - (b1 * avg_x)
 95 | b0
 96 | 
 97 | # compared with coeffs
 98 | reg$coefficients
 99 | ```
100 | 
101 | 
102 | ## Predicting Values
103 | 
104 | As I mentioned in the last tutorial, regression tools are 
105 | mainly used for prediction purposes. This means that we can use the estimated 
106 | regression line $\mathtt{Son} \approx b_0 + b_1 \mathtt{Father}$, to predict 
107 | the height of Son given a particular Father's height.
108 | 
109 | For example, if a father has a height of 71 inches, what is the predicted 
110 | son's height? 
111 | 
112 | __Option a)__ One way to answer this question is with the regression method described in chapter 10 of FPP. The first step consists of converting $x$ to standard units, then multiplying by $r$ to get the predicted $\hat{y}$ in standard units, and finally rescaling the predicted value to the original units.
113 | 
114 | ```{r}
115 | # height of father in standard units
116 | height = 71
117 | height_su = (height - avg_x) / sd_x
118 | height_su
119 | ```
120 | 
121 | ```{r}
122 | # predicted Son's height in standard units
123 | prediction_su = r * height_su
124 | prediction_su
125 | ```
126 | 
127 | ```{r}
128 | # rescaled to original units
129 | prediction = prediction_su * sd_y + avg_y
130 | prediction
131 | ```
132 | 
133 | 
134 | __Option b)__ Another way to find the predicted son's height when the height of the father is 71 is by using the equation of the regression line:
135 | 
136 | ```{r}
137 | # predict height of son with a 71 in. tall father
138 | b0 + b1 * 71
139 | ```
140 | 
141 | __Option c)__ A third option is with the `predict()` function. The first 
142 | argument must be an `"lm"` object; the second argument must be a data frame 
143 | containing the values for `Father`:
144 | 
145 | ```{r}
146 | # new data (must be a data frame)
147 | newdata = data.frame(Father = 71)
148 | 
149 | # predict son's height
150 | predict(reg, newdata)
151 | ```
152 | 
153 | If you want to know the predicted values based on several `Father`'s heights, 
154 | then do something like this:
155 | 
156 | ```{r}
157 | more_data = data.frame(Father = c(65, 66.7, 67, 68.5, 70.5, 71.3))
158 | 
159 | predict(reg, more_data)
160 | ```
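
The three options should give the same answer. Here is a minimal sketch that checks this on simulated data, so that it runs without downloading anything. The data frame `toy` and its values are made up for illustration (they are NOT the Pearson data), and the names `pred_a`, `pred_b`, `pred_c` are hypothetical.

```r
# synthetic father-son heights (hypothetical values, not the Pearson data)
set.seed(123)
father = rnorm(200, mean = 68, sd = 2.7)
son = 34 + 0.5 * father + rnorm(200, mean = 0, sd = 2.4)
toy = data.frame(Father = father, Son = son)

# ingredients
fit = lm(Son ~ Father, data = toy)
n = nrow(toy)
r = cor(toy$Father, toy$Son)
sd_x = sqrt((n - 1) / n) * sd(toy$Father)
sd_y = sqrt((n - 1) / n) * sd(toy$Son)

# option a: regression method in standard units
pred_a = mean(toy$Son) + r * ((71 - mean(toy$Father)) / sd_x) * sd_y

# option b: equation of the regression line
b1 = r * sd_y / sd_x
b0 = mean(toy$Son) - b1 * mean(toy$Father)
pred_b = b0 + b1 * 71

# option c: predict()
pred_c = unname(predict(fit, data.frame(Father = 71)))

# all three predictions agree (up to rounding error)
all.equal(pred_a, pred_b)
all.equal(pred_b, pred_c)
```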
161 | 
162 | 
163 | ## R.M.S. Error for Regression
164 | 
165 | The predictions given by the regression line will tend to be off. There is 
166 | usually some difference between the observed values $y$ and the predicted 
167 | values $\hat{y}$. This difference is called a __residual__. The residuals are 
168 | part of the `"lm"` object `reg`. 
169 | You can take a peek at such residuals with `head()`
170 | 
171 | ```{r}
172 | # first six residuals
173 | head(reg$residuals)
174 | ```
175 | 
176 | By how much will the predicted values be off?
177 | To find the answer, you need to calculate the _Root Mean Square_ (RMS) error 
178 | for regression. In other words, you need to take the residuals 
179 | (i.e. difference between actual values and predicted values), and get the
180 | square root of the average of their squares.
181 | 
182 | ```{r}
183 | # r.m.s. error for regression
184 | rms = sqrt(mean(reg$residuals^2))
185 | rms
186 | ```
187 | 
188 | The r.m.s. value tells you the typical size of the residuals. This means that 
189 | the predicted heights of sons will typically be off by about `r round(rms, 2)` 
190 | inches.
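
FPP also gives a shortcut for the r.m.s. error of the regression line: $\sqrt{1 - r^2} \times SD_y$. The sketch below compares both calculations. To keep it self-contained it uses the `cars` data set that ships with R (an assumption made only for illustration, instead of the heights data); `rms_direct` and `rms_shortcut` are hypothetical names.

```r
# regression of stopping distance on speed (built-in cars data)
fit = lm(dist ~ speed, data = cars)

# ingredients for the shortcut formula
n = nrow(cars)
r = cor(cars$speed, cars$dist)
sd_y = sqrt((n - 1) / n) * sd(cars$dist)

# r.m.s. error computed directly from the residuals
rms_direct = sqrt(mean(fit$residuals^2))

# r.m.s. error with the shortcut formula
rms_shortcut = sqrt(1 - r^2) * sd_y

all.equal(rms_direct, rms_shortcut)
```

Note that the shortcut requires the SD (dividing by $n$), not the $SD^+$ returned by `sd()`.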
191 | 
192 | 
193 | ## Are residuals homoscedastic?
194 | 
195 | As you know, the main assumption in a simple regression analysis is that $X$ 
196 | and $Y$ are approximately linearly related. This means that we can 
197 | use a line as a good summary for the cloud of points. For a line to be able to do 
198 | a good summarizing job, the amount of spread around the regression line should 
199 | be roughly the same (i.e. constant). This requirement has a very 
200 | specific---and rather ugly---name: __homoscedasticity__; which simply means 
201 | "same scatter". Visually, homoscedasticity comes in the form of the so-called 
202 | football-shaped cloud of points. Or in a more geometric sense, a cloud of points 
203 | with a chiefly elliptical shape.
204 | 
205 | The `"lm"` object `reg` contains the vector of residuals (see `reg$residuals`).
206 | The residuals from the regression line must average out to 0. To confirm this,
207 | let's get their average:
208 | 
209 | ```{r}
210 | mean(reg$residuals)
211 | ```
212 | 
213 | You can take a look at the _residual plot_ by running this command:
214 | 
215 | ```{r out.width='60%', fig.align='center', fig.width=6, fig.height=5}
216 | # residuals plot
217 | plot(reg, which = 1)
218 | ```
219 | 
220 | which is equivalent to this other command:
221 | 
222 | ```{r eval = FALSE}
223 | # equivalently
224 | plot(reg$fitted.values, reg$residuals)
225 | ```
226 | 
227 | This residual plot is not exactly the same as the one the book describes (pages 187-188).
228 | To plot the residuals like the book does, you would need to use the `Father` 
229 | variable in the x-axis:
230 | 
231 | ```{r out.width='60%', fig.align='center', fig.width=6, fig.height=5}
232 | # residuals plot (as in FPP)
233 | plot(dat$Father, reg$residuals)
234 | abline(h = 0, lty = 2)  # horizontal dashed line
235 | ```
236 | 
237 | The difference is only in the scale of the horizontal axis. But the important 
238 | part in both plots is the shape of the cloud.
239 | As you look across the residual plot, there is no systematic tendency for the 
240 | points to drift up or down. The red line displayed by `plot(reg, which = 1)` 
241 | is a smooth curve (a lowess smoother) fitted to the residuals. When there is no 
242 | systematic trend in the residuals, this curve is basically horizontal. This is 
243 | what you want to see when inspecting the residual plot. Why? Because it supports 
244 | the appropriate use of the regression line.
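
One rough way to check "same scatter" with numbers is to cut the x-axis into vertical strips and compare the spread of the residuals inside each strip. The sketch below does this with simulated data (all values are made up): `y_homo` has constant spread around the line, while `y_hetero` fans out as `x` grows. The strip boundaries and variable names are arbitrary choices.

```r
set.seed(99)
x = runif(500, min = 60, max = 75)

# same line, two kinds of noise
y_homo = 30 + 0.5 * x + rnorm(500, sd = 2)                 # constant spread
y_hetero = 30 + 0.5 * x + rnorm(500, sd = 0.4 * (x - 59))  # spread grows with x

# residuals from each regression
res_homo = lm(y_homo ~ x)$residuals
res_hetero = lm(y_hetero ~ x)$residuals

# three vertical strips
strips = cut(x, breaks = c(60, 65, 70, 75), include.lowest = TRUE)

# SD of the residuals within each strip
spread_homo = tapply(res_homo, strips, sd)
spread_hetero = tapply(res_hetero, strips, sd)
spread_homo
spread_hetero
```

In the homoscedastic case the three values should be similar; in the heteroscedastic case they should increase from the first strip to the last.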
245 | 
246 | 
247 | ## Summary output
248 | 
249 | `reg` is an object of class `"lm"`---linear model. For this type of R object, 
250 | you can use the `summary()` function to get additional information and diagnostics:
251 | 
252 | ```{r}
253 | # summarized linear model
254 | sum_reg = summary(reg)
255 | sum_reg
256 | ```
257 | 
258 | The information displayed by `summary()` is the typical output that most 
259 | statistical programs provide about a simple linear regression model. There 
260 | are four major parts: 
261 | 
262 | - `Call`: the command used when invoking `lm()`.
263 | - `Residuals`: summary indicators of the residuals.
264 | - `Coefficients`: table of regression coefficients.
265 | - Additional statistics: more diagnostic tools.
266 | 
267 | In the same way that `lm()` produces `"lm"` objects, `summary()` of an `"lm"` 
268 | object produces a `"summary.lm"` object. This type of object also contains 
269 | more information than what is displayed by default. To see the list of all the 
270 | components in `sum_reg`, you can again use the function `names()`:
271 | 
272 | ```{r}
273 | names(sum_reg)
274 | ```
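
Individual components can be extracted with the `$` operator. The following sketch uses the built-in `cars` data (an assumption made only so the example is self-contained) and pulls out three components of a `"summary.lm"` object: `coefficients`, `r.squared`, and `sigma`. The names `fit` and `sum_fit` are hypothetical.

```r
fit = lm(dist ~ speed, data = cars)
sum_fit = summary(fit)

# table of coefficients (a matrix)
sum_fit$coefficients

# r-squared: in simple regression, the square of the correlation coefficient
sum_fit$r.squared
cor(cars$speed, cars$dist)^2

# residual standard error: like the r.m.s. error, but dividing by n - 2
n = nrow(cars)
rms = sqrt(mean(fit$residuals^2))
sum_fit$sigma
rms * sqrt(n / (n - 2))
```

The residual standard error reported by `summary()` is therefore slightly larger than the r.m.s. error of FPP, although for large data sets the two are nearly identical.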
275 | 
276 | 


--------------------------------------------------------------------------------
/scripts/10-prediction-and-errors-in-regression.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/10-prediction-and-errors-in-regression.pdf


--------------------------------------------------------------------------------
/scripts/11-binomial-formula.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/11-binomial-formula.pdf


--------------------------------------------------------------------------------
/scripts/12-chance-process.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Chance Processes and Variability"
  3 | subtitle: "Intro to Stats, Spring 2017"
  4 | author: "Prof. Gaston Sanchez"
  5 | output: html_document
  6 | fontsize: 11pt
  7 | urlcolor: blue
  8 | ---
  9 | 
 10 | > ### Learning Objectives
 11 | >
 12 | > - How to use R to simulate chance processes
 13 | > - Getting to know the function `sample()`
 14 | > - Simulate flipping a coin
 15 | > - Simulate rolling a die
 16 | > - Simulate drawing tickets from a box
 17 | 
 18 | 
 19 | ```{r setup, include=FALSE}
 20 | knitr::opts_chunk$set(echo = TRUE)
 21 | ```
 22 | 
 23 | ## Introduction
 24 | 
 25 | In this tutorial we will see how to use R to simulate basic chance processes 
 26 | like tossing a coin, rolling a die, or drawing tickets from a box. The aim is 
 27 | to give you some tools that allow you to better understand and visualize 
 28 | fundamental concepts such as the law of large numbers, the law of averages, 
 29 | and the central limit theorem.
 30 | 
 31 | 
 32 | ## Coins, Dice, and Boxes with Tickets
 33 | 
 34 | Chance processes, also referred to as chance experiments, have to do with 
 35 | actions whose outcome cannot be predicted with certainty and may be different 
 36 | in each occurrence.
 37 | 
 38 | Typical examples of basic chance processes are tossing one or more coins, 
 39 | rolling one or more dice, selecting one or more cards from a deck of cards, 
 40 | and in general, things that can be framed in terms of drawing tickets out of 
 41 | a box (or any other type of container: bag, urn, etc.).
 42 | 
 43 | You can use your computer, and R in particular, to simulate chance processes.
 44 | In order to do that, the first step consists of learning how to create a 
 45 | virtual coin, or die, or box with tickets.
 46 | 
 47 | 
 48 | ### Creating a coin
 49 | 
 50 | The simplest way to create a coin with two sides, `"heads"` and `"tails"`, is 
 51 | with an R vector via the _combine_ function `c()`
 52 | 
 53 | ```{r}
 54 | coin = c("heads", "tails")
 55 | ```
 56 | 
 57 | You can also create a _numeric_ coin that shows `0` and `1` instead of
 58 | `"heads"` and `"tails"`:
 59 | 
 60 | ```{r}
 61 | coin = c(0, 1)
 62 | ```
 63 | 
 64 | 
 65 | ### Creating a die
 66 | 
 67 | What about simulating a die in R? Pretty much the same way you create a coin:
 68 | simply define a vector with numbers representing the number of spots in a die.
 69 | 
 70 | ```{r}
 71 | die = c(1, 2, 3, 4, 5, 6)
 72 | 
 73 | # equivalent
 74 | die = 1:6
 75 | ```
 76 | 
 77 | 
 78 | ### Creating a box with tickets
 79 | 
 80 | Likewise, you can create a general box with tickets. For instance, say you have 
 81 | a box with tickets labeled 1, 2, 3 and 4; this can be implemented in R as:
 82 | 
 83 | ```{r}
 84 | tickets = c(1, 2, 3, 4)
 85 | ```
 86 | 
 87 | 
 88 | 
 89 | ## Drawing tickets with `sample()`
 90 | 
 91 | Once you have an object that represents the _box with tickets_, the next step 
 92 | involves learning how to draw tickets from the box. One way to simulate drawing 
 93 | tickets from a box in R is with the function `sample()` which lets you draw 
 94 | random samples, with or without replacement, from an input vector. 
 95 | 
 96 | For example, consider a "box" with tickets 1, 2, 3. 
 97 | To draw one ticket, use `sample()` like this:
 98 | 
 99 | ```{r}
100 | # box with tickets
101 | tickets = c(1, 2, 3)
102 | 
103 | # draw one ticket
104 | sample(tickets, size = 1)
105 | ```
106 | 
107 | By default, `sample()` draws each ticket with the same probability. In other 
108 | words, each ticket is assigned the same probability of being chosen. Another 
109 | default behavior of `sample()` is to take a sample of the specified `size` 
110 | without replacement. If `size = 1`, it does not really matter whether the sample 
111 | is done with or without replacement. 
112 | 
113 | To draw two tickets WITHOUT replacement, use `sample()` like this:
114 | 
115 | ```{r}
116 | # draw 2 tickets without replacement
117 | sample(tickets, size = 2)
118 | ```
119 | 
120 | To draw two tickets WITH replacement, use `sample()` and specify its argument 
121 | `replace = TRUE`, like this:
122 | 
123 | ```{r}
124 | # draw 2 tickets with replacement
125 | sample(tickets, size = 2, replace = TRUE)
126 | ```
127 | 
128 | The way `sample()` works is by taking a random sample from the input vector. 
129 | This means that every time you invoke `sample()` you will likely get a different 
130 | output.
131 | 
132 | In order to make the examples replicable (so you can get the same output as me),
133 | you need to specify what is called a __random seed__. This is done with the 
134 | function `set.seed()`. By setting a _seed_, every time 
135 | you use one of the random generator functions, like `sample()`, you will get 
136 | the same values.
137 | 
138 | ```{r}
139 | # set random seed
140 | set.seed(1257)
141 | 
142 | # draw 4 tickets with replacement
143 | sample(tickets, size = 4, replace = TRUE)
144 | ```
145 | 
146 | Try the code above. You should get the exact same sample.
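
To convince yourself that the seed is what makes the sample repeatable, here is a small sketch: resetting the same seed before each call to `sample()` produces identical draws. The names `first` and `second` are just illustrative.

```r
tickets = c(1, 2, 3)

# same seed, same draws
set.seed(1257)
first = sample(tickets, size = 4, replace = TRUE)

set.seed(1257)
second = sample(tickets, size = 4, replace = TRUE)

identical(first, second)
```

If you remove the second `set.seed()` call, the two samples will most likely differ.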
147 | 
148 | Last but not least, `sample()` comes with the argument `prob` which allows you 
149 | to provide specific probabilities for each element in the input vector.
150 | 
151 | By default, `prob = NULL`, which means that every element has the same 
152 | probability of being drawn. In the example of tossing a coin, the command 
153 | `sample(coin)` is equivalent to `sample(coin, prob = c(0.5, 0.5))`. In the 
154 | latter case we explicitly specify a probability of 50% chance of heads, and 
155 | 50% chance of tails:
156 | 
157 | ```{r echo = FALSE}
158 | # tossing a fair coin
159 | coin = c("heads", "tails")
160 | 
161 | sample(coin)
162 | sample(coin, prob = c(0.5, 0.5))
163 | ```
164 | 
165 | However, you can provide different probabilities for each of the elements in 
166 | the input vector. For instance, to simulate a __loaded__ coin with chance of 
167 | heads 20%, and chance of tails 80%, set `prob = c(0.2, 0.8)` like so:
168 | 
169 | ```{r}
170 | # tossing a loaded coin (20% heads, 80% tails)
171 | sample(coin, size = 5, replace = TRUE, prob = c(0.2, 0.8))
172 | ```
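
With only 5 tosses it is hard to tell whether the coin is really loaded. As a rough check, the sketch below tosses the loaded coin many times and computes the proportion of heads, which should be close to 0.2. The seed and the number of tosses are arbitrary choices, and `tosses` and `prop_heads` are hypothetical names.

```r
set.seed(2017)
coin = c("heads", "tails")

# 10,000 tosses of the loaded coin (20% heads, 80% tails)
tosses = sample(coin, size = 10000, replace = TRUE, prob = c(0.2, 0.8))

# proportion of heads
prop_heads = mean(tosses == "heads")
prop_heads
```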
173 | 
174 | 
175 | -----
176 | 
177 | 
178 | ## Simulating tossing a coin
179 | 
180 | Now that we've talked about `sample()`, let's use R to implement code that 
181 | simulates tossing a fair coin one or more times. 
182 | 
183 | __Recap.__ To toss a coin using R, we first need an object that plays the role 
184 | of a coin. A simple way to create a `coin` is using a vector with two elements: 
185 | `"heads"` and `"tails"`. Then, to simulate tossing a coin one or more times, 
186 | we use the `sample()` function.
187 | Here's how to simulate a coin toss using `sample()` to take a random sample of 
188 | size 1 from `coin`:
189 | 
190 | ```{r coin-vector}
191 | # coin object
192 | coin <- c("heads", "tails")
193 | 
194 | # one toss
195 | sample(coin, size = 1)
196 | ```
197 | 
198 | To simulate multiple tosses, just change the `size` argument, and specify
199 | sampling with replacement (`replace = TRUE`):
200 | 
201 | ```{r various-tosses}
202 | # 3 tosses
203 | sample(coin, size = 3, replace = TRUE)
204 | 
205 | # 6 tosses
206 | sample(coin, size = 6, replace = TRUE)
207 | ```
208 | 
209 | 
210 | ### Coin Simulations
211 | 
212 | Now that we have all the elements to toss a coin with R, let's simulate flipping 
213 | a coin 100 times, and use the function `table()` to count the resulting number 
214 | of `"heads"` and `"tails"`:
215 | 
216 | ```{r}
217 | # number of flips
218 | num_flips = 100
219 | 
220 | # flips simulation
221 | coin = c('heads', 'tails')
222 | flips = sample(coin, size = num_flips, replace = TRUE)
223 | 
224 | # number of heads and tails
225 | freqs = table(flips)
226 | freqs
227 | ```
228 | 
229 | In my case, I got `r freqs[1]` heads and `r freqs[2]` tails. Your results will 
230 | probably be different than mine. Some of you will get more `"heads"`, some of 
231 | you will get more `"tails"`, and some will get exactly 50 `"heads"` and 50 
232 | `"tails"`.
233 | 
234 | Run another series of 100 flips, and find the frequency of `"heads"` and `"tails"`:
235 | 
236 | ```{r}
237 | flips = sample(coin, size = num_flips, replace = TRUE)
238 | freqs = table(flips)
239 | freqs
240 | ```
241 | 
242 | Let's make things a little bit more complex but also more interesting. The idea 
243 | is to repeat 100 flips 1000 times. To carry out this simulation, we are going 
244 | to use a programming structure called a `for` loop. This is one way to tell 
245 | the computer to repeat the same action a given number of times. 
246 | Don't worry about this. Just execute the following lines of code:
247 | 
248 | ```{r}
249 | # total number of repetitions
250 | times = 1000
251 | 
252 | # "empty" vectors to store number of heads and tails in each repetition
253 | heads = numeric(times)
254 | tails = numeric(times)
255 | 
256 | # 100 flips of a coin, repeated 1000 times
257 | for (i in 1:times) {
258 |   flips = sample(coin, size = 100, replace = TRUE)
259 |   freqs = table(flips)
260 |   heads[i] = freqs[1]
261 |   tails[i] = freqs[2]
262 | }
263 | ```
264 | 
265 | What the code above is doing is simulating 100 flips of a coin, not once, 
266 | not twice, but 1000 times. In each repetition, we count how many `"heads"` 
267 | and how many `"tails"`, and store those counts in the vectors `heads` and 
268 | `tails`, respectively. 
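
If you prefer to avoid writing the loop yourself, one possible alternative is the base R function `replicate()`, which evaluates an expression a given number of times and collects the results. The following is only a sketch of that idea (the seed is an arbitrary choice); it counts heads with `sum()` instead of `table()`.

```r
set.seed(42)
coin = c("heads", "tails")
times = 1000

# number of heads in 100 flips, repeated 1000 times
heads = replicate(times, sum(sample(coin, size = 100, replace = TRUE) == "heads"))

# the number of tails is whatever is left out of the 100 flips
tails = 100 - heads

# a quick look at the results
summary(heads)
```

Counting with `sum(flips == "heads")` also works in the (extremely unlikely) event that a series of flips contains no heads at all.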
269 | 
270 | Each vector, `heads` and `tails`, contains 1000 values. We can use a bar chart 
271 | to see the empirical relative frequencies:
272 | 
273 | ```{r fig.align='center', out.width='75%', fig.height=4.5}
274 | barplot(table(heads)/1000, las = 1, cex.names = 0.5, border = NA,
275 |         main = "Frequency of number of heads in 100 flips")
276 | ```
277 | 
278 | ```{r fig.align='center', out.width='75%', fig.height=4.5}
279 | barplot(table(tails)/1000, las = 1, cex.names = 0.5, border = NA,
280 |         main = "Frequency of number of tails in 100 flips")
281 | ```
282 | 
283 | 
284 | 
285 | ## Frequencies
286 | 
287 | Typical probability problems that have to do with coin tossing require
288 | computing the total proportion of `"heads"` and `"tails"`:
289 | 
290 | ```{r five-tosses}
291 | # five tosses
292 | five <- sample(coin, size = 5, replace = TRUE)
293 | 
294 | # proportion of heads
295 | sum(five == "heads") / 5
296 | 
297 | # proportion of tails
298 | sum(five == "tails") / 5
299 | ```
300 | 
301 | It is also customary to compute the running relative frequencies of `"heads"` 
302 | and `"tails"`, that is, the proportion observed after each toss in the series:
303 | 
304 | ```{r relative-freqs}
305 | # relative frequencies of heads
306 | cumsum(five == "heads") / 1:length(five)
307 | 
308 | # relative frequencies of tails
309 | cumsum(five == "tails") / 1:length(five)
310 | ```
311 | 
312 | Likewise, it is common to look at how the relative frequencies of heads or 
313 | tails change over a series of tosses:
314 | 
315 | ```{r plot-freqs}
316 | set.seed(5938)
317 | hundreds <- sample(coin, size = 500, replace = TRUE)
318 | head_freqs = cumsum(hundreds == "heads") / 1:500
319 | 
320 | plot(1:500, head_freqs, type = "l", ylim = c(0, 1), las = 1,
321 |      col = "#3989f8", lwd = 2,
322 |      xlab = 'number of tosses',
323 |      ylab = 'frequency of heads')
324 | # reference line at 0.5
325 | abline(h = 0.5, col = 'gray50', lwd = 1.5, lty = 2)
326 | ```
327 | 
328 | So far we have written code in R that simulates tossing a coin one or more
329 | times. We have included commands to compute the proportion of heads and tails, 
330 | as well as the relative frequencies of heads (or tails) in a series of tosses.
331 | In addition, we have produced a plot of the relative frequencies and seen
332 | how, as the number of tosses increases, the frequency of heads (and tails) 
333 | approaches 0.5.
334 | 
335 | 
336 | -----
337 | 
338 | ## Simulating rolling a die
339 | 
340 | Now that you know how to simulate flipping a coin one or more times, you can 
341 | do the same to simulate rolling a die:
342 | 
343 | ```{r}
344 | die = 1:6
345 | 
346 | # rolling a die once
347 | sample(die, size = 1)
348 | 
349 | # rolling a pair of dice
350 | sample(die, size = 2, replace = TRUE)
351 | 
352 | # rolling a die 5 times
353 | sample(die, size = 5, replace = TRUE)
354 | ```
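
The same tools used for the coin simulations apply to dice. As a sketch, the code below rolls a pair of dice many times, adds the spots in each roll, and tabulates the relative frequency of each sum. The seed and the number of repetitions are arbitrary choices, and `rolls` and `sums` are hypothetical names.

```r
set.seed(321)
die = 1:6
rolls = 10000

# sum of a pair of dice, repeated many times
sums = replicate(rolls, sum(sample(die, size = 2, replace = TRUE)))

# empirical relative frequency of each sum
table(sums) / rolls

# proportion of rolls adding up to 7 (the chance is 6/36, about 0.167)
mean(sums == 7)
```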
355 | 
356 | 
357 | 


--------------------------------------------------------------------------------
/scripts/12-chance-process.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/12-chance-process.pdf


--------------------------------------------------------------------------------
/scripts/Makefile:
--------------------------------------------------------------------------------
 1 | # input files
 2 | RMDS = $(wildcard *.Rmd)
 3 | 
 4 | # output files
 5 | PDFS = $(patsubst %.Rmd, %.pdf, $(RMDS))
 6 | HTMLS = $(patsubst %.Rmd, %.html, $(RMDS))
 7 | 
 8 | 
 9 | .PHONY: all htmls clean
10 | 
11 | 
12 | all: $(PDFS)
13 | 
14 | 
15 | htmls: $(HTMLS)
16 | 
17 | 
18 | %.pdf: %.Rmd
19 | 	Rscript -e "library(rmarkdown); render('$<', output_format = 'pdf_document')"
20 | 
21 | 
22 | %.html: %.Rmd
23 | 	Rscript -e "library(rmarkdown); render('$<', output_format = 'html_document')"
24 | 
25 | 
26 | clean:
27 | 	rm -rf *.pdf *.html
28 | 


--------------------------------------------------------------------------------
/scripts/README.md:
--------------------------------------------------------------------------------
1 | # Intro Stats Scripts
2 | 
3 | This folder contains the Rmd scripts used in lecture as well as out-of-class for the introductory courses to Probability and Statistics at UC Berkeley.
4 | 
5 | 
6 | 


--------------------------------------------------------------------------------
/scripts/images/karl-pearson.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/images/karl-pearson.jpg


--------------------------------------------------------------------------------
/scripts/images/western-conference-standings-2016.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/images/western-conference-standings-2016.png


--------------------------------------------------------------------------------
/syllabus/README.md:
--------------------------------------------------------------------------------
1 | ## Syllabus
2 | 
3 | > - [Stat 20](syllabus-stat20.md)
4 | > - [Stat 131A](syllabus-stat131A.md)
5 | 
6 | ![](mrs-mutner-rules.jpg)


--------------------------------------------------------------------------------
/syllabus/mrs-mutner-rules.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/syllabus/mrs-mutner-rules.jpg


--------------------------------------------------------------------------------
/syllabus/syllabus-stat131A.md:
--------------------------------------------------------------------------------
  1 | ## Course Syllabus Stat 131A
  2 | 
  3 | Stat 131A: Introduction to Probability and Statistics for Life Scientists, Spring 2017
  4 | 
  5 | - __Instructor:__ Gaston Sanchez, gaston.stat[at]gmail.com
  6 | - __Class Time:__ MWF 2-3pm in 50 Birge
  7 | - __Session Dates:__ 01/18/17 - 05/05/17
  8 | - __Code #:__ 23461
  9 | - __Units:__ 4 (more info [here](http://classes.berkeley.edu/content/2017-spring-stat-131A-001-lec-001))
 10 | - __Office Hours:__ TuTh 11:30am-12:30pm in 309 Evans (or by appointment)
 11 | - __Text:__ Statistics, 4th edition (by Freedman, Pisani, and Purves)
 12 | - __Final:__ Tue, May-09, 11:30am-2:30pm
 13 | - __GSIs:__ Office hours of the GSIs will be posted on the bCourses page. You can go to the office hours of __any__ GSI, not just your own.
 14 | 
 15 | | Discussion | Date      | Room             | GSI          |
 16 | |------------|-----------|------------------|--------------|
 17 | | 101        | MW  3-4pm | 250 Sutardja Dai | Shuhui Huang |
 18 | | 102        | MW  3-4pm | B51 Hildebrand   | Andy Mao     |
 19 | | 103        | MW  4-5pm |   9 Evans        | Shuhui Huang |
 20 | | 104        | MW  5-6pm |  70 Evans        | Andy Mao     |
 21 | 
 22 | 
 23 | ### Description
 24 | 
 25 | __Statistics 131A__ is designed primarily as an introduction to statistical thinking. You need to be comfortable with math at the level of intermediate algebra (e.g., the equation of a straight line, plotting points, taking powers and roots, percentages).
 26 | 
 27 | The emphasis of the course is critical thinking about quantitative evidence. Topics include reasoning and fallacies, descriptive statistics, association, correlation, regression, elements of probability, set theory, chance variability, random variables, expectation, standard error, sampling, hypothesis tests, confidence intervals, experiments and observational studies.
 28 | 
 29 | 
 30 | ### Methods of Instruction
 31 | 
 32 | Using a combination of lecture and student participation, each class session will focus on learning the fundamentals. The required textbook is the classic __Statistics__ 4th edition by Freedman, Pisani and Purves. 
 33 | 
 34 | I firmly believe that one cannot do statistical computations without the help of good statistical software. In this course, you will be asked to do various assignments and practical work using the [statistical software R](https://www.r-project.org/). The main idea is to use R as a supporting tool to help you apply the key concepts of the textbook.
 35 | 
 36 | 
 37 | ### Homework Assignments
 38 | 
 39 | - Homework will be assigned almost every week (about 13 assignments).
 40 | - You will submit your homework electronically via bCourses (as a Word, text, PDF, or HTML file).
 41 | - I will drop your lowest HW score.
 42 | - Don't wait until the last hour to do an assignment. Plan ahead and pace yourself.
 43 | - Don't wait until the last minute to submit your assignment.
 44 | - __No late assignments__ will be accepted, for any reason, including, but not limited to, theft or any extraordinary circumstances (e.g., illness, exhaustion, mourning, loss of internet connection, bCourses being down, a broken computer).
 45 | - Note that answers to non-review questions are in the back of the book so you can check that your answer is correct.
 46 | - Solutions to the review exercises will be posted on bCourses.
 47 | - If you collaborate with other students when working on a HW assignment, please include the names of those students in your submission.
 48 | - You must write your own answers (using your own words). Copying and plagiarism will not be tolerated (see _Academic Honesty_ policy).
 49 | 
 50 | 
 51 | ### Discussion
 52 | 
 53 | - Discussion is an important part of the class and is meant to supplement lecture.
 54 | - Your GSI will review and expand on concepts introduced in lecture and encourage you to solve problems in groups.
 55 | - There will be about four or five short quizzes given in discussion to test your understanding.
 56 | - Your quiz scores __will NOT__ be part of your grade.
 57 | - Students must attend the discussion group they are officially registered in.
 58 | 
 59 | 
 60 | ### Exams
 61 | 
 62 | - There will be two 50-minute in-class midterms, and one 3-hour final exam.
 63 | - The tentative dates of the midterms are Friday Feb-24, and Friday Apr-07. 
 64 | - The final exam is currently scheduled for Tuesday, May 9th, from 11:30am-2:30pm (classroom to be announced).
 65 | - If you do not take the final, you will NOT pass the class.
 66 | - There will be __no early or makeup exams__.
 67 | - ~~To ask for regrading, you must answer a test using pen. Tests answered with pencil will not be accepted for regrading.~~
 68 | - We will use _Gradescope_ to grade your tests (so you can use pen or pencil).
 69 | - When asking for regrading, please clearly state the reasons that make you think you deserve a higher score.
 70 | - You will have one full week after grades are published on Gradescope to request a regrade.
 71 | - After the regrade deadline, no requests will be considered.
 72 | 
 73 | 
 74 | ### Grading Structure
 75 | 
 76 | - 20% homework (lowest 1 dropped)
 77 | - 25% midterm 1
 78 | - 25% midterm 2
 79 | - 30% final
 80 | 
 81 | No individual letter grades will be given for the midterms or the final. You will get a letter grade for the course based on your overall score. Final grades will be assigned on a 30/30/30/10 (A/B/C/DF) scale.
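
The weights above can be combined into an overall score with a few lines of R. This is only a rough sketch: the scores are made-up values, and it assumes every component is on a 0-100 scale and that all homework assignments count equally, neither of which is stated in this syllabus.

```r
# Hypothetical scores on a 0-100 scale (illustrative values only)
hw <- c(90, 85, 100, 70, 95, 88, 92, 60, 100, 80, 75, 98, 85)
midterm1 <- 80
midterm2 <- 90
final <- 85

# Drop the single lowest homework score, then average the rest
# (assumes all homework assignments carry equal weight)
hw_kept <- sort(hw)[-1]
hw_avg <- mean(hw_kept)

# Weights from the grading structure: 20% / 25% / 25% / 30%
weights <- c(homework = 0.20, midterm1 = 0.25, midterm2 = 0.25, final = 0.30)
scores <- c(homework = hw_avg, midterm1 = midterm1,
            midterm2 = midterm2, final = final)

# Overall score is the weighted sum of the four components
overall <- sum(weights * scores)
overall
```

Replacing the values in `hw`, `midterm1`, `midterm2`, and `final` with your own scores gives an estimate of your overall score; the letter grade itself is assigned separately.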
 82 | 
 83 | 
 84 | ### Calculator Policy
 85 | 
 86 | - You will need a calculator that adds, subtracts, multiplies, divides, takes square roots, raises numbers to a power, and preferably also computes factorials. Statistical calculators are unnecessary.
 87 | - However, no graphing calculators, phone calculators or tablet calculators are allowed.
 88 | - If you do not bring a calculator to a midterm or the final, do your computations by hand (you won't be allowed to borrow someone else's calculator).
 89 | 
 90 | 
 91 | ### Academic Honesty
 92 | 
 93 | I (Gaston Sanchez) expect you to do your own work and to uphold the standards of intellectual integrity. Collaborating on homework is fine and I encourage you to work together---but copying is not, nor is having somebody else submit assignments for you. Cheating will not be tolerated. Anyone found cheating will receive an F and will be reported to the [Center for Student Conduct](http://sa.berkeley.edu/conduct). If you are having trouble with an assignment or studying for an exam, or if you are uncertain about permissible and impermissible conduct or collaboration, please come see me with your questions.
 94 | 
 95 | 
 96 | ### Email Policy
 97 | 
 98 | - You should try to use email as a tool to set up a one-on-one meeting with me if office hours conflict with your schedule.
 99 | - Use the subject line __Stat 131 Meeting Request__.
100 | - Your message should include at least two times when you would like to meet and a brief (one- or two-sentence) description of the reason for the meeting.
101 | - Do NOT expect me to reply right away (I may not reply on time).
102 | - If you have an emergency, talk to me later during class or office hours.
103 | - I strongly encourage you to ask questions about the syllabus, covered material, and assignments during class time or lab discussions. 
104 | - I prefer to have conversations in person rather than via email, thus allowing us to get to know each other better and fostering a more collegial learning atmosphere.
105 | 
106 | 
107 | ### Accommodation Policy
108 | 
109 | Students needing accommodations for any physical, psychological, or learning disability, should speak with me during the first two weeks of the semester, either after class or during office hours and see [http://dsp.berkeley.edu](http://dsp.berkeley.edu) to learn about Berkeley’s policy. If you are a DSP student, please contact me at least three weeks prior to a midterm or final so that we can work out acceptable accommodations via the DSP Office.
110 | 
111 | If you are an athlete or Cal band member, please check your calendar and come see me in office hours as soon as possible during the first two weeks of the semester. Please try your best to be present at each of the midterms as I cannot guarantee accommodation for a late exam.
112 | 
113 | 
114 | ### Safe, Supportive, and Inclusive Environment
115 | 
116 | Whenever a faculty member, staff member, post-doc, or GSI is responsible for 
117 | the supervision of a student, a personal relationship between them of a 
118 | romantic or sexual nature, even if consensual, is against university policy. 
119 | Any such relationship jeopardizes the integrity of the educational process.
120 | 
121 | Although faculty and staff can act as excellent resources for students, you 
122 | should be aware that they are required to report any violations of this campus 
123 | policy. If you wish to have a confidential discussion on matters related to this 
124 | policy, you may contact the _Confidential Care Advocates_ on campus for support 
125 | related to counseling or sensitive issues. Appointments can be
126 | made by calling (510) 642-1988.
127 | 
128 | The classroom, lab, and work place should be safe and inclusive environments 
129 | for everyone. The _Office for the Prevention of Harassment and Discrimination_ 
130 | (OPHD) is responsible for ensuring the University provides an environment for 
131 | faculty, staff and students that is free from discrimination and harassment on 
132 | the basis of categories including race, color, national origin, age, sex, gender, 
133 | gender identity, and sexual orientation. Questions or concerns? 
134 | Call (510) 643-7985, email ask_ophd@berkeley.edu, or go to 
135 | [http://survivorsupport.berkeley.edu/](http://survivorsupport.berkeley.edu/).
136 | 
137 | 
138 | ### Incomplete Policy
139 | 
140 | Under emergency/special circumstances, students may petition me to receive an Incomplete grade. By University policy, an Incomplete requires (i) that the student was performing passing-level work until the time that (ii) something happened that---through no fault of the student---prevented the student from completing the coursework. If you take the final, you have completed the course, even if you took it while ill, exhausted, mourning, etc. The time to talk to me about an Incomplete is BEFORE you take the final, when the situation that prevents you from finishing presents itself. Please clearly state your reasoning in your comments to me.
141 | 
142 | It is your responsibility to develop good time management skills and study habits, to know your limits, and to learn to ask for professional help.
143 | Life happens. Social, family, cultural, academic, and individual circumstances can affect your performance (both positively and negatively). If you find yourself in a situation that raises concerns about passing the course, please come see me as soon as possible. Above all, please do not wait until the end of the semester to share your concerns about passing the course because it will be too late by then.
144 | 
145 | 
146 | ### Letters of Recommendation
147 | 
148 | Unless I have known you for at least one year, and we have developed a good collegial relationship, I do not provide letters of recommendation.
149 | 
150 | 
151 | ### Additional Course Policies
152 | 
153 | - Be sure to pay attention to deadlines.
154 | - Out of consideration for everybody in the classroom, please turn off your cell phone during class and lab time.
155 | 
156 | 
157 | 
158 | ### Fine Print
159 | 
160 | The course deadlines, assignments, exam times and material are subject to change at the whim of the professor.
161 | 
162 | 


--------------------------------------------------------------------------------
/syllabus/syllabus-stat20.md:
--------------------------------------------------------------------------------
  1 | ## Course Syllabus Stat 20
  2 | 
  3 | Stat 20: Introduction to Probability and Statistics, Spring 2017
  4 | 
  5 | - __Instructor:__ Gaston Sanchez, gaston.stat[at]gmail.com
  6 | - __Class Time:__ MWF 12-1pm in 2050 VLSB
  7 | - __Session Dates:__ 01/18/17 - 05/05/17
  8 | - __Code #:__ 23407
  9 | - __Units:__ 4 (more info [here](http://classes.berkeley.edu/content/2017-spring-stat-20-001-lec-001))
 10 | - __Office Hours:__ TuTh 11:30am-12:30pm in 309 Evans (or by appointment)
 11 | - __Text:__ Statistics, 4th edition (by Freedman, Pisani, and Purves)
 12 | - __Final:__ Wed, May-10, 3:00-6:00pm
 13 | - __GSIs:__ Office hours of the GSIs will be posted on the bCourses page. You can go to the office hours of __any__ GSI, not just your own.
 14 | 
 15 | | Discussion | Date         | Room         | GSI             |
 16 | |------------|--------------|--------------|-----------------|
 17 | | 101        | TuTh  9-10A  | 332 Evans    | Yoni Ackerman   |
 18 | | 102        | TuTh  9-10A  | 334 Evans    | Yizhou Zhao     |
 19 | | 103        | TuTh  10-11A | 332 Evans    | Yoni Ackerman   |
 20 | | 104        | TuTh  10-11A | 334 Evans    | Yizhou Zhao     |
 21 | | 105        | TuTh  11-12P | 332 Evans    | Mingjia Chen    |
 22 | | 106        | TuTh  11-12P | 334 Evans    | Jill Berkin     |
 23 | | 107        | TuTh  12-1P  | 332 Evans    | Mingjia Chen    |
 24 | | 108        | TuTh  1-2P   | 332 Evans    | Yanli Fan       |
 25 | | 109        | TuTh  2-3P   | 334 Evans    | Yanli Fan       |
 26 | | 110        | TuTh  2-3P   | 340 Evans    | Rohit Bahirwani |
 27 | | 111        | TuTh  3-4P   | 334 Evans    | Shalika Gupta   |
 28 | | 112        | TuTh  3-4P   | 340 Evans    | Rohit Bahirwani |
 29 | | 113        | TuTh  4-5P   | 334 Evans    | Calvin Chi      |
 30 | | 114        | TuTh  5-6P   | 334 Evans    | Calvin Chi      |
 31 | | 115        | TuTh  5-6P   | 205 Dwinelle | Jill Berkin     |
 32 | | 116        | TuTh  5-6P   | 187 Dwinelle | Shalika Gupta   |
 33 | 
 34 | 
 35 | ### Description
 36 | 
 37 | __Statistics 20__ is designed primarily as an introduction to statistical thinking. You need to be comfortable with math at the level of intermediate algebra (e.g., the equation of a straight line, plotting points, taking powers and roots, percentages).
 38 | 
 39 | The emphasis of the course is critical thinking about quantitative evidence. Topics include reasoning and fallacies, descriptive statistics, association, correlation, regression, elements of probability, set theory, chance variability, random variables, expectation, standard error, sampling, hypothesis tests, confidence intervals, experiments and observational studies.
 40 | 
 41 | 
 42 | ### Methods of Instruction
 43 | 
 44 | Using a combination of lecture and student participation, each class session will focus on learning the fundamentals. The required textbook is the classic __Statistics__ 4th edition by Freedman, Pisani and Purves. 
 45 | 
 46 | I firmly believe that one cannot do statistical computations without the help of good statistical software. In this course, you will be asked to do various assignments and practical work using the [statistical software R](https://www.r-project.org/). The main idea is to use R as a supporting tool to help you apply the key concepts of the textbook.
 47 | 
 48 | 
 49 | ### Homework Assignments
 50 | 
 51 | - Homework will be assigned almost every week (about 13 assignments).
 52 | - You will submit your homework electronically via bCourses (as a Word, text, PDF, or HTML file).
 53 | - I will drop your lowest HW score.
 54 | - Don't wait until the last hour to do an assignment. Plan ahead and pace yourself.
 55 | - Don't wait until the last minute to submit your assignment.
 56 | - __No late assignments__ will be accepted, for any reason, including, but not limited to, theft or any extraordinary circumstances (e.g., illness, exhaustion, mourning, loss of internet connection, bCourses being down, a broken computer).
 57 | - Note that answers to non-review questions are in the back of the book so you can check that your answer is correct.
 58 | - Solutions to the review exercises will be posted on bCourses.
 59 | - If you collaborate with other students when working on a HW assignment, please include the names of those students in your submission.
 60 | - You must write your own answers (using your own words). Copying and plagiarism will not be tolerated (see _Academic Honesty_ policy).
 61 | 
 62 | 
 63 | ### Discussion
 64 | 
 65 | - Discussion is an important part of the class and is meant to supplement lecture.
 66 | - Your GSI will review and expand on concepts introduced in lecture and encourage you to solve problems in groups.
 67 | - There will be about four or five short quizzes given in discussion to test your understanding.
 68 | - Your quiz scores __will NOT__ be part of your grade.
 69 | - Students must attend the discussion group they are officially registered in.
 70 | 
 71 | 
 72 | ### Exams
 73 | 
 74 | - There will be two 50-minute in-class midterms, and one 3-hour final exam.
 75 | - The tentative dates of the midterms are Friday Feb-24, and Friday Apr-07. 
 76 | - The final exam is currently scheduled for Wednesday, May 10th, from 3:00pm-6:00pm (classroom to be announced).
 77 | - If you do not take the final, you will NOT pass the class.
 78 | - There will be __no early or makeup exams__.
 79 | - ~~To ask for regrading, you must answer a test using pen. Tests answered with pencil will not be accepted for regrading.~~
 80 | - We will use _Gradescope_ to grade your tests (so you can use pen or pencil).
 81 | - When asking for regrading, please clearly state the reasons that make you think you deserve a higher score.
 82 | - You will have one full week after grades are published on Gradescope to request a regrade.
 83 | - After the regrade deadline, no requests will be considered.
 84 | 
 85 | 
 86 | 
 87 | ### Grading Structure
 88 | 
 89 | - 20% homework (lowest 1 dropped)
 90 | - 25% midterm 1
 91 | - 25% midterm 2
 92 | - 30% final
 93 | 
 94 | No individual letter grades will be given for the midterms or the final. You will get a letter grade for the course based on your overall score. Final grades will be assigned on a 30/30/30/10 (A/B/C/DF) scale.
 95 | 
 96 | 
 97 | ### Calculator Policy
 98 | 
 99 | - You will need a calculator that adds, subtracts, multiplies, divides, takes square roots, raises numbers to a power, and preferably also computes factorials. Statistical calculators are unnecessary.
100 | - However, no graphing calculators, phone calculators or tablet calculators are allowed.
101 | - If you do not bring a calculator to a midterm or the final, do your computations by hand (you won't be allowed to borrow someone else's calculator).
102 | 
103 | 
104 | ### Academic Honesty
105 | 
106 | I (Gaston Sanchez) expect you to do your own work and to uphold the standards of intellectual integrity. Collaborating on homework is fine and I encourage you to work together---but copying is not, nor is having somebody else submit assignments for you. Cheating will not be tolerated. Anyone found cheating will receive an F and will be reported to the [Center for Student Conduct](http://sa.berkeley.edu/conduct). If you are having trouble with an assignment or studying for an exam, or if you are uncertain about permissible and impermissible conduct or collaboration, please come see me with your questions.
107 | 
108 | 
109 | ### Email Policy
110 | 
111 | - You should try to use email as a tool to set up a one-on-one meeting with me if office hours conflict with your schedule.
112 | - Use the subject line __Stat 20 Meeting Request__.
113 | - Your message should include at least two times when you would like to meet and a brief (one- or two-sentence) description of the reason for the meeting.
114 | - Do NOT expect me to reply right away (I may not reply on time).
115 | - If you have an emergency, talk to me later during class or office hours.
116 | - I strongly encourage you to ask questions about the syllabus, covered material, and assignments during class time or lab discussions. 
117 | - I prefer to have conversations in person rather than via email, thus allowing us to get to know each other better and fostering a more collegial learning atmosphere.
118 | 
119 | 
120 | ### Accommodation Policy
121 | 
122 | Students needing accommodations for any physical, psychological, or learning disability, should speak with me during the first two weeks of the semester, either after class or during office hours and see [http://dsp.berkeley.edu](http://dsp.berkeley.edu) to learn about Berkeley’s policy. If you are a DSP student, please contact me at least three weeks prior to a midterm or final so that we can work out acceptable accommodations via the DSP Office.
123 | 
124 | If you are an athlete or Cal band member, please check your calendar and come see me in office hours as soon as possible during the first two weeks of the semester. Please try your best to be present at each of the midterms as I cannot guarantee accommodation for a late exam.
125 | 
126 | 
127 | ### Safe, Supportive, and Inclusive Environment
128 | 
129 | Whenever a faculty member, staff member, post-doc, or GSI is responsible for 
130 | the supervision of a student, a personal relationship between them of a 
131 | romantic or sexual nature, even if consensual, is against university policy. 
132 | Any such relationship jeopardizes the integrity of the educational process.
133 | 
134 | Although faculty and staff can act as excellent resources for students, you 
135 | should be aware that they are required to report any violations of this campus 
136 | policy. If you wish to have a confidential discussion on matters related to this 
137 | policy, you may contact the _Confidential Care Advocates_ on campus for support 
138 | related to counseling or sensitive issues. Appointments can be
139 | made by calling (510) 642-1988.
140 | 
141 | The classroom, lab, and work place should be safe and inclusive environments 
142 | for everyone. The _Office for the Prevention of Harassment and Discrimination_ 
143 | (OPHD) is responsible for ensuring the University provides an environment for 
144 | faculty, staff and students that is free from discrimination and harassment on 
145 | the basis of categories including race, color, national origin, age, sex, gender, 
146 | gender identity, and sexual orientation. Questions or concerns? 
147 | Call (510) 643-7985, email ask_ophd@berkeley.edu, or go to 
148 | [http://survivorsupport.berkeley.edu/](http://survivorsupport.berkeley.edu/).
149 | 
150 | 
151 | ### Incomplete Policy
152 | 
153 | Under emergency/special circumstances, students may petition me to receive an Incomplete grade. By University policy, an Incomplete requires (i) that the student was performing passing-level work until the time that (ii) something happened that---through no fault of the student---prevented the student from completing the coursework. If you take the final, you have completed the course, even if you took it while ill, exhausted, mourning, etc. The time to talk to me about an Incomplete is BEFORE you take the final, when the situation that prevents you from finishing presents itself. Please clearly state your reasoning in your comments to me.
154 | 
155 | It is your responsibility to develop good time management skills and study habits, to know your limits, and to learn to ask for professional help.
156 | Life happens. Social, family, cultural, academic, and individual circumstances can affect your performance (both positively and negatively). If you find yourself in a situation that raises concerns about passing the course, please come see me as soon as possible. Above all, please do not wait until the end of the semester to share your concerns about passing the course because it will be too late by then.
157 | 
158 | 
159 | ### Letters of Recommendation
160 | 
161 | Unless I have known you for at least one year, and we have developed a good collegial relationship, I do not provide letters of recommendation.
162 | 
163 | 
164 | ### Additional Course Policies
165 | 
166 | - Be sure to pay attention to deadlines.
167 | - Out of consideration for everybody in the classroom, please turn off your cell phone during class and lab time.
168 | 
169 | 
170 | 
171 | ### Fine Print
172 | 
173 | The course deadlines, assignments, exam times and material are subject to change at the whim of the professor.
174 | 
175 | 


--------------------------------------------------------------------------------