├── .gitignore
├── README.md
├── apps
│   ├── README.md
│   ├── ch03-histograms
│   │   ├── README.md
│   │   └── app.R
│   ├── ch08-corr-coeff-diagrams
│   │   ├── README.md
│   │   └── app.R
│   ├── ch10-heights-data
│   │   ├── README.md
│   │   ├── app.R
│   │   └── helpers.R
│   ├── ch11-regression-residuals
│   │   ├── README.md
│   │   └── app.R
│   ├── ch11-regression-strips
│   │   ├── README.md
│   │   ├── app.R
│   │   └── helpers.R
│   ├── ch16-chance-error
│   │   ├── README.md
│   │   └── app.R
│   ├── ch17-demere-games
│   │   ├── README.md
│   │   └── app.R
│   ├── ch17-expected-value-std-error
│   │   ├── README.md
│   │   ├── app.R
│   │   └── helpers.R
│   ├── ch18-coin-tossing
│   │   ├── README.md
│   │   └── app.R
│   ├── ch18-roll-dice-product
│   │   ├── README.md
│   │   ├── app.R
│   │   └── helpers.R
│   ├── ch18-roll-dice-sum
│   │   ├── README.md
│   │   ├── app.R
│   │   └── helpers.R
│   ├── ch20-sampling-men
│   │   ├── README.md
│   │   └── app.R
│   ├── ch21-accuracy-percentages
│   │   ├── README.md
│   │   ├── app.R
│   │   └── helpers.R
│   └── ch23-accuracy-averages
│       ├── README.md
│       ├── app.R
│       └── helpers.R
├── data
│   ├── abalone.csv
│   ├── distributions.csv
│   ├── galton.csv
│   ├── nba_players.csv
│   ├── pearson.csv
│   ├── stock-earnings-prices.csv
│   └── vegetables-smoking.csv
├── hw
│   ├── README.md
│   ├── hw01-questions.pdf
│   ├── hw02-questions.pdf
│   ├── hw03-questions.pdf
│   ├── hw04-questions.pdf
│   ├── hw05-questions.pdf
│   ├── hw06-questions.pdf
│   ├── hw07-questions.pdf
│   ├── hw08-questions.pdf
│   ├── hw09-questions.pdf
│   ├── hw10-questions.pdf
│   ├── hw11-questions.pdf
│   └── hw12-questions.pdf
├── labs
│   └── README.md
├── lectures
│   └── README.md
├── other
│   ├── Karl-Pearson-and-the-origins-of-modern-statistics.pdf
│   ├── Quetelet-and-the-emergence-of-the-behavioral-sciences.pdf
│   ├── The-strange-science-of-Francis-Galton.pdf
│   ├── formula-sheet-final.pdf
│   ├── formula-sheet-midterm1.pdf
│   ├── formula-sheet-midterm2.pdf
│   ├── standard-normal-table.pdf
│   ├── t-table.pdf
│   └── z-table.pdf
├── scripts
│   ├── 01-R-introduction.Rmd
│   ├── 01-R-introduction.pdf
│   ├── 02-data-variables.Rmd
│   ├── 02-data-variables.pdf
│   ├── 03-histograms.Rmd
│   ├── 03-histograms.pdf
│   ├── 04-measures-center.Rmd
│   ├── 04-measures-center.pdf
│   ├── 05-measures-spread.Rmd
│   ├── 05-measures-spread.pdf
│   ├── 06-normal-curve.Rmd
│   ├── 06-normal-curve.pdf
│   ├── 07-scatter-diagrams.Rmd
│   ├── 07-scatter-diagrams.pdf
│   ├── 08-correlation.Rmd
│   ├── 08-correlation.pdf
│   ├── 09-regression-line.Rmd
│   ├── 09-regression-line.pdf
│   ├── 10-prediction-and-errors-in-regression.Rmd
│   ├── 10-prediction-and-errors-in-regression.pdf
│   ├── 11-binomial-formula.Rmd
│   ├── 11-binomial-formula.pdf
│   ├── 12-chance-process.Rmd
│   ├── 12-chance-process.pdf
│   ├── Makefile
│   ├── README.md
│   └── images
│       ├── karl-pearson.jpg
│       └── western-conference-standings-2016.png
└── syllabus
    ├── README.md
    ├── mrs-mutner-rules.jpg
    ├── syllabus-stat131A.md
    └── syllabus-stat20.md
/.gitignore:
--------------------------------------------------------------------------------
1 | # Mac specific
2 | *.DS_Store
3 |
4 | # latex specific
5 | *.aux
6 | *.log
7 |
8 | # files in labs/
9 | labs/.DS_Store
10 | labs/*.html
11 |
12 | # files in data/
13 | data/.DS_Store
14 | data/.Rhistory
15 |
16 | # files in scripts/
17 | scripts/.DS_Store
18 | scripts/.Rhistory
19 |
20 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## About
2 |
3 | This repository holds the course materials for the Spring 2017 edition of:
4 |
5 | - Stat 20 __Introduction to Probability and Statistics__ at UC Berkeley.
6 | - Stat 131A __Introduction to Probability and Statistics for Life Scientists__ at UC Berkeley.
7 |
8 |
9 | ## Contents
10 |
11 | - [Syllabus](syllabus): Course logistics and policies.
12 | - [Lectures](lectures): Calendar of weekly topics, and lecture materials.
13 | - [HW Assignments](hw): Weekly assignments.
14 | - [Labs](labs): Lab topics from the textbook.
15 | - [Scripts](scripts): Tutorial R scripts.
16 | - [Apps](apps): Shiny apps used in lecture demos.
17 | - [Data](data): Data sets.
18 | - [Other](other): Other resources (e.g. tables, articles).
19 |
20 |
21 | ## R and RStudio
22 |
23 | We will use the statistical software __[R](https://www.r-project.org/)__ and the
24 | [IDE](https://en.wikipedia.org/wiki/Integrated_development_environment)
25 | __[RStudio](https://www.rstudio.com/)__ as a computational tool to
26 | practice and apply the key concepts of the course.
27 |
28 | Both R and RStudio are free, and are available for Mac OS X, Windows, and Linux.
29 |
30 | To install R (Binary version):
31 |
32 | - Mac: [https://cran.cnr.berkeley.edu/bin/macosx/](https://cran.cnr.berkeley.edu/bin/macosx/)
33 | - Windows: [https://cran.cnr.berkeley.edu/bin/windows/](https://cran.cnr.berkeley.edu/bin/windows/)
34 |
35 | To install RStudio (free desktop version):
36 |
37 | - RStudio Desktop version [https://www.rstudio.com/products/rstudio/download/](https://www.rstudio.com/products/rstudio/download/)
38 |
39 |
40 | -----
41 |
42 | ### License
43 |
44 | 
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
45 |
46 | Author: [Gaston Sanchez](http://gastonsanchez.com)
47 |
--------------------------------------------------------------------------------
/apps/README.md:
--------------------------------------------------------------------------------
1 | # Shiny Apps
2 |
3 | This is a collection of Shiny apps to be used mainly during lecture to illustrate some of the concepts in the textbook _Statistics_ (FPP) 4th edition.
4 |
5 |
6 | ## Running the apps
7 |
8 | The easiest way to run an app is with the `runGitHub()` function from the `"shiny"` package. Please make sure you have installed the package `"shiny"`. In case of doubt, run:
9 |
10 | ```R
11 | install.packages("shiny")
12 | ```
13 |
14 |
15 | For instance, to run the app contained in the [ch03-histograms](ch03-histograms) folder, run the following code in R:
16 |
17 | ```R
18 | library(shiny)
19 |
20 | # Run an app from a subdirectory in the repo
21 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch03-histograms")
22 | ```
23 |
--------------------------------------------------------------------------------
/apps/ch03-histograms/README.md:
--------------------------------------------------------------------------------
1 | # Histograms for NBA players data
2 |
3 | This is a Shiny app that generates histograms using data of NBA players from the 2015-2016 season.
4 |
5 |
6 | ## Motivation
7 |
8 | The goal is to provide examples of histograms and distributions of quantitative variables. __Statistics, Chapter 3: The Histogram__ (pages 32-56):
9 |
10 | - A _variable_ is a characteristic of the subjects in a study. It can be either qualitative or quantitative.
11 | - A _histogram_ is a visual display used to look at the distribution of a quantitative variable.
12 | - A _histogram_ represents percents by area. It consists of a set of blocks. The area of each block represents the percentage of cases in the corresponding class interval.
13 | - With the _density scale_, the height of each block equals the percentage of cases in the corresponding class interval, divided by the length of that interval.
14 |
15 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
16 |
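The density-scale rule above can be checked with a minimal sketch in base R (the numeric vector here is a made-up example, not the NBA data):

```r
# hist() with probability = TRUE uses the density scale:
# block height = proportion of cases in the interval / interval length,
# so the total area of all blocks equals 1
x <- c(70, 71, 72, 72, 73, 74, 75, 75, 76, 80)
h <- hist(x, breaks = seq(70, 80, by = 2), probability = TRUE, plot = FALSE)
total_area <- sum(h$density * diff(h$breaks))   # should equal 1
```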
17 |
18 | ## Data
19 |
20 | The data set is in the `nba_players.csv` file (see `data/` folder) which contains 528 rows and 39 columns, although this app only uses quantitative variables.
21 |
22 |
23 | ## How to run it?
24 |
25 |
26 | ```R
27 | library(shiny)
28 |
29 | # Easiest way is to use runGitHub
30 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch03-histograms")
31 | ```
32 |
33 |
--------------------------------------------------------------------------------
/apps/ch03-histograms/app.R:
--------------------------------------------------------------------------------
1 | # Title: Histograms with NBA Players Data
2 | # Description: this app uses data of NBA players to show various histograms
3 | # Author: Gaston Sanchez
4 |
5 | library(shiny)
6 |
7 | # data set
8 | nba <- read.csv('../../data/nba_players.csv', header = TRUE)
9 |
10 | # quantitative variables
11 | quantitative <- c(
12 | "height","weight","salary","experience","age","games","games_started",
13 | "minutes_played","field_goals","field_goal_attempts","field_goal_percent",
14 | "points3","points3_attempts","points3_percent","points2","points2_attempts",
15 | "points2_percent","effective_field_goal_percent","free_throws",
16 | "free_throw_attempts","free_throw_percent","offensive_rebounds",
17 | "defensive_rebounds","total_rebounds","assists","steals","blocks",
18 | "turnovers","fouls","points")
19 |
20 | # select just quantitative variables
21 | dat <- nba[ ,quantitative]
22 |
23 |
24 | # Define UI for application that draws a histogram
25 | ui <- fluidPage(
26 |
27 | # Application title
28 | titlePanel("NBA Players"),
29 |
30 | # Sidebar with a slider input for number of bins
31 | sidebarLayout(
32 | sidebarPanel(
33 | selectInput("variable", "Select a Variable",
34 | choices = colnames(dat), selected = 'height'),
35 |
36 | sliderInput("bins",
37 | "Number of bins:",
38 | min = 1,
39 | max = 50,
40 | value = 10),
41 |
42 | checkboxInput('density', label = strong('Use density scale'))
43 | ),
44 |
45 | # Show a plot of the generated distribution
46 | mainPanel(
47 | plotOutput("histogram")
48 | )
49 | )
50 | )
51 |
52 |
53 | # Define server logic required to draw a histogram
54 | server <- function(input, output) {
55 |
56 | output$histogram <- renderPlot({
57 | # generate bins based on input$bins from ui.R
58 | x <- na.omit(dat[ ,input$variable])
59 | bins <- seq(min(x), max(x), length.out = input$bins + 1)
60 |
61 | hist(x, breaks = bins,
62 | probability = input$density,
63 | col = 'gray80', border = 'white', las = 1,
64 | axes = FALSE, xlab = "",
65 | main = paste("Histogram of", input$variable))
66 | axis(side = 2, las = 1)
67 | axis(side = 1, at = bins, labels = round(bins, 2))
68 |
69 | })
70 | }
71 |
72 | # Run the application
73 | shinyApp(ui = ui, server = server)
74 |
75 |
--------------------------------------------------------------------------------
/apps/ch08-corr-coeff-diagrams/README.md:
--------------------------------------------------------------------------------
1 | # Correlation Coefficient Diagrams
2 |
3 | This is a Shiny app that generates scatter diagrams based on the specified correlation coefficient.
4 |
5 |
6 | ## Motivation
7 |
8 | The goal is to provide examples of scatter diagrams like those displayed in the FPP book. See Chapter 8: The Correlation Coefficient (page 127).
9 |
10 | The scatter diagrams are based on randomly generated data following a multivariate normal distribution.
11 |
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
13 |
14 |
15 | ## How to run it?
16 |
17 |
18 | ```R
19 | library(shiny)
20 |
21 | # Easiest way is to use runGitHub
22 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch08-corr-coeff-diagrams")
23 | ```
24 |
25 |
--------------------------------------------------------------------------------
/apps/ch08-corr-coeff-diagrams/app.R:
--------------------------------------------------------------------------------
1 | #
2 | # This is a Shiny web application. You can run the application by clicking
3 | # the 'Run App' button above.
4 | #
5 | # Find out more about building applications with Shiny here:
6 | #
7 | # http://shiny.rstudio.com/
8 | #
9 |
10 | library(shiny)
11 | library(MASS)
12 |
13 |
14 | # Define UI for application that draws a scatter diagram
15 | ui <- fluidPage(
16 |
17 | # Application title
18 | titlePanel("Scatter Diagrams and Correlation"),
19 |
20 | # Sidebar with inputs for the simulated data and plot appearance
21 | sidebarLayout(
22 | sidebarPanel(
23 | numericInput("seed",
24 | "Random seed",
25 | min = 100,
26 | max = 99999,
27 | value = 1234),
28 | sliderInput("corr",
29 | "Correlation Coefficient",
30 | min = -1,
31 | max = 1,
32 | step = 0.05,
33 | value = 0.7),
34 | sliderInput("size",
35 | "Number of points",
36 | min = 10,
37 | max = 5000,
38 | step = 5,
39 | value = 500),
40 | sliderInput("cex",
41 | "Size of points",
42 | min = 0,
43 | max = 5,
44 | step = 0.1,
45 | value = 1),
46 | sliderInput("alpha",
47 | "Transparency of points",
48 | min = 0,
49 | max = 1,
50 | step = 0.01,
51 | value = 0.8)
52 | ),
53 |
54 | # Show a plot of the generated distribution
55 | mainPanel(
56 | plotOutput("scatterplot")
57 | )
58 | )
59 | )
60 |
61 | # Define server logic required to draw the scatter diagram
62 | server <- function(input, output) {
63 |
64 | output$scatterplot <- renderPlot({
65 | # generate bivariate normal data with the chosen correlation
66 | set.seed(input$seed)
67 | cor_matrix <- matrix(c(1, input$corr, input$corr, 1), 2)
68 | xy <- mvrnorm(input$size, c(0, 0), cor_matrix)
69 | plot(xy, type = "n", axes=FALSE, xlab="", ylab="",
70 | xlim=c(-3, 3), ylim=c(-3, 3))
71 | abline(h=0, v=0, col="gray80", lwd = 2)
72 | points(xy[,1], xy[,2], pch=20, cex=input$cex,
73 | col=rgb(0.45, 0.59, 0.84, alpha = input$alpha))
74 | })
75 | }
76 |
77 | # Run the application
78 | shinyApp(ui = ui, server = server)
79 |
80 |
--------------------------------------------------------------------------------
/apps/ch10-heights-data/README.md:
--------------------------------------------------------------------------------
1 | # Regression Scatterplot for Pearson's Height data
2 |
3 | This is a Shiny app that generates a scatter diagram to illustrate the regression method using Pearson's heights data set.
4 |
5 |
6 | ## Motivation
7 |
8 | The goal is to provide a visual display of some of the concepts from __Statistics, Chapter 10: Regression__ (pages 158-165):
9 |
10 | - Point of averages
11 | - SD line
12 | - Graph of averages
13 | - Regression line
14 | - Correlation coefficient
15 |
16 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
17 |
18 |
19 | ## Data
20 |
21 | This app uses Pearson's Height Data. The data is in the `pearson.csv` file in the `data/` folder, and contains 1078 rows and 2 columns:
22 |
23 | - `Father`: The father's height, in inches
24 | - `Son`: The height of the son, in inches
25 |
26 | The app uses the two variables: `Father` and `Son`.
27 |
28 | Original source: [http://www.math.uah.edu/stat/data/Pearson.csv](http://www.math.uah.edu/stat/data/Pearson.csv)
29 |
30 |
31 | ## How to run it?
32 |
33 | There are many ways to download the app and run it:
34 |
35 | ```R
36 | library(shiny)
37 |
38 | # Easiest way is to use runGitHub
39 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch10-heights-data")
40 | ```
41 |
--------------------------------------------------------------------------------
/apps/ch10-heights-data/app.R:
--------------------------------------------------------------------------------
1 | #
2 | # This is a Shiny web application. You can run the application by clicking
3 | # the 'Run App' button above.
4 | #
5 | # Find out more about building applications with Shiny here:
6 | #
7 | # http://shiny.rstudio.com/
8 | #
9 |
10 | library(shiny)
11 | source('helpers.R')
12 |
13 | # read the data set
14 | dat <- read.csv('../../data/pearson.csv')
15 |
16 | # Define UI for application that draws a scatter diagram
17 | ui <- fluidPage(
18 |
19 | # Application title
20 | titlePanel("Pearson's Height Data Set"),
21 |
22 | # Define the sidebar with one input
23 | sidebarPanel(
24 | selectInput("xvar", "X-axis variable",
25 | choices = colnames(dat), selected = 'Father'),
26 | selectInput("yvar", "Y-axis variable",
27 | choices = colnames(dat), selected = 'Son'),
28 | sliderInput("cex",
29 | label = "Size of points",
30 | min = 0, max = 3, value = 2, step = 0.1),
31 | checkboxInput('reg_line', label = strong('Regression line')),
32 | checkboxInput('point_avgs', label = strong('Point of Averages')),
33 | checkboxInput('sd_line', label = strong('SD line')),
34 | checkboxInput('sd_guides', label = strong('SD guides')),
35 | sliderInput("breaks",
36 | label = "Graph of Averages",
37 | min = 0, max = 10, value = 0, step = 1),
38 | hr(),
39 | helpText('Correlation:'),
40 | verbatimTextOutput("correlation")
41 | ),
42 |
43 | # Show a plot of the generated distribution
44 | mainPanel(
45 | plotOutput("datPlot")
46 | )
47 | )
48 |
49 |
50 | # Define server logic required to draw the scatter diagram
51 | server <- function(input, output) {
52 |
53 | # Correlation
54 | output$correlation <- renderPrint({
55 | cor(dat[,input$xvar], dat[,input$yvar])
56 | })
57 |
58 | # Fill in the spot we created for a plot
59 | output$datPlot <- renderPlot({
60 | # standard deviations
61 | sdx <- sd(dat[,input$xvar])
62 | sdy <- sd(dat[,input$yvar])
63 | avgx <- mean(dat[,input$xvar])
64 | avgy <- mean(dat[,input$yvar])
65 |
66 | # Render scatterplot
67 | plot(dat[,input$xvar], dat[,input$yvar],
68 | main = 'scatter diagram', type = 'n', axes = FALSE,
69 | xlab = paste(input$xvar, " height (in)"),
70 | ylab = paste(input$yvar, "height (in)"))
71 | box()
72 | axis(side = 1)
73 | axis(side = 2, las = 1)
74 | points(dat[,input$xvar], dat[,input$yvar],
75 | pch = 21, col = 'white', bg = '#777777aa',
76 | lwd = 2, cex = input$cex)
77 | # Point of Averages
78 | if (input$point_avgs) {
79 | points(avgx, avgy,
80 | pch = 21, col = 'white', bg = 'tomato',
81 | lwd = 3, cex = 3)
82 | }
83 | # SD line
84 | if (input$sd_line) {
85 | cor_xy <- cor(dat[,input$xvar], dat[,input$yvar])
86 | if (cor_xy >= 0) {
87 | sd_line <- line_equation(avgx - sdx, avgy - sdy, avgx + sdx, avgy + sdy)
88 | abline(a = sd_line$intercept, b = sd_line$slope,
89 | lwd = 4, lty = 2, col = 'orange')
90 | } else {
91 | sd_line <- line_equation(avgx + sdx, avgy - sdy, avgx - sdx, avgy + sdy)
92 | abline(a = sd_line$intercept, b = sd_line$slope,
93 | lwd = 4, lty = 2, col = 'orange')
94 | }
95 | }
96 | # SD guides
97 | if (input$sd_guides) {
98 | abline(v = c(avgx - sdx, avgx + sdx),
99 | h = c(avgy - sdy, avgy + sdy),
100 | lty = 1, lwd = 3, col = '#FFA600aa')
101 | }
102 | # Graph of averages
103 | if (input$breaks > 1) {
104 | graph_avgs <- averages(dat[,input$xvar], dat[,input$yvar],
105 | breaks = input$breaks)
106 | points(graph_avgs$x, graph_avgs$y, pch = "+",
107 | col = '#ff6700', cex = 3)
108 | }
109 | # Regression line
110 | if (input$reg_line) {
111 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
112 | abline(reg = reg, lwd = 4, col = '#4878DF')
113 | }
114 |
115 | }, height = 650, width = 650)
116 | }
117 |
118 |
119 | # Run the application
120 | shinyApp(ui = ui, server = server)
121 |
122 |
--------------------------------------------------------------------------------
/apps/ch10-heights-data/helpers.R:
--------------------------------------------------------------------------------
1 | # computes the slope and intercept terms of a line between two points
2 | line_equation <- function(x1, y1, x2, y2) {
3 | slope <- (y2 - y1) / (x2 - x1)
4 | intercept <- y1 - slope*x1
5 | list(intercept = intercept, slope = slope)
6 | }
7 |
8 | # computes x,y averages depending on a given number of intervals (x-axis)
9 | # (to be used for showing graph of averages)
10 | averages <- function(x, y, breaks = 5) {
11 | x_cut <- cut(x, breaks = breaks)
12 | y_averages <- as.vector(tapply(y, x_cut, mean))
13 | x_boundaries <- gsub('\\(', '', levels(x_cut))
14 | x_boundaries <- gsub('\\]', '', x_boundaries)
15 | x_boundaries <- strsplit(x_boundaries, ',')
16 | x1 <- as.numeric(sapply(x_boundaries, function(u) u[1]))
17 | x2 <- as.numeric(sapply(x_boundaries, function(u) u[2]))
18 | x_midpoints <- x1 + (x2 - x1) / 2
19 | list(x = x_midpoints, y = y_averages)
20 | }
21 |
--------------------------------------------------------------------------------
/apps/ch11-regression-residuals/README.md:
--------------------------------------------------------------------------------
1 | # Regression Residuals
2 |
3 | This is a Shiny app that generates two graphs: 1) a scatterplot with a
4 | regression line, and 2) a residual plot from the fitted regression line.
5 |
6 |
7 | ## Motivation
8 |
9 | The goal is to illustrate the concepts of homoscedastic and heteroscedastic
10 | residuals described in
11 | __Statistics, Chapter 11: The R.M.S. Error for Regression__ (pages 180-201).
12 |
13 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger
14 | Purves (2007). Fourth Edition. Norton & Company.
15 |
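The r.m.s. error the app reports can be sketched in a few lines of base R (this uses R's built-in `cars` data as a stand-in for the NBA data set):

```r
# r.m.s. error of a regression: the square root of the mean squared residual
# (built-in cars data used here as a stand-in for nba_players.csv)
reg <- lm(dist ~ speed, data = cars)
rms_error <- sqrt(mean(residuals(reg)^2))
```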
16 |
17 | ## Data
18 |
19 | This app uses the data from NBA basketball players in the 2015-2016 season.
20 | The csv file `nba_players.csv` is in the `data/` folder of the GitHub repository.
21 |
22 |
23 | ## How to run it?
24 |
25 |
26 | ```R
27 | library(shiny)
28 |
29 | # Easiest way is to use runGitHub
30 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch11-regression-residuals")
31 | ```
32 |
--------------------------------------------------------------------------------
/apps/ch11-regression-residuals/app.R:
--------------------------------------------------------------------------------
1 | # Title: Regression residuals
2 | # Description: this app uses NBA players data to show residual plots
3 | # Author: Gaston Sanchez
4 |
5 | library(shiny)
6 |
7 | # read the data set
8 | nba <- read.csv('../../data/nba_players.csv')
9 |
10 | # quantitative variables
11 | quantitative <- c(
12 | "height","weight","experience","age","games","games_started",
13 | "minutes_played","field_goals","field_goal_attempts","field_goal_percent",
14 | "points3","points3_attempts","points3_percent","points2","points2_attempts",
15 | "points2_percent","effective_field_goal_percent","free_throws",
16 | "free_throw_attempts","free_throw_percent","offensive_rebounds",
17 | "defensive_rebounds","total_rebounds","assists","steals","blocks",
18 | "turnovers","fouls","points")
19 |
20 | # select just quantitative variables
21 | dat <- nba[ ,quantitative]
22 |
23 | # Define UI for application that draws the scatter and residual plots
24 | ui <- fluidPage(
25 | # Give the page a title
26 | titlePanel("NBA Players"),
27 |
28 | # Generate a row with a sidebar
29 | sidebarLayout(
30 |
31 | # Define the sidebar with one input
32 | sidebarPanel(
33 | selectInput("xvar", "X-axis variable",
34 | choices = colnames(dat), selected = 'height'),
35 | selectInput("yvar", "Y-axis variable",
36 | choices = colnames(dat), selected = 'weight'),
37 | hr(),
38 | helpText('Correlation:'),
39 | verbatimTextOutput("correlation"),
40 | helpText('r.m.s. error:'),
41 | verbatimTextOutput("rms_error")
42 | ),
43 |
44 | # Create spots for the plots
45 | mainPanel(
46 | plotOutput("datPlot"),
47 | plotOutput("residualPlot")
48 | )
49 |
50 | )
51 | )
52 |
53 |
54 | # Define server logic required to draw the plots
55 | server <- function(input, output) {
56 |
57 | # Correlation
58 | output$correlation <- renderPrint({
59 | cor(dat[,input$xvar], dat[,input$yvar])
60 | })
61 |
62 | # r.m.s. error
63 | output$rms_error <- renderPrint({
64 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
65 | sqrt(mean((reg$residuals)^2))
66 | })
67 |
68 | # Fill in the spot we created for a plot
69 | output$datPlot <- renderPlot({
70 |
71 | # Render a scatter diagram
72 | plot(dat[,input$xvar], dat[,input$yvar],
73 | main = 'scatter diagram', type = 'n', axes = FALSE,
74 | xlab = input$xvar, ylab = input$yvar)
75 | box()
76 | axis(side = 1)
77 | axis(side = 2, las = 1)
78 | points(dat[,input$xvar], dat[,input$yvar],
79 | pch = 21, col = 'white', bg = '#4878DFaa',
80 | lwd = 2, cex = 2)
81 | # regression line
82 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
83 | abline(reg = reg, lwd = 3, col = '#e35a6d')
84 |
85 | })
86 |
87 | # residual plot
88 | output$residualPlot <- renderPlot({
89 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
90 | # Render scatterplot
91 | plot(dat[,input$xvar], reg$residuals, las = 1,
92 | main = 'Residual plot', xlab = input$xvar,
93 | ylab = 'residuals', col = '#ACB6F1', type = 'n')
94 | abline(h = 0, col = 'gray70', lwd = 2)
95 | points(dat[,input$xvar], reg$residuals,
96 | pch = 20, col = '#888888aa', cex = 2)
97 | })
98 | }
99 |
100 |
101 |
102 | # Run the application
103 | shinyApp(ui = ui, server = server)
104 |
105 |
--------------------------------------------------------------------------------
/apps/ch11-regression-strips/README.md:
--------------------------------------------------------------------------------
1 | # Vertical Strips for Pearson's Height data
2 |
3 | This is a Shiny app that generates a scatter diagram to illustrate the
4 | distribution of values on the y-axis within a vertical strip.
5 |
6 |
7 | ## Motivation
8 |
9 | The goal is to provide a visual display of some of the concepts from
10 | __Statistics, Chapter 11: The R.M.S. Error for Regression__ (pages 180-201):
11 |
12 | - Looking at vertical strips
13 | - Using the normal curve inside a vertical strip
14 |
15 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger
16 | Purves (2007). Fourth Edition. Norton & Company.
17 |
18 |
19 | ## Data
20 |
21 | This app uses Pearson and Lee's Height Data as described in
22 | [Pearson's Height Data](http://www.math.uah.edu/stat/data/Pearson.csv).
23 | The data is in the `pearson.csv` file, available in the `data/` folder of
24 | the GitHub repository.
25 |
26 |
27 | ## How to run it?
28 |
29 |
30 | ```R
31 | library(shiny)
32 |
33 | # Easiest way is to use runGitHub
34 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch11-regression-strips")
35 | ```
36 |
--------------------------------------------------------------------------------
/apps/ch11-regression-strips/app.R:
--------------------------------------------------------------------------------
1 | # Title: Spread within vertical strips in regression
2 | # Description: this app uses Pearson's height data
3 | # Author: Gaston Sanchez
4 |
5 | library(shiny)
6 | source('helpers.R')
7 |
8 | # read the data set
9 | dat <- read.csv('../../data/pearson.csv')
10 |
11 | # Define UI for application that draws a histogram
12 | ui <- fluidPage(
13 |
14 | # Application title
15 | titlePanel("Pearson's Height Data Set"),
16 |
17 | # Define the sidebar with one input
18 | sidebarPanel(
19 | selectInput("xvar", "X-axis variable",
20 | choices = colnames(dat), selected = 'Father'),
21 | selectInput("yvar", "Y-axis variable",
22 | choices = colnames(dat), selected = 'Son'),
23 | checkboxInput('reg_line', label = strong('Regression line')),
24 | sliderInput("cex",
25 | label = "Size of points",
26 | min = 0, max = 3, value = 1.5, step = 0.1),
27 | #checkboxInput('point_avgs', label = strong('Point of Averages')),
28 | #checkboxInput('sd_line', label = strong('SD line')),
29 | #checkboxInput('sd_guides', label = strong('SD guides')),
30 | sliderInput("center",
31 | label = "x location",
32 | min = 60,
33 | max = 76,
34 | value = 70, step = 0.25),
35 | sliderInput("width",
36 | label = "width",
37 | min = 0,
38 | max = 4,
39 | value = 0, step = 0.1),
40 | hr(),
41 | helpText('Correlation:'),
42 | verbatimTextOutput("correlation")
43 | ),
44 |
45 | # Show a plot of the generated distribution
46 | mainPanel(
47 | plotOutput("datPlot"),
48 | plotOutput("histogram")
49 | )
50 | )
51 |
52 |
53 | # Define server logic required to draw a histogram
54 | server <- function(input, output) {
55 |
56 | # Correlation
57 | output$correlation <- renderPrint({
58 | cor(dat[,input$xvar], dat[,input$yvar])
59 | })
60 |
61 | # Fill in the spot we created for a plot
62 | output$datPlot <- renderPlot({
63 | # Render scatterplot
64 | plot(dat[,input$xvar], dat[,input$yvar],
65 | main = 'scatter diagram', type = 'n', axes = FALSE,
66 | xlab = input$xvar, ylab = input$yvar)
67 | box()
68 | axis(side = 1)
69 | axis(side = 2, las = 1)
70 | points(dat[,input$xvar], dat[,input$yvar],
71 | pch = 21, col = 'white', bg = '#777777aa',
72 | lwd = 2, cex = input$cex)
73 | # vertical strips
74 | abline(v = c(input$center - input$width, input$center + input$width),
75 | lty = 1, lwd = 3, col = '#5A6DE3')
76 | # Regression line
77 | if (input$reg_line) {
78 | reg <- lm(dat[,input$yvar] ~ dat[,input$xvar])
79 | abline(reg = reg, lwd = 3, col = '#e35a6d')
80 | }
81 | })
82 |
83 | # histogram
84 | output$histogram <- renderPlot({
85 | xmin <- input$center - input$width
86 | xmax <- input$center + input$width
87 | child <- dat[dat[,input$xvar] >= xmin & dat[,input$xvar] <= xmax, input$yvar]
88 | hist(child, main = '', col = '#ACB6FF', las = 1)
89 | })
90 |
91 | }
92 |
93 | # Run the application
94 | shinyApp(ui = ui, server = server)
95 |
96 |
--------------------------------------------------------------------------------
/apps/ch11-regression-strips/helpers.R:
--------------------------------------------------------------------------------
1 | line_equation <- function(x1, y1, x2, y2) {
2 | slope <- (y2 - y1) / (x2 - x1)
3 | intercept <- y1 - slope*x1
4 | list(intercept = intercept, slope = slope)
5 | }
6 |
7 |
8 | averages <- function(x, y, breaks = 5) {
9 | x_cut <- cut(x, breaks = breaks)
10 | y_averages <- as.vector(tapply(y, x_cut, mean))
11 | x_boundaries <- gsub('\\(', '', levels(x_cut))
12 | x_boundaries <- gsub('\\]', '', x_boundaries)
13 | x_boundaries <- strsplit(x_boundaries, ',')
14 | x1 <- as.numeric(sapply(x_boundaries, function(u) u[1]))
15 | x2 <- as.numeric(sapply(x_boundaries, function(u) u[2]))
16 | x_midpoints <- x1 + (x2 - x1) / 2
17 | list(x = x_midpoints, y = y_averages)
18 | }
19 |
20 |
--------------------------------------------------------------------------------
/apps/ch16-chance-error/README.md:
--------------------------------------------------------------------------------
1 | # Chance Error
2 |
3 | This is a Shiny app that illustrates the concept of chance error when simulating tossing a coin a given number of times.
4 |
5 |
6 | ## Motivation
7 |
8 | The goal is to provide a visual display motivated by John Kerrich's coin-tossing experiment, described in __Statistics, Chapter 16: The Law of Averages__.
9 |
10 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
11 |
12 |
13 | ## Data
14 |
15 | The app simulates tossing a coin with the random binomial generator function `rbinom()`. The input parameters are the number of tosses and, optionally, the probability of heads.
16 |
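The simulation the app performs can be sketched as follows (a minimal version, with the seed and number of tosses fixed instead of taken from the UI):

```r
# simulate 1000 tosses of a fair coin (1 = heads, 0 = tails)
set.seed(12345)
tosses <- 1000
flips <- rbinom(n = tosses, size = 1, prob = 0.5)
num_heads <- cumsum(flips)                     # running count of heads
chance_error <- num_heads - 0.5 * (1:tosses)   # heads minus half the tosses
prop_heads <- num_heads / (1:tosses)           # settles near 0.5 as tosses grow
```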
17 |
18 | ## Plot
19 |
20 | There are two options for the displayed plot:
21 |
22 | 1. Shows the chance error (i.e. number of heads minus half the number of tosses) on the y-axis, and the number of tosses on the x-axis.
23 | 2. Shows the proportion of heads (whose distance from the chance of heads is the percent error) on the y-axis, and the number of tosses on the x-axis.
24 |
25 |
26 | ## How to run it?
27 |
28 | ```R
29 | library(shiny)
30 |
31 | # Easiest way is to use runGitHub
32 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch16-chance-error")
33 | ```
34 |
--------------------------------------------------------------------------------
/apps/ch16-chance-error/app.R:
--------------------------------------------------------------------------------
1 | # Title: Chance Error and Percent Error
2 | # Description: Chance error when tossing a coin (based on John Kerrich's)
3 | # Chapter 16: The Law of Averages, p 275-278
4 | # Author: Gaston Sanchez
5 |
6 | library(shiny)
7 |
8 | # Define UI for the coin tossing application
9 | ui <- fluidPage(
10 |
11 | # Give the page a title
12 | titlePanel("Coin Tossing Experiment"),
13 |
14 | # Generate a row with a sidebar
15 | sidebarLayout(
16 |
17 | # Define the sidebar with one input
18 | sidebarPanel(
19 | numericInput("seed", label = "Random Seed:", 12345,
20 | min = 10000, max = 50000, step = 1),
21 | sliderInput("chance", label = "Chance of heads:",
22 | min = 0, max = 1, value = 0.5, step = 0.01),
23 | sliderInput("tosses", label = "Number of tosses:",
24 | min = 100, max = 10000, value = 3000, step = 50),
25 | radioButtons("error", label = "Display",
26 | choices = list("Chance error" = 1,
27 | "Percent error" = 2),
28 | selected = 2),
29 | hr(),
30 | helpText('Total number of heads:'),
31 | verbatimTextOutput("num_heads"),
33 |       helpText('Percentage of heads:'),
33 | verbatimTextOutput("prop_heads")
34 | ),
35 |
36 | # Create a spot for the barplot
37 | mainPanel(
38 | plotOutput("chancePlot")
39 | )
40 | )
41 | )
42 |
43 |
44 | # Define server logic required to draw a histogram
45 | server <- function(input, output) {
46 |
47 | seed <- reactive({
48 | input$seed
49 | })
50 | tosses <- reactive({
51 | input$tosses
52 | })
53 | chance <- reactive({
54 | input$chance
55 | })
56 |
57 | # Number of heads
58 | output$num_heads <- renderPrint({
59 | set.seed(seed())
60 | flips <- rbinom(n = tosses(), 1, prob = chance())
61 | sum(flips)
62 | })
63 |
64 |   # Percentage of heads
65 | output$prop_heads <- renderPrint({
66 | set.seed(seed())
67 | flips <- rbinom(n = tosses(), 1, prob = chance())
68 | round(100 * sum(flips) / tosses(), 2)
69 | })
70 |
71 | # Fill in the spot we created for a plot
72 | output$chancePlot <- renderPlot({
73 | set.seed(input$seed)
74 | tosses <- input$tosses
75 | flips <- rbinom(n = tosses, 1, prob = chance())
76 | num_heads <- cumsum(flips)
77 | prop_heads <- (num_heads / 1:tosses)
78 | num_tosses <- 1:tosses
79 |
80 | # Render a barplot
81 | difference <- num_heads[num_tosses] - (chance() * num_tosses)
82 | proportion <- prop_heads[num_tosses]
83 | if (input$error == 1) {
84 | plot(num_tosses, difference,
85 | col = '#627fe2', type = 'l', lwd = 2,
86 | xlab = "Number of tosses",
87 | ylab = '# of heads - 1/2 # of tosses',
88 | axes = FALSE, main = 'Chance Error: # successes - # expected')
89 | abline(h = 0, col = '#88888855', lwd = 2, lty = 2)
90 | axis(side = 2, las = 1)
91 | } else {
92 | plot(num_tosses, proportion, ylim = c(0, 1),
93 | col = '#627fe2', type = 'l', lwd = 2,
94 | xlab = 'Number of tosses',
95 | ylab = 'Proportion of heads',
96 | axes = FALSE, main = 'Percent Error: % successes - % expected')
97 | abline(h = chance(), col = '#88888855', lwd = 2, lty = 2)
98 | axis(side = 2, las = 1, at = seq(0, 1, 0.1))
99 | }
100 | axis(side = 1)
101 | })
102 |
103 | }
104 |
105 | # Run the application
106 | shinyApp(ui = ui, server = server)
107 |
108 |
--------------------------------------------------------------------------------
/apps/ch17-demere-games/README.md:
--------------------------------------------------------------------------------
1 | # Expected value and Standard Error with De Mere's Games
2 |
3 | This is a Shiny app that illustrates the concept of Expected Value and Standard Error
4 | when simulating De Mere's games (100 times by default).
5 |
6 |
7 | ## Motivation
8 |
9 | The goal is to provide a visual display for the concepts in __Statistics, Chapter 17: The Expected Value and Standard Error.__
10 |
11 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
12 |
13 |
14 | ## Data
15 |
16 | The app allows you to simulate two main scenarios:
17 |
18 | 1. __Rolling a fair die 4 times.__ This is actually done by drawing 4 tickets out of a box with six tickets.
19 |    The structure of the box consists of one ticket `1`, and five tickets `0`.
20 | 2. __Rolling a pair of dice 24 times.__ This is actually done by drawing 24 tickets out of a box with 36 tickets.
21 |    The structure of the box consists of 1 ticket `1`, and 35 tickets `0`.
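As a sketch, the first box model (one game of rolling a die 4 times) amounts to drawing with replacement from a box of six tickets:

```R
# Box model for De Mere's first game: 1 ticket '1' (ace), 5 tickets '0'
set.seed(12345)
box <- c(1, rep(0, 5))                          # six tickets total
draws <- sample(box, size = 4, replace = TRUE)  # four rolls of a die
sum(draws)                                      # number of aces in one game
```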
22 |
23 |
24 | ## Plots
25 |
26 | There are three displayed graphs.
27 |
28 | 1. A probability distribution (theoretical probabilities for the number of tickets `1`).
29 | 2. A Pareto chart (empirical cumulative distribution) with the proportion of tickets `1` when
30 | playing the game a given number of times.
31 | 3. A line chart with the empirical (net) gain when playing the game a given number of times.
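The net gain in the third chart can be sketched as follows, under the assumption (for illustration) that each game wins $1 when at least one ticket `1` is drawn and loses $1 otherwise:

```R
# Empirical net gain across repeated games of De Mere's first bet
set.seed(12345)
box <- c(1, rep(0, 5))
reps <- 100
aces <- replicate(reps, sum(sample(box, size = 4, replace = TRUE)))
results <- ifelse(aces > 0, 1, -1)  # +1 for a win, -1 for a loss
gain <- cumsum(results)             # running net gain across games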
32 |
33 |
34 | ## How to run it?
35 |
36 | ```R
37 | library(shiny)
38 |
39 | # Easiest way is to use runGitHub
40 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch17-demere-games")
41 | ```
42 |
--------------------------------------------------------------------------------
/apps/ch17-demere-games/app.R:
--------------------------------------------------------------------------------
1 | # Title: More Expected Value and Standard Error
2 | # Description: Simulation of De Mere's rolling dice games using
3 | # a box model for the number of "aces" (ticket 1).
4 | # Author: Gaston Sanchez
5 |
6 | library(shiny)
7 |
8 | # Define UI for application that draws a histogram
9 | ui <- fluidPage(
10 |
11 | # Give the page a title
12 | titlePanel("De Mere's games"),
13 |
14 | # Generate a row with a sidebar
15 | sidebarLayout(
16 |
17 | # Define the sidebar with one input
18 | sidebarPanel(
19 | fluidRow(
20 | column(5,
21 | numericInput("tickets1", "# Tickets 1", 1,
22 | min = 1, max = 35, step = 1)),
23 | column(5,
24 | numericInput("tickets0", "# Tickets 0", 5,
25 | min = 1, max = 35, step = 1))
26 | ),
27 | helpText('Avg of box, and SD of box'),
28 | verbatimTextOutput("avg_sd_box"),
29 | numericInput("draws", label = "Number of Draws:", value = 4,
30 | min = 1, max = 100, step = 1),
31 | helpText('Expected Value and SE'),
32 | verbatimTextOutput("ev_se"),
33 | hr(),
34 | sliderInput("reps", label = "Number of games:",
35 | min = 1, max = 5000, value = 100, step = 1),
36 | helpText('Actual gain'),
37 | verbatimTextOutput("gain"),
38 | numericInput("seed", label = "Random Seed:", 12345,
39 | min = 10000, max = 50000, step = 1)
40 | ),
41 |
42 | # Create a spot for the barplot
43 | mainPanel(
44 | tabsetPanel(type = "tabs",
45 | tabPanel("Sum", plotOutput("sumPlot")),
46 | tabPanel("Pareto", plotOutput("paretoPlot")),
47 | tabPanel("Games", plotOutput("gamesPlot"))
48 | )
49 | )
50 | )
51 | )
52 |
53 |
54 | # Define server logic required to draw a histogram
55 | server <- function(input, output) {
56 |
57 | tickets <- reactive({
58 | tickets <- c(rep(1, input$tickets1), rep(0, input$tickets0))
59 | })
60 |
61 | avg_box <- reactive({
62 | mean(tickets())
63 | })
64 |
65 | sd_box <- reactive({
66 | total <- input$tickets1 + input$tickets0
67 | sqrt((input$tickets1 / total) * (input$tickets0 / total))
68 | })
69 |
70 | sum_draws <- reactive({
71 | set.seed(input$seed)
72 | samples <- 1:input$reps
73 | for (i in 1:input$reps) {
74 | samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE))
75 | }
76 | samples
77 | })
78 |
79 | # Average and SD of box
80 | output$avg_sd_box <- renderPrint({
81 | cat(avg_box(), ", ", sd_box(), sep = '')
82 | })
83 |
84 | # Expected Value, and Standard Error
85 | output$ev_se <- renderPrint({
86 | ev = input$draws * avg_box()
87 | se = sqrt(input$draws) * sd_box()
88 | cat(ev, ", ", se, sep = '')
89 | })
90 |
91 | # Probability Histogram
92 | output$sumPlot <- renderPlot({
93 | # Render a barplot
94 | total_tickets <- input$tickets1 + input$tickets0
95 | prob_ticket1 <- input$tickets1 / total_tickets
96 | probabilities <- dbinom(0:input$draws, size = input$draws, prob_ticket1)
97 | barplot(round(probabilities, 4), border = NA, las = 1,
98 | names.arg = 0:input$draws,
99 | xlab = paste("Number of tickets 1"),
100 | ylab = 'Probability',
101 | main = paste("Probability Distribution\n",
102 |                          "(# tickets 1)"))
103 | abline(h = 0.5, col = '#EC5B5B99', lty = 2, lwd = 1.4)
104 | })
105 |
106 | # Pareto chart: cumulative percentage of draws
107 | output$paretoPlot <- renderPlot({
108 | # Render a barplot
109 | freqs_draws <- table(sum_draws()) / input$reps
110 | freq_aux <- barplot(freqs_draws, plot = FALSE)
111 | barplot(freqs_draws,
112 | ylim = c(0, 1.1),
113 | border = NA, las = 1,
114 | xlab = paste('Number of tickets 1 in', input$reps, 'games'),
115 | ylab = 'Percentage',
116 | main = paste("Empirical Cumulative Relative Frequency\n",
117 | "(at least one ticket 1)"))
118 | abline(h = 0.5, col = '#EC5B5B99', lty = 2, lwd = 1.4)
119 | lines(freq_aux[-1], cumsum(freqs_draws[-1]), lwd = 3, col = "gray60")
120 | points(freq_aux[-1], cumsum(freqs_draws[-1]), pch=19, col="gray30")
121 | text(freq_aux[-1], cumsum(freqs_draws[-1]),
122 | round(cumsum(freqs_draws[-1]), 3), pos = 3)
123 | })
124 |
125 | # Plot with gains
126 | output$gamesPlot <- renderPlot({
127 | results <- rep(-1, input$reps)
128 | results[sum_draws() > 0] <- 1
129 | plot(1:input$reps, cumsum(results), type = "n", axes = FALSE,
130 | xlab = paste('Number of tickets 1 in', input$reps, 'games'),
131 | ylab = "Gained amount",
132 | main = "Empirical Gain")
133 | abline(h = 0, col = '#EC5B5B99', lty = 2, lwd = 1.4)
134 | axis(side = 1)
135 | axis(side = 2, las = 1, pos = 0)
136 | lines(1:input$reps, cumsum(results), lwd = 1.5)
137 | })
138 |
139 | # actual gain
140 | output$gain <- renderPrint({
141 | results <- rep(-1, input$reps)
142 | results[sum_draws() > 0] <- 1
143 | sum(results)
144 | })
145 | }
146 |
147 | # Run the application
148 | shinyApp(ui = ui, server = server)
149 |
150 |
--------------------------------------------------------------------------------
/apps/ch17-expected-value-std-error/README.md:
--------------------------------------------------------------------------------
1 | # Expected value and Standard Error
2 |
3 | This is a Shiny app that illustrates the concept of Expected Value and Standard Error when simulating rolling a die (5 times by default).
4 |
5 |
6 | ## Motivation
7 |
8 | The goal is to provide a visual display for the concepts in __Statistics, Chapter 17: The Expected Value and Standard Error.__
9 |
10 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
11 |
12 |
13 | ## Data
14 |
15 | The app simulates rolling a die (5 times by default).
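The quantities the app reports follow the textbook box-model formulas: the expected value of the sum is the number of draws times the average of the box, and the standard error is the square root of the number of draws times the SD of the box. A sketch for 5 die rolls:

```R
# EV and SE for the sum of 5 die rolls (box = faces of one die)
die <- 1:6
avg_box <- mean(die)                     # 3.5
sd_box <- sqrt(mean((die - avg_box)^2))  # population SD of the box
draws <- 5
ev <- draws * avg_box                    # expected value of the sum
se <- sqrt(draws) * sd_box               # standard error of the sum
```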
16 |
17 |
18 | ## Plot
19 |
20 | A bar chart with the frequencies of the sum of draws is displayed.
21 |
22 |
23 | ## How to run it?
24 |
25 | ```R
26 | library(shiny)
27 |
28 | # Easiest way is to use runGitHub
29 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch17-expected-value-std-error")
30 | ```
31 |
--------------------------------------------------------------------------------
/apps/ch17-expected-value-std-error/app.R:
--------------------------------------------------------------------------------
1 | # Title: Expected Value and Standard Error
2 | # Description: EV and SE when rolling a die 5 times
3 | # Chapter 17: The EV and SE, p 288-296
4 | # Author: Gaston Sanchez
5 |
6 | library(shiny)
7 | source("helpers.R")
8 |
9 | # Define UI for application that draws a histogram
10 | ui <- fluidPage(
11 |
12 | # Give the page a title
13 | titlePanel("Rolling a Die"),
14 |
15 | # Generate a row with a sidebar
16 | sidebarLayout(
17 |
18 | # Define the sidebar with one input
19 | sidebarPanel(
20 | numericInput("dice", label = "Number of dice:", 5,
21 | min = 1, max = 10, step = 1),
22 | numericInput("seed", label = "Random Seed:", 12330,
23 | min = 10000, max = 50000, step = 1),
24 | sliderInput("reps", label = "Number of repetitions:",
25 | min = 100, max = 10000, value = 100, step= 10),
26 | hr(),
27 | helpText('Average of sums:'),
28 | verbatimTextOutput("num_heads"),
29 | helpText('SD of sums:'),
30 | verbatimTextOutput("prop_heads")
31 | ),
32 |
33 | # Create a spot for the barplot
34 | mainPanel(
35 | plotOutput("chancePlot")
36 | )
37 | )
38 | )
39 |
40 |
41 | # Define server logic required to draw a histogram
42 | server <- function(input, output) {
43 |
44 | # Empirical average of sum of draws
45 | output$num_heads <- renderPrint({
46 | set.seed(input$seed)
47 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
48 | # avg of sums
49 | mean(total_points)
50 | })
51 |
52 | # Empirical SD of sum of draws
53 | output$prop_heads <- renderPrint({
54 | set.seed(input$seed)
55 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
56 |     # SD of sums
57 | sd(total_points) * sqrt((input$reps - 1)/input$reps)
58 | })
59 |
60 | # Fill in the spot we created for a plot
61 | output$chancePlot <- renderPlot({
62 | set.seed(input$seed)
63 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
64 | # put in relative terms
65 | prop_points <- 100 * table(total_points) / input$reps
66 | ymax <- find_ymax(max(prop_points), 2)
67 | # Render a barplot
68 | barplot(prop_points, las = 1, border = "gray40",
69 | space = 0, ylim = c(0, ymax),
70 | main = sprintf("%s Repetitions", input$reps))
71 | })
72 | }
73 |
74 | # Run the application
75 | shinyApp(ui = ui, server = server)
76 |
77 |
--------------------------------------------------------------------------------
/apps/ch17-expected-value-std-error/helpers.R:
--------------------------------------------------------------------------------
1 | # helper functions to simulate rolling a die
2 | # and adding the number of spots
3 |
4 | # roll one die
5 | roll_die <- function(times = 1) {
6 | die <- 1:6
7 | sample(die, times, replace = TRUE)
8 | }
9 |
10 |
11 | # roll a pair of dice
12 | roll_pair <- function() {
13 | roll_die(2)
14 | }
15 |
16 | # sum of spots
17 | sum_rolls <- function(times = 1) {
18 | sum(roll_die(times))
19 | }
20 |
21 | # product of numbers
22 | prod_rolls <- function(times = 1) {
23 | prod(roll_die(times))
24 | }
25 |
26 | # check whether 'x' is multiple of 'num'
27 | is_multiple <- function(x, num) {
28 | x %% num == 0
29 | }
30 |
31 | # find the y-max value for ylim in barplot()
32 | find_ymax <- function(x, num) {
33 | if (is_multiple(x, num)) {
34 | return(max(x))
35 | } else {
36 | return(num * ((x %/% num) + 1))
37 | }
38 | }
39 |
--------------------------------------------------------------------------------
/apps/ch18-coin-tossing/README.md:
--------------------------------------------------------------------------------
1 | # Tossing Coins
2 |
3 | This is a Shiny app that generates a probability histogram when tossing
4 | a coin a specified number of times.
5 |
6 |
7 | ## Motivation
8 |
9 | The goal is to provide a visual display similar to the probability histograms
10 | in chapter 18 of "Statistics".
11 |
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
13 |
14 |
15 | ## Data
16 |
17 | The app computes the exact probabilities for the number of heads when tossing a coin a specified number of times.
18 | The input parameters are the number of tosses, and the chance of heads.
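These are the exact binomial probabilities, which can be sketched with `dbinom()` (values shown here use illustrative inputs):

```R
# Exact probabilities for the number of heads in n tosses
tosses <- 100
chance <- 0.5
probs <- dbinom(0:tosses, size = tosses, prob = chance)  # P(0..100 heads)
ev <- tosses * chance                                    # expected value
se <- sqrt(tosses * chance * (1 - chance))               # standard error
```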
19 |
20 |
21 | ## Plot
22 |
23 | The produced plot is a probability histogram.
24 |
25 |
26 | ## How to run it?
27 |
28 | ```R
29 | library(shiny)
30 |
31 | # Easiest way is to use runGitHub
32 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-coin-tossing")
33 | ```
34 |
--------------------------------------------------------------------------------
/apps/ch18-coin-tossing/app.R:
--------------------------------------------------------------------------------
1 | # Title: Probability histograms
2 | # Description: Probability histograms for the number of heads
3 | # in "n" tosses of a coin
4 | # Chapter 18: Normal Approx, p 316
5 | # Author: Gaston Sanchez
6 |
7 | library(shiny)
8 |
9 | # Define UI for application that draws a histogram
10 | ui <- fluidPage(
11 |
12 | # Give the page a title
13 | titlePanel("Tossing Coins"),
14 |
15 | # Generate a row with a sidebar
16 | sidebarLayout(
17 |
18 | # Define the sidebar with one input
19 | sidebarPanel(
20 | sliderInput("tosses", label = "Number of tosses:",
21 | min = 1, max = 500, value = 100, step = 1),
22 | sliderInput("chance", label = "Chance of heads",
23 | min = 0, max = 1, value = 0.5, step= 0.05),
24 | hr(),
25 | helpText('Expected Value:'),
26 | verbatimTextOutput("exp_value"),
27 | helpText('Standard Error'),
28 | verbatimTextOutput("std_error")
29 | ),
30 |
31 | # Create a spot for the barplot
32 | mainPanel(
33 | plotOutput("chancePlot")
34 | )
35 | )
36 | )
37 |
38 |
39 | # Define server logic required to draw a histogram
40 | server <- function(input, output) {
41 |
42 | # Expected Value
43 | output$exp_value <- renderPrint({
44 | input$tosses * input$chance
45 | })
46 |
47 | # Standard Error
48 | output$std_error <- renderPrint({
49 | sqrt(input$tosses * input$chance * (1 - input$chance))
50 | })
51 |
52 | # Fill in the spot we created for a plot
53 | output$chancePlot <- renderPlot({
54 | probs <- 100 * dbinom(0:input$tosses,
55 | size = input$tosses,
56 | prob = input$chance)
57 |
58 | exp_value <- input$tosses * input$chance
59 | std_error <- sqrt(input$tosses * input$chance * (1 - input$chance))
60 |
61 | below3se <- (exp_value - 3 * std_error)
62 | above3se <- (exp_value + 3 * std_error)
63 |
64 | from <- floor(below3se) + 1
65 | to <- ceiling(above3se) + 1
66 |
67 | if (input$tosses >= 10 & from > 0) {
68 | xpos <- barplot(probs[from:to], plot = FALSE)
69 | # Render probability histogram as a barplot
70 | op = par(mar = c(6.5, 4.5, 4, 2))
71 | barplot(probs[from:to], axes = FALSE, col = "gray70",
72 | names.arg = (from-1):(to-1), border = NA,
73 | ylim = c(0, ceiling(max(probs))),
74 | ylab = "probability (%)",
75 | main = sprintf("Probability Histogram\n %s Tosses",
76 | input$tosses))
77 | axis(side = 2, las = 1)
78 | axis(side = 1, line = 3,
79 | at = seq(xpos[1], xpos[length(xpos)], length.out = 7),
80 | labels = seq(-3, 3, 1))
81 | mtext("Standard Units", side = 1, line = 5.5)
82 | par(op)
83 | } else {
84 | barplot(probs, axes = FALSE, col = "gray70",
85 | names.arg = 0:input$tosses, border = NA,
86 | ylim = c(0, ceiling(max(probs))),
87 | ylab = "probability (%)",
88 | main = sprintf("Probability Histogram\n %s Tosses",
89 | input$tosses))
90 | axis(side = 2, las = 1)
91 | }
92 | })
93 | }
94 |
95 | # Run the application
96 | shinyApp(ui = ui, server = server)
97 |
98 |
--------------------------------------------------------------------------------
/apps/ch18-roll-dice-product/README.md:
--------------------------------------------------------------------------------
1 | # Rolling Dice: Product of Points
2 |
3 | This is a Shiny app that generates empirical histograms when simulating
4 | rolling dice and finding the total product of spots.
5 |
6 |
7 | ## Motivation
8 |
9 | The goal is to provide a visual display similar to the empirical and
10 | probability histograms shown in page 313 of "Statistics", chapter 18.
11 |
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
13 |
14 |
15 | ## Data
16 |
17 | The app simulates rolling a pair of dice by default (the user can choose between
18 | one and 10 dice). The input parameters are the number of dice, the random seed, and
19 | the number of repetitions.
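A sketch of the simulation behind the empirical histogram (variable names are illustrative): roll the dice, take the product of the spots, and tally the relative frequencies.

```R
# Simulate the product of spots for a pair of dice
set.seed(12330)
reps <- 100
prods <- replicate(reps, prod(sample(1:6, size = 2, replace = TRUE)))
table(prods) / reps   # empirical relative frequencies of each product
```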
20 |
21 |
22 | ## Plots
23 |
24 | There are two tabs:
25 |
26 | 1. An empirical histogram.
27 | 2. A probability histogram (probability distribution).
28 |
29 |
30 | ## How to run it?
31 |
32 | ```R
33 | library(shiny)
34 |
35 | # Easiest way is to use runGitHub
36 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-roll-dice-product")
37 | ```
38 |
--------------------------------------------------------------------------------
/apps/ch18-roll-dice-product/app.R:
--------------------------------------------------------------------------------
1 | # Title: Roll dice and multiply spots
2 | # Description: Empirical vs Probability Histograms
3 | # Chapter 18: Probability Histograms, page 313
4 | # Author: Gaston Sanchez
5 |
6 | library(shiny)
7 | source("helpers.R")
8 |
9 | # Define UI for application that draws a histogram
10 | ui <- fluidPage(
11 |
12 | # Give the page a title
13 | titlePanel("Rolling Dice: Product"),
14 |
15 | # Generate a row with a sidebar
16 | sidebarLayout(
17 |
18 | # Define the sidebar with one input
19 | sidebarPanel(
20 | numericInput("dice", label = "Number of dice:", 2,
21 | min = 1, max = 10, step = 1),
22 | numericInput("seed", label = "Random Seed:", 12330,
23 | min = 10000, max = 50000, step = 1),
24 | sliderInput("reps", label = "Number of repetitions:",
25 | min = 100, max = 10000, value = 100, step= 10)
26 | ),
27 |
28 | # Create tabs for plots
29 | mainPanel(
30 | tabsetPanel(type = "tabs",
31 | tabPanel("Empirical", plotOutput("empiricalPlot")),
32 | tabPanel("Probability", plotOutput("probabilityPlot"))
33 | )
34 | )
35 | )
36 | )
37 |
38 |
39 | # Define server logic required to draw a histogram
40 | server <- function(input, output) {
41 |
42 |   # Empirical average of product of draws
43 |   output$num_heads <- renderPrint({
44 |     set.seed(input$seed)
45 |     total_points <- sapply(rep(input$dice, input$reps), prod_rolls)
46 |     # avg of products
47 |     mean(total_points)
48 |   })
49 | 
50 |   # Empirical SD of product of draws
51 |   output$prop_heads <- renderPrint({
52 |     set.seed(input$seed)
53 |     total_points <- sapply(rep(input$dice, input$reps), prod_rolls)
54 |     # SD of products
55 |     sd(total_points) * sqrt((input$reps - 1)/input$reps)
56 |   })
57 |
58 | # Empirical Histogram
59 | output$empiricalPlot <- renderPlot({
60 | set.seed(input$seed)
61 | total_points <- sapply(rep(input$dice, input$reps), prod_rolls)
62 | # put in relative terms
63 | prop_points <- 100 * table(total_points) / input$reps
64 | ymax <- find_ymax(max(prop_points), 2)
65 | # Render a barplot
66 | # Frequencies of products
67 | freq <- numeric((6^input$dice))
68 | freq[1:(6^input$dice) %in% names(prop_points)] <- prop_points
69 | names(freq) <- 1:(6^input$dice)
70 |
71 | # Render a barplot
72 | barplot(freq, space = 0, las = 1, border = "gray40",
73 | cex.names = 0.8, ylim = c(0, ymax),
74 | main = sprintf("%s Repetitions", input$reps))
75 | })
76 |
77 | # Probability Histogram
78 | output$probabilityPlot <- renderPlot({
79 | outcomes <- multiply_spots(input$dice)
80 | freq <- rep(0, 6^input$dice)
81 | freq[as.numeric(names(outcomes))] <- outcomes
82 | # Render a barplot
83 | barplot(100 * freq, names.arg = 1:6^input$dice,
84 | las = 1, border = "gray40",
85 | space = 0,
86 | xlab = "Number of spots",
87 | ylab = "Chance (%)",
88 | main = "Probability Histogram")
89 | })
90 |
91 | }
92 |
93 | # Run the application
94 | shinyApp(ui = ui, server = server)
95 |
96 |
--------------------------------------------------------------------------------
/apps/ch18-roll-dice-product/helpers.R:
--------------------------------------------------------------------------------
1 | # Helper functions to simulate rolling a die
2 | # and adding-or-multiplying the number of spots
3 |
4 | # roll one die
5 | roll_die <- function(times = 1) {
6 | die <- 1:6
7 | sample(die, times, replace = TRUE)
8 | }
9 |
10 |
11 | # roll a pair of dice
12 | roll_pair <- function() {
13 | roll_die(2)
14 | }
15 |
16 | # sum of spots
17 | sum_rolls <- function(times = 1) {
18 | sum(roll_die(times))
19 | }
20 |
21 | # product of numbers
22 | prod_rolls <- function(times = 1) {
23 | prod(roll_die(times))
24 | }
25 |
26 | # check whether 'x' is multiple of 'num'
27 | is_multiple <- function(x, num) {
28 | x %% num == 0
29 | }
30 |
31 | # find the y-max value for ylim in barplot()
32 | find_ymax <- function(x, num) {
33 | if (is_multiple(x, num)) {
34 | return(max(x))
35 | } else {
36 | return(num * ((x %/% num) + 1))
37 | }
38 | }
39 |
40 | # reps <- 100
41 | # total_points <- sapply(rep(2, reps), sum_rolls)
42 | # prop_points <- 100 * table(total_points) / reps
43 | # barplot(prop_points, las = 1,
44 | # space = 0, ylim = c(0, 30),
45 | # main = sprintf("%s Repetitions", reps))
46 |
47 |
48 |
49 |
50 | # function that multiplies spots to a given result
51 | multiply_rolls <- function(given) {
52 | results <- rep(0, length(given) * 6)
53 | aux <- 1
54 | for (i in 1:length(given)) {
55 | for (j in 1:6) {
56 | results[aux] <- given[i] * j
57 | aux <- aux + 1
58 | }
59 | }
60 | results
61 | }
62 |
63 | # function that computes theoretical probabilities
64 | # for the addition of spots when rolling "k" dice
65 | multiply_spots <- function(num_dice) {
66 | # just one die
67 | if (num_dice == 1) {
68 | outcomes <- table(1:6) / (6^num_dice)
69 | } else {
70 | # two or more dice
71 | current <- 1:6
72 | for (k in 2:num_dice) {
73 | current <- multiply_rolls(current)
74 | }
75 | outcomes <- table(current) / (6^num_dice)
76 | }
77 | outcomes
78 | }
79 |
80 | # multiply_spots(1)
81 | # multiply_spots(2)
82 | # multiply_spots(3)
83 |
84 |
85 |
86 |
--------------------------------------------------------------------------------
/apps/ch18-roll-dice-sum/README.md:
--------------------------------------------------------------------------------
1 | # Rolling Dice: Sum of Points
2 |
3 | This is a Shiny app that generates empirical histograms when simulating
4 | rolling dice and finding the total number of spots.
5 |
6 |
7 | ## Motivation
8 |
9 | The goal is to provide a visual display similar to the empirical and
10 | probability histograms shown in page 311 of "Statistics", chapter 18.
11 |
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
13 |
14 |
15 | ## Data
16 |
17 | The app simulates rolling a pair of dice by default (the user can choose between
18 | one and 10 dice). The input parameters are the number of dice, the random seed, and
19 | the number of repetitions.
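For the probability histogram, the exact chances for the sum of two dice can be sketched by enumerating all 36 equally likely outcomes (the helper functions in `helpers.R` do this for any number of dice):

```R
# Exact chances for the sum of spots on two dice via full enumeration
sums <- outer(1:6, 1:6, "+")   # all 36 equally likely outcomes
probs <- table(sums) / 36      # chance of each total from 2 to 12
probs["7"]                     # the most likely total: 6/36
```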
20 |
21 |
22 | ## Plots
23 |
24 | There are two tabs:
25 |
26 | 1. An empirical histogram.
27 | 2. A probability histogram (probability distribution).
28 |
29 |
30 | ## How to run it?
31 |
32 | ```R
33 | library(shiny)
34 |
35 | # Easiest way is to use runGitHub
36 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch18-roll-dice-sum")
37 | ```
38 |
--------------------------------------------------------------------------------
/apps/ch18-roll-dice-sum/app.R:
--------------------------------------------------------------------------------
1 | # Title: Roll dice and add spots
2 | # Description: Empirical vs Probability Histograms
3 | # Chapter 18: Probability Histograms, page 311
4 | # Author: Gaston Sanchez
5 |
6 | library(shiny)
7 | source("helpers.R")
8 |
9 | # Define UI for application that draws a histogram
10 | ui <- fluidPage(
11 |
12 | # Give the page a title
13 | titlePanel("Rolling Dice: Sum"),
14 |
15 | # Generate a row with a sidebar
16 | sidebarLayout(
17 |
18 | # Define the sidebar with one input
19 | sidebarPanel(
20 | numericInput("dice", label = "Number of dice:", 2,
21 | min = 1, max = 10, step = 1),
22 | numericInput("seed", label = "Random Seed:", 12330,
23 | min = 10000, max = 50000, step = 1),
24 | sliderInput("reps", label = "Number of repetitions:",
25 | min = 100, max = 10000, value = 100, step= 10)
26 | #hr(),
27 | #helpText('Average of sums:'),
28 | #verbatimTextOutput("num_heads"),
29 | #helpText('SD of sums:'),
30 | #verbatimTextOutput("prop_heads")
31 | ),
32 |
33 | # Create tabs for plots
34 | mainPanel(
35 | tabsetPanel(type = "tabs",
36 | tabPanel("Empirical", plotOutput("empiricalPlot")),
37 | tabPanel("Probability", plotOutput("probabilityPlot"))
38 | )
39 | )
40 | )
41 | )
42 |
43 |
44 | # Define server logic required to draw a histogram
45 | server <- function(input, output) {
46 |
47 | # Empirical average of sum of draws
48 | output$num_heads <- renderPrint({
49 | set.seed(input$seed)
50 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
51 | # avg of sums
52 | mean(total_points)
53 | })
54 |
55 | # Empirical SD of sum of draws
56 | output$prop_heads <- renderPrint({
57 | set.seed(input$seed)
58 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
59 |     # SD of sums
60 | sd(total_points) * sqrt((input$reps - 1)/input$reps)
61 | })
62 |
63 | # Empirical Histogram
64 | output$empiricalPlot <- renderPlot({
65 | set.seed(input$seed)
66 | total_points <- sapply(rep(input$dice, input$reps), sum_rolls)
67 | # put in relative terms
68 | prop_points <- 100 * table(total_points) / input$reps
69 | ymax <- find_ymax(max(prop_points), 2)
70 | # Render a barplot
71 | barplot(prop_points, las = 1, border = "gray40",
72 | space = 0, ylim = c(0, ymax),
73 | xlab = "Number of spots",
74 | ylab = "Relative Frequency",
75 | main = sprintf("%s Repetitions", input$reps))
76 | })
77 |
78 | # Probability Histogram
79 | output$probabilityPlot <- renderPlot({
80 | outcomes <- sum_spots(input$dice)
81 | # Render a barplot
82 | barplot(100 * outcomes,
83 | las = 1, border = "gray40",
84 | space = 0,
85 | xlab = "Number of spots",
86 | ylab = "Chance (%)",
87 | main = "Probability Histogram")
88 | })
89 |
90 | }
91 |
92 | # Run the application
93 | shinyApp(ui = ui, server = server)
94 |
95 |
--------------------------------------------------------------------------------
/apps/ch18-roll-dice-sum/helpers.R:
--------------------------------------------------------------------------------
1 | # Helper functions to simulate rolling a die
2 | # and adding-or-multiplying the number of spots
3 |
4 | # roll one die
5 | roll_die <- function(times = 1) {
6 | die <- 1:6
7 | sample(die, times, replace = TRUE)
8 | }
9 |
10 |
11 | # roll a pair of dice
12 | roll_pair <- function() {
13 | roll_die(2)
14 | }
15 |
16 | # sum of spots
17 | sum_rolls <- function(times = 1) {
18 | sum(roll_die(times))
19 | }
20 |
21 | # product of numbers
22 | prod_rolls <- function(times = 1) {
23 | prod(roll_die(times))
24 | }
25 |
26 | # check whether 'x' is multiple of 'num'
27 | is_multiple <- function(x, num) {
28 | x %% num == 0
29 | }
30 |
31 | # find the y-max value for ylim in barplot()
32 | find_ymax <- function(x, num) {
33 | if (is_multiple(x, num)) {
34 | return(max(x))
35 | } else {
36 | return(num * ((x %/% num) + 1))
37 | }
38 | }
39 |
40 | # reps <- 100
41 | # total_points <- sapply(rep(2, reps), sum_rolls)
42 | # prop_points <- 100 * table(total_points) / reps
43 | # barplot(prop_points, las = 1,
44 | # space = 0, ylim = c(0, 30),
45 | # main = sprintf("%s Repetitions", reps))
46 |
47 |
48 |
49 |
50 | # function that adds spots to a given result
51 | add_rolls <- function(given) {
52 | results <- rep(0, length(given) * 6)
53 | aux <- 1
54 | for (i in 1:length(given)) {
55 | for (j in 1:6) {
56 | results[aux] <- given[i] + j
57 | aux <- aux + 1
58 | }
59 | }
60 | results
61 | }
62 |
63 | # function that computes theoretical probabilities
64 | # for the addition of spots when rolling "k" dice
65 | sum_spots <- function(num_dice) {
66 | # just one die
67 | if (num_dice == 1) {
68 | outcomes <- table(1:6) / (6^num_dice)
69 | } else {
70 | # two or more dice
71 | current <- 1:6
72 | for (k in 2:num_dice) {
73 | current <- add_rolls(current)
74 | }
75 | outcomes <- table(current) / (6^num_dice)
76 | }
77 | outcomes
78 | }
79 |
80 | # sum_spots(1)
81 | # sum_spots(2)
82 | # sum_spots(3)
83 |
84 |
85 |
86 |
--------------------------------------------------------------------------------
/apps/ch20-sampling-men/README.md:
--------------------------------------------------------------------------------
1 | # Sampling Men
2 |
3 | This is a Shiny app that illustrates the concept of chance errors in sampling.
4 |
5 |
6 | ## Motivation
7 |
8 | The goal is to provide a visual display for the Introduction example in
9 | __Statistics, Chapter 20: Chance Errors in Sampling__
10 |
11 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007). Fourth Edition. Norton & Company.
12 |
13 |
14 | ## Data
15 |
16 | The data consists of a box model with 6672 tickets: 3091 __1's__, and 3581 __0's__.
17 | The tickets marked `1` represent men, while those marked `0` represent women.
18 | The app simulates taking samples from the box. There are two parameters: the sample size, and the number of samples (i.e. the number of repetitions).
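One sample from this box can be sketched as follows (this sketch draws without replacement, `sample()`'s default; the app's own sampling scheme may differ):

```R
# Box model: 3091 men (tickets '1') and 3581 women (tickets '0')
set.seed(12345)
box <- rep(c(1, 0), times = c(3091, 3581))  # 6672 tickets in total
smpl <- sample(box, size = 100)             # one sample of 100 draws
sum(smpl)                                   # number of men in the sample
mean(smpl) * 100                            # percentage of men in the sample
```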
19 |
20 |
21 | ## Plots
22 |
23 | There are two plots:
24 |
25 | 1. The first tab shows a histogram with the number of men in the samples.
26 | 2. The second tab shows a histogram with the percentage of men in the samples.
27 |
28 |
29 | ## How to run it?
30 |
31 | There are many ways to download the app and run it:
32 |
33 | ```R
34 | library(shiny)
35 |
36 | # Easiest way is to use runGitHub
37 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch20-sampling-men")
38 | ```
39 |
--------------------------------------------------------------------------------
/apps/ch20-sampling-men/app.R:
--------------------------------------------------------------------------------
1 | # Title: Sampling Men
2 | # Description: Chance errors in sampling (number and percentage of men)
3 | # Chapter 20: Chance Errors in Sampling, page 359
4 | # Author: Gaston Sanchez
5 |
6 | library(shiny)
7 |
8 | # Define UI for application that draws a histogram
9 | ui <- fluidPage(
10 |
11 | # Give the page a title
12 | titlePanel("Sampling Men (p 359)"),
13 |
14 | # Generate a row with a sidebar
15 | sidebarLayout(
16 |
17 | # Define the sidebar with one input
18 | sidebarPanel(
19 | fluidRow(
20 | column(5,
21 | numericInput("tickets1", "men [1]", 3091,
22 | min = 1, max = 4000, step = 1)),
23 | column(5,
24 | numericInput("tickets0", "women [0]", 3581,
25 | min = 1, max = 4000, step = 1))
26 | ),
27 | # helpText('Avg box, SD box'),
28 | verbatimTextOutput("avg_sd_box"),
29 | numericInput("size", label = "Sample Size (# draws):", value = 100,
30 | min = 10, max = 1500, step = 1),
31 | sliderInput("reps", label = "Number of repetitions:",
32 | min = 50, max = 2000, value = 100, step = 50),
33 | numericInput("seed", label = "Random Seed:", 12345,
34 | min = 10000, max = 50000, step = 1),
35 | hr(),
36 | helpText('Number average'),
37 | verbatimTextOutput("num_avg"),
38 | helpText('Percent average'),
39 | verbatimTextOutput("perc_avg")
40 | ),
41 |
42 | # Create tabs for plots
43 | mainPanel(
44 | tabsetPanel(type = "tabs",
45 | tabPanel("Number", plotOutput("numberPlot")),
46 | tabPanel("Percentage", plotOutput("percentPlot"))
47 | )
48 | )
49 | )
50 | )
51 |
52 |
53 | # Define server logic required to draw a histogram
54 | server <- function(input, output) {
55 |
56 | # Number of men
57 | output$avg_sd_box <- renderPrint({
58 | total <- input$tickets1 + input$tickets0
59 | avg_box <- input$tickets1 / total
60 | sd_box <- sqrt((input$tickets1/total) * (input$tickets0/total))
61 | cat(sprintf('Avg = %0.3f, SD = %0.3f', avg_box, sd_box))
62 | })
63 |
64 | num_men <- reactive({
65 | tickets <- rep(c(1, 0), c(input$tickets1, input$tickets0))
66 |
67 | set.seed(input$seed)
68 | size <- input$size
69 | samples <- 1:input$reps
70 | for (i in 1:input$reps) {
71 | samples[i] <- sum(sample(tickets, size = size))
72 | }
73 | samples
74 | })
75 |
76 | # Number of men
77 | output$num_avg <- renderPrint({
78 | round(mean(num_men()), 2)
79 | })
80 |
81 | # Percentage of men
82 | output$perc_avg <- renderPrint({
83 | round(100 * mean(num_men() / input$size), 2)
84 | })
85 |
86 | # Plot with number of men in samples
87 | output$numberPlot <- renderPlot({
88 | # Render a barplot
89 | barplot(table(num_men()),
90 | space = 0, las = 1,
91 | xlab = 'Number of men',
92 | ylab = '',
93 | main = 'Sample Men')
94 | })
95 |
96 | # Plot with percentage of men in samples
97 | output$percentPlot <- renderPlot({
98 | # Render a barplot
99 | percentage_men <- round(100 * num_men() / input$size)
100 | barplot(table(percentage_men) / length(num_men()),
101 | space = 0, las = 1,
102 | xlab = 'Percentage of men',
103 | ylab = 'Proportion',
104 | main = 'Sample Men')
105 | })
106 |
107 | }
108 |
109 | # Run the application
110 | shinyApp(ui = ui, server = server)
111 |
112 |
--------------------------------------------------------------------------------
/apps/ch21-accuracy-percentages/README.md:
--------------------------------------------------------------------------------
1 | # Ch21 - Percent Estimation
2 |
3 | This is a Shiny app that illustrates the concept of accuracy of percentages;
4 | in other words, confidence intervals when estimating a percentage.
5 |
6 |
7 | ## Motivation
8 |
9 | The goal is to provide a visual display for the various examples in
10 | __Statistics, Chapter 21: Accuracy of Percentages__
11 |
12 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007).
13 | Fourth Edition. Norton & Company.
14 |
15 |
16 | ## Data
17 |
18 | - The data consists of a box model with two types of tickets: 0's and 1's.
19 | - The user can specify the number of each type of ticket (# of 1's, # of 0's).
20 | - The app simulates drawing tickets with replacement from the box.
21 | - There are three arguments:
22 |     + the number of draws (i.e. sample size)
23 |     + the number of samples (i.e. # of repetitions)
24 |     + the confidence level
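
The interval calculation behind the third plot can be sketched in a few lines. This is a hedged sketch, not the app's code: the box contents, number of draws, seed, and confidence level below are example values:

```R
# Example box: 5 tickets marked 1 and 5 marked 0
tickets <- rep(c(1, 0), times = c(5, 5))
draws <- 25

# SD of a 0-1 box, and SE for the sample percentage
sd_box <- sqrt(mean(tickets) * (1 - mean(tickets)))
se_perc <- 100 * sd_box / sqrt(draws)

# z value for a 68% confidence level (central area under the normal curve)
z <- qnorm(0.5 + 68 / 200)

# one sample, drawn with replacement, and its confidence interval
set.seed(12345)
perc <- 100 * mean(sample(tickets, size = draws, replace = TRUE))
c(lower = perc - z * se_perc, upper = perc + z * se_perc)
```

The app repeats this for every sample and colors each interval by whether it covers the box percentage.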
25 |
26 |
27 | ## Plots
28 |
29 | There are three plots:
30 |
31 | 1. The first tab shows a histogram for the sum of draws.
32 | 2. The second tab shows a histogram for the percentage of tickets 1's.
33 | 3. The third tab shows a chart with the percentage of the box (i.e. population percentage),
34 | and the confidence intervals of the drawn samples.
35 |
36 |
37 | ## How to run it?
38 |
39 | ```R
40 | library(shiny)
41 |
42 | # Easiest way is to use runGitHub
43 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch21-accuracy-percentages")
44 | ```
45 |
--------------------------------------------------------------------------------
/apps/ch21-accuracy-percentages/app.R:
--------------------------------------------------------------------------------
1 | # Box with two types of tickets [# 1's, # 0's]
2 | # Drawing tickets from the box
3 | # Chapter 21: Accuracy of Percentages
4 |
5 | library(shiny)
6 |
7 | source('helpers.R')
8 |
9 | # Define the overall UI
10 | ui <- fluidPage(
11 |
12 | # Give the page a title
13 | titlePanel("Accuracy of Percentages"),
14 |
15 | # Generate a row with a sidebar
16 | sidebarLayout(
17 |
18 | # Define the sidebar with one input
19 | sidebarPanel(
20 | fluidRow(
21 | column(5,
22 | numericInput("tickets1", "# Tickets 1", 5,
23 | min = 1, max = 100, step = 1)),
24 | column(5,
25 | numericInput("tickets0", "# Tickets 0", 5,
26 | min = 1, max = 200, step = 1))
27 | ),
28 | helpText('Avg of box, and SD of box'),
29 | verbatimTextOutput("avg_sd_box"),
30 | hr(),
31 | sliderInput("draws", label = "Sample size (# draws):", value = 25,
32 | min = 5, max = 500, step = 1),
33 | numericInput("reps", label = "Number of samples (# reps):",
34 | min = 10, max = 1000, value = 50, step = 10),
35 | checkboxInput('param', value = TRUE, label = strong('Show parameter')),
36 | sliderInput("confidence", label = "Confidence level (%):", value = 68,
37 | min = 1, max = 99, step = 1),
38 | numericInput("seed", label = "Random Seed:", 12345,
39 | min = 10000, max = 50000, step = 1)
40 | ),
41 |
42 | # Create a spot for the barplot
43 | mainPanel(
44 | tabsetPanel(type = "tabs",
45 | tabPanel("Sum", plotOutput("sumPlot")),
46 | tabPanel("Percentage", plotOutput("percentPlot")),
47 | tabPanel("Estimates", plotOutput("intervalPlot"))
48 | )
49 | )
50 | )
51 | )
52 |
53 |
54 |
55 | # Define server logic required to draw a histogram
56 | server <- function(input, output) {
57 | tickets <- reactive({
58 | tickets <- c(rep(1, input$tickets1), rep(0, input$tickets0))
59 | })
60 |
61 | sum_draws <- reactive({
62 | set.seed(input$seed)
63 | samples <- 1:input$reps
64 | for (i in 1:input$reps) {
65 | samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE))
66 | }
67 | samples
68 | })
69 |
70 | avg_box <- reactive({
71 | mean(tickets())
72 | })
73 |
74 | sd_box <- reactive({
75 | total <- input$tickets1 + input$tickets0
76 | sqrt((input$tickets1 / total) * (input$tickets0 / total))
77 | })
78 |
79 | # Average and SD of box
80 | output$avg_sd_box <- renderPrint({
81 | cat(avg_box(), ", ", sd_box(), sep = '')
82 | })
83 |
84 | # Plot with sum of draws
85 | output$sumPlot <- renderPlot({
86 | # Render a barplot
87 | barplot(table(sum_draws()),
88 | space = 0, las = 1,
89 | xlab = 'Sum',
90 | ylab = '',
91 | main = sprintf('Sum of Box for %s Draws', input$draws))
92 | })
93 |
94 | # Plot with percentage of draws
95 | output$percentPlot <- renderPlot({
96 | # Render a barplot (percentages, not proportions)
97 | perc_draws <- round(100 * sum_draws() / input$draws, 2)
98 | barplot(table(perc_draws),
99 | space = 0, las = 1,
100 | xlab = 'Percentage',
101 | ylab = '',
102 | main = "Percentage of 1's")
103 | })
104 |
105 | # Plot with confidence intervals
106 | output$intervalPlot <- renderPlot({
107 | avg_box <- mean(tickets())
108 | n <- length(tickets())
109 | sd_box <- sqrt((n-1)/n) * sd(tickets())
110 | se_sum <- sqrt(input$draws) * sd_box
111 | se_perc <- se_sum / input$draws
112 |
113 | # Render plot
114 | samples <- sum_draws() / input$draws
115 |
116 | a <- samples - ci_factor(input$confidence) * se_perc
117 | b <- samples + ci_factor(input$confidence) * se_perc
118 | covers <- (a <= avg_box & avg_box <= b)
119 | ci_cols <- rep('#ff000088', input$reps)
120 | ci_cols[covers] <- '#0000ff88'
121 |
122 | xlim <- c(min(samples) - 3 * se_perc,
123 | max(samples) + 3 * se_perc)
129 | plot(samples, 1:length(samples), axes = FALSE,
130 | col = '#444444', pch = 21, cex = 0.5,
131 | xlim = xlim,
132 | ylab = 'Number of samples',
133 | xlab = "Confidence Intervals",
134 | main = "Percentage of 1's")
135 | axis(side = 1, at = seq(0, 1, 0.1))
136 | axis(side = 2, las = 1)
137 | if (input$param) {
138 | # display line for parameter
139 | abline(v = avg_box, col = '#0000FFdd', lwd = 2.5)
140 | }
141 | segments(x0 = a,
142 | x1 = b,
143 | y0 = 1:length(samples),
144 | y1 = 1:length(samples),
145 | col = ci_cols)
146 | })
147 |
148 | }
149 |
150 |
151 | # Run the application
152 | shinyApp(ui = ui, server = server)
153 |
154 |
--------------------------------------------------------------------------------
/apps/ch21-accuracy-percentages/helpers.R:
--------------------------------------------------------------------------------
1 | # Function to compute the number of SEs (z value)
2 | # corresponding to the central area of a given confidence level
3 | ci_factor <- function(level = 95) {
4 | area <- level + ((100 - level) / 2)
5 | qnorm(area/100)
6 | }
7 |
8 | # tests
9 |
10 | # 90% confidence level
11 | # ci_factor(90)
12 |
13 | # 95% confidence level
14 | # ci_factor(95)
15 |
16 | # 99% confidence level
17 | # ci_factor(99)
18 |
--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/README.md:
--------------------------------------------------------------------------------
1 | # Ch23 - Accuracy of Averages
2 |
3 | This is a Shiny app that illustrates the concept of accuracy of averages.
4 |
5 |
6 | ## Motivation
7 |
8 | The goal is to provide a visual display for the Introduction example in
9 | __Statistics, Chapter 23: Accuracy of Averages__
10 |
11 | Reference: "Statistics" by David Freedman, Robert Pisani and Roger Purves (2007).
12 | Fourth Edition. Norton & Company.
13 |
14 |
15 | ## Data
16 |
17 | The data consists of a box model with default tickets: 1, 2, 3, 4, 5, 6, 7.
18 | However, the numbers in the box can be changed by the user.
19 | The app simulates taking random samples from the box.
20 | There are two parameters: one is the number of draws (i.e. sample size),
21 | and the other is the number of samples (i.e. # of repetitions).
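
The interval logic, adapted to averages, can be sketched as follows. This is an illustrative sketch, not the app's code: the box contents are the app defaults, while the number of draws, seed, and confidence level are example values:

```R
# Default box: tickets 1 through 7
tickets <- 1:7
draws <- 25

# population SD of the box, and SE for the sample average
n <- length(tickets)
sd_box <- sqrt((n - 1) / n) * sd(tickets)
se_avg <- sd_box / sqrt(draws)

# z value for a 95% confidence level (central area under the normal curve)
z <- qnorm(0.5 + 95 / 200)

# one sample average and its confidence interval
set.seed(12345)
avg <- mean(sample(tickets, size = draws, replace = TRUE))
c(lower = avg - z * se_avg, upper = avg + z * se_avg)
```

The app repeats this for every sample, coloring each interval by whether it covers the box average.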
22 |
23 |
24 | ## Plots
25 |
26 | There are three plots:
27 |
28 | 1. The first tab shows a histogram for the sum of draws.
29 | 2. The second tab shows a histogram for the average of draws.
30 | 3. The third tab shows a chart with the average of the box (i.e. population avg),
31 | and the confidence intervals of the drawn samples (i.e. sample averages).
32 |
33 |
34 | ## How to run it?
35 |
36 | ```R
37 | library(shiny)
38 |
39 | # Easiest way is to use runGitHub
40 | runGitHub("introstat-spring-2017", "ucb-introstat", subdir = "apps/ch23-accuracy-averages")
41 | ```
42 |
--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/app.R:
--------------------------------------------------------------------------------
1 | # Box with numbered tickets (user-specified)
2 | # Drawing tickets from the box
3 | # Chapter 23: Accuracy of Averages
4 |
5 | library(shiny)
6 |
7 | source('helpers.R')
8 |
9 | # Define the overall UI
10 | ui <- fluidPage(
11 |
12 | # Give the page a title
13 | titlePanel("Accuracy of Averages"),
14 |
15 | # Generate a row with a sidebar
16 | sidebarLayout(
17 |
18 | # Define the sidebar with one input
19 | sidebarPanel(
20 | textInput("tickets", label = "Numbers in box:",
21 | value = '1, 2, 3, 4, 5, 6, 7'),
22 | helpText('Avg of box, and SD of box'),
23 | verbatimTextOutput("avg_sd_box"),
24 | hr(),
25 | sliderInput("draws", label = "Sample size (# draws):", value = 25,
26 | min = 5, max = 500, step = 1),
27 | sliderInput("reps", label = "Number of samples (# reps):",
28 | min = 10, max = 1000, value = 50, step = 10),
29 | checkboxInput('param', value = TRUE, label = strong('Show parameter')),
30 | sliderInput("confidence", label = "Confidence level (%):", value = 68,
31 | min = 1, max = 99, step = 1),
32 | numericInput("seed", label = "Random Seed:", 12345,
33 | min = 10000, max = 50000, step = 1)
34 | ),
35 |
36 | # Create a spot for the barplot
37 | mainPanel(
38 | tabsetPanel(type = "tabs",
39 | tabPanel("Sum", plotOutput("sumPlot")),
40 | tabPanel("Average", plotOutput("averagePlot")),
41 | tabPanel("Estimates", plotOutput("intervalPlot"))
42 | )
43 | )
44 | )
45 | )
46 |
47 |
48 |
49 | # Define server logic required to draw a histogram
50 | server <- function(input, output) {
51 | tickets <- reactive({
52 | tickets <- gsub(' ', '', input$tickets)
53 | tickets <- unlist(strsplit(tickets, ','))
54 | as.numeric(tickets)
55 | })
56 |
57 | sum_draws <- reactive({
58 | set.seed(input$seed)
59 | samples <- 1:input$reps
60 | for (i in 1:input$reps) {
61 | samples[i] <- sum(sample(tickets(), size = input$draws, replace = TRUE))
62 | }
63 | samples
64 | })
65 |
66 | avg_box <- reactive({
67 | mean(tickets())
68 | })
69 |
70 | sd_box <- reactive({
71 | n <- length(tickets())
72 | sqrt((n-1)/n) * sd(tickets())
73 | })
74 |
75 | # Average and SD of box
76 | output$avg_sd_box <- renderPrint({
77 | cat(avg_box(), ", ", sd_box(), sep = '')
78 | })
79 |
80 | # Plot with sum of draws
81 | output$sumPlot <- renderPlot({
82 | # Render a barplot
83 | barplot(table(sum_draws()),
84 | space = 0, las = 1,
85 | xlab = 'Sum',
86 | ylab = '',
87 | main = sprintf('Sum of Box for %s Draws', input$draws))
88 | })
89 |
90 | # Plot with average of draws
91 | output$averagePlot <- renderPlot({
92 | # Render a barplot
93 | avg_draws <- round(sum_draws() / input$draws, 2)
94 | barplot(table(avg_draws),
95 | space = 0, las = 1,
96 | xlab = 'Average',
97 | ylab = '',
98 | main = "Average")
99 | })
100 |
101 | # Plot with confidence intervals
102 | output$intervalPlot <- renderPlot({
103 | avg_box <- mean(tickets())
104 | n <- length(tickets())
105 | sd_box <- sqrt((n-1)/n) * sd(tickets())
106 | se_sum <- sqrt(input$draws) * sd_box
107 | se_avg <- se_sum / input$draws
108 |
109 | # Render plot
110 | samples <- sum_draws() / input$draws
111 |
112 | a <- samples - ci_factor(input$confidence) * se_avg
113 | b <- samples + ci_factor(input$confidence) * se_avg
114 | covers <- (a <= avg_box & avg_box <= b)
115 | ci_cols <- rep('#ff000088', input$reps)
116 | ci_cols[covers] <- '#0000ff88'
117 |
118 | xlim <- c(min(samples) - 3 * se_avg,
119 | max(samples) + 3 * se_avg)
125 | plot(samples, 1:length(samples), axes = FALSE,
126 | col = '#444444', pch = 21, cex = 0.5,
127 | xlim = xlim,
128 | ylab = 'Number of samples',
129 | xlab = "Confidence Intervals",
130 | main = "Average")
131 | axis(side = 1)
132 | axis(side = 2, las = 1)
133 | if (input$param) {
134 | # display line for parameter
135 | abline(v = avg_box, col = '#0000FFdd', lwd = 2.5)
136 | }
137 | segments(x0 = a,
138 | x1 = b,
139 | y0 = 1:length(samples),
140 | y1 = 1:length(samples),
141 | col = ci_cols)
142 | })
143 |
144 | }
145 |
146 |
147 | # Run the application
148 | shinyApp(ui = ui, server = server)
149 |
150 |
--------------------------------------------------------------------------------
/apps/ch23-accuracy-averages/helpers.R:
--------------------------------------------------------------------------------
1 | # Function to compute the number of SEs (z value)
2 | # corresponding to the central area of a given confidence level
3 | ci_factor <- function(level = 95) {
4 | area <- level + ((100 - level) / 2)
5 | qnorm(area/100)
6 | }
7 |
8 | # tests
9 |
10 | # 90% confidence level
11 | # ci_factor(90)
12 |
13 | # 95% confidence level
14 | # ci_factor(95)
15 |
16 | # 99% confidence level
17 | # ci_factor(99)
18 |
--------------------------------------------------------------------------------
/data/stock-earnings-prices.csv:
--------------------------------------------------------------------------------
1 | "industry","earnings","price"
2 | "auto",3.3,2.9
3 | "banks",8.6,6.5
4 | "chemicals",6.6,3.1
5 | "computers",10.2,5.3
6 | "drugs",11.3,10.0
7 | "electrical equipment",8.5,8.2
8 | "food",7.6,6.5
9 | "household products",9.7,10.1
10 | "machinery",5.1,4.7
11 | "oil domestic",7.4,7.3
12 | "oil international",7.7,7.7
13 | "oil equipment",10.1,10.8
14 | "railroad",6.6,6.6
15 | "retail food",6.9,6.9
16 | "department stores",10.1,9.5
17 | "soft drinks",12.7,12.0
18 | "steel",-1.0,-1.6
19 | "tobacco",12.3,11.7
20 | "utilities electric",2.8,1.4
21 | "utilities gas",5.2,6.2
22 |
--------------------------------------------------------------------------------
/data/vegetables-smoking.csv:
--------------------------------------------------------------------------------
1 | state,vegetables,smoking
2 | Alabama,20.1,18.8
3 | Alaska,24.8,18.8
4 | Arizona,23.7,13.7
5 | Arkansas,21,18.1
6 | California,28.9,9.8
7 | Colorado,24.5,13.5
8 | Connecticut,27.4,12.4
9 | Delaware,21.3,15.5
10 | Florida,26.2,15.2
11 | Georgia,23.2,16.4
12 | Hawaii,24.5,12.1
13 | Idaho,23.2,13.3
14 | Illinois,24,14.2
15 | Indiana,22,20.8
16 | Iowa,19.5,16.1
17 | Kansas,19.9,13.6
18 | Kentucky,16.8,23.5
19 | Louisiana,20.2,16.4
20 | Maine,28.7,15.9
21 | Maryland,28.7,13.4
22 | Massachusetts,28.6,13.5
23 | Michigan,22.8,16.7
24 | Minnesota,24.5,14.9
25 | Mississippi,16.5,18.6
26 | Missouri,22.6,18.5
27 | Montana,24.7,14.5
28 | Nebraska,20.2,16.1
29 | Nevada,22.5,16.6
30 | NewHampshire,29.1,15.4
31 | NewJersey,25.9,12.8
32 | NewMexico,21.5,14.6
33 | NewYork,26,14.6
34 | NorthCarolina,22.5,17.1
35 | NorthDakota,21.8,15
36 | Ohio,22.6,17.6
37 | Oklahoma,15.7,19
38 | Oregon,25.9,13.4
39 | Pennsylvania,23.9,17.9
40 | RhodeIsland,26.8,15.3
41 | SouthCarolina,21.2,17
42 | SouthDakota,20.5,13.8
43 | Tennessee,26.5,20.4
44 | Texas,22.6,13.2
45 | Utah,22.1,8.5
46 | Vermont,30.8,14.4
47 | Virginia,26.2,15.3
48 | Washington,25.2,12.5
49 | WestVirginia,20,21.3
50 | Wisconsin,22.2,15.9
51 | Wyoming,21.8,16.3
52 |
--------------------------------------------------------------------------------
/hw/README.md:
--------------------------------------------------------------------------------
1 | ## Homework Assignments
2 |
3 | - HW assignments are due on Thursdays (before midnight).
4 | - Further instructions will be posted on bCourses (see "Assignments" section).
5 | - Submit your homework electronically via bCourses as a word, text, pdf, or html file.
6 | - Please do NOT submit any other file format (e.g. `.pages`, `.Rmd`, `.R`) since it won't be rendered on bCourses.
7 | - Please become familiar with the HW policy described in the syllabus.
8 |
9 |
10 | Tentative Calendar, Spring 2017
11 |
12 | | HW | Due    | Topic |
13 | |----|--------|-------|
14 | | 1  | Jan 26 | Ch-3: A3,7, C2, R4,8, extra questions |
15 | | 2  | Feb 02 | Ch-4: B5, D8, E4, R6,9, extra questions |
16 | | 3  | Feb 09 | Ch-5: C1, D1, E3, Rev7,10 <br> Ch-8: B6,8,9, R9, extra questions |
17 | | 4  | Feb 16 | Ch-9: A10, B2, E3, R4,8 <br> Ch-10: A2,4, C4, extra questions |
18 | | 5  | Feb 23 | Ch-10: C2, R3,4 <br> Ch-11: B1,2, D2, E1, R4,7 <br> Ch-12: R3,5 |
19 | | 6  | Mar 02 | Ch-13: R2,4,7,8,9 |
20 | | 7  | Mar 09 | Ch-14: R1,3,5,6,9 and Binomial Probability |
21 | | 8  | Mar 16 | Ch-16: B2,6, R1,4,9 <br> Ch-17: A1, B2, C1, E1, R2 |
22 | | 9  | Mar 23 | Ch-18: B3, B5, C5, R2 <br> Ch-19: R5,7 <br> Ch-20: A4, B3, Rev3,4,6 |
23 | | 10 | Apr 13 | Ch-21: A7,8, B4, C6,7, R2,7 <br> Ch-23: A2,5, C2, R4,10,12 |
24 | | 11 | Apr 20 | Ch-26: B5, C1, F1, F7, R2,5,7,8,9 |
25 | | 12 | Apr 27 | Ch-26: F4, R1; Ch-27: R1, R10 <br> Ch-29: R4,7,9,11 |
--------------------------------------------------------------------------------
/hw/hw01-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw01-questions.pdf
--------------------------------------------------------------------------------
/hw/hw02-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw02-questions.pdf
--------------------------------------------------------------------------------
/hw/hw03-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw03-questions.pdf
--------------------------------------------------------------------------------
/hw/hw04-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw04-questions.pdf
--------------------------------------------------------------------------------
/hw/hw05-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw05-questions.pdf
--------------------------------------------------------------------------------
/hw/hw06-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw06-questions.pdf
--------------------------------------------------------------------------------
/hw/hw07-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw07-questions.pdf
--------------------------------------------------------------------------------
/hw/hw08-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw08-questions.pdf
--------------------------------------------------------------------------------
/hw/hw09-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw09-questions.pdf
--------------------------------------------------------------------------------
/hw/hw10-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw10-questions.pdf
--------------------------------------------------------------------------------
/hw/hw11-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw11-questions.pdf
--------------------------------------------------------------------------------
/hw/hw12-questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/hw/hw12-questions.pdf
--------------------------------------------------------------------------------
/labs/README.md:
--------------------------------------------------------------------------------
1 | ## Lab Discussions
2 |
3 | Tentative Calendar, Spring 2017
4 |
5 | | Week | Lab | Topic |
6 | |------|-----|-------|
7 | | 1  | Jan 23-24 <br> Jan 25-26 | Ch-3: A4,5,6, B2, C4 <br> Ch-4: A4,5,6,9, B3,4 |
8 | | 2  | Jan 30-31 <br> Feb 01-02 | Ch-4: C4,5, D1,2,6, E1,5,6,7 <br> Ch-5: A1,2, B1,2,5, F1, R11 |
9 | | 3  | Feb 06-07 <br> Feb 08-09 | Ch-8: A5,6, B4,7, D2,3 <br> Ch-9: A2-5,9, B1, C4, D1, E4 |
10 | | 4  | Feb 13-14 <br> Feb 15-16 | Ch-10: A1, B1,4, C1,5, D1, E1,2 <br> Ch-11: A4,6,8 |
11 | | 5  | Feb 21 <br> Feb 22-23 <br> Feb 24 | Ch-11: B1,2,3, D4,5,6,7, E2,3 <br> Ch-12: A2, B3,4,5 <br> Test 1 |
12 | | 6  | Feb 28 <br> Mar 02 | Ch-13: A1, B1,2, C1, D2,4,5,6 <br> Ch-14: B1,3, C3, D4 |
13 | | 7  | Mar 07 <br> Mar 09 | Ch-15: A3-6, R1,5,6 <br> Ch-16: A3,4, B1,3, C1, R6 |
14 | | 8  | Mar 14 <br> Mar 16 | Ch-17: B1,3,5,6 <br> Ch-17: D4, R6,11 |
15 | | 9  | Mar 21 <br> Mar 23 | Ch-18: B1, C2,7,8, R3,12,13,14 <br> Ch-19: A5,6,8,12, R6,12 |
16 | | 10 | Mar 28 <br> Mar 30 | Spring Break |
17 | | 11 | Apr 04 <br> Apr 06 <br> Apr 07 | Ch-20: A1,2, B1,2 <br> Ch-20: C1,3,5 <br> Test 2 |
18 | | 12 | Apr 11 <br> Apr 13 | Ch-21: A4,5,6, B3, C4,5, D2, E1,2 <br> Ch-23: A1, B1, C3, D1,2,3,4, R3,8 |
19 | | 13 | Apr 18 <br> Apr 20 | Ch-26: A4,5, B2, C4,5, D3 <br> Extra exercises |
20 | | 14 | Apr 25 <br> Apr 27 | Ch-27: A4,5, B2,3,5, D5,6 <br> Ch-29: A2, B2,3,4,5,6,7, D2, E2 |
--------------------------------------------------------------------------------
/lectures/README.md:
--------------------------------------------------------------------------------
1 | ## Lectures
2 |
3 | Tentative Calendar, Spring 2017.
4 |
5 | Material based on __Statistics__ (4th edition) by Freedman, Pisani and Purves.
6 |
7 |
8 | | Week | Date | Monday | Wednesday | Friday |
9 | |------|--------|-----------------------------|-------------------------|-----------------------|
10 | | 0 | Jan-16 | | Data and variables | Intro to R & RStudio |
11 | | 1 | Jan-23 | Ch 3: Histograms | Ch 4: Average | Ch 4: Spread |
12 | | 2 | Jan-30 | Ch 5: Normal curve | Ch 5: Normal Curve | Ch 8: Correlation |
13 | | 3 | Feb-06 | Ch 9: More Correlation | History of Regression | Ch 10: Regression |
14 | | 4 | Feb-13 | Ch 11: RMS Error | Ch 12: Regression line | Regression in R |
15 | | 5 | Feb-20 | _Holiday_ | Review | __MIDTERM 1__ |
16 | | 6 | Feb-27 | Ch 13: Probability | Ch 14: More Probability | Ch 15: Binomial prob. |
17 | | 7 | Mar-06 | Ch 16: Law of Averages | Ch 16: Box Models | Ch 17: Expected Value |
18 | | 8 | Mar-13 | Ch 17: Standard Error | Ch 18: Normal Approx | Ch 18: Normal Approx |
19 | | 9 | Mar-20 | Ch 19: Sampling | Ch 19: Sampling | Ch 20: Chance Errors |
20 | | 10 | Mar-27 | _Spring Break_ | _Spring Break_ | _Spring Break_ |
21 | | 11 | Apr-03 | Review | Review | __MIDTERM 2__ |
22 | | 12 | Apr-10 | Ch 21: Accuracy Percentages | Ch 21: Conf. Intervals | Ch 23: Accuracy Averages|
23 | | 13 | Apr-17 | Ch 26: Significance Tests | Ch 26: z-test | Ch 26: t-test |
24 | | 14 | Apr-24 | Ch 27: Two-sample z-test | Ch 27: Two-sample z-test| Ch 29: More about tests |
25 | | 15 | May-01 | _RRR_ | _RRR_ | _RRR_ |
26 |
27 |
28 | - May-09: __Final Stat 131A__, 11:30-2:30pm in Birge 50
29 | - May-10: __Final Stat 20__, 3:00-6:00pm in VLSB 2050
30 |
31 | -----
32 |
33 | ## Slides and Scripts
34 |
35 | - Jan 18-20:
36 | + [Data and Variables](https://docs.google.com/presentation/d/1k0Ti3489qKExV-X9VzgOq0rCRk0EcjsEB800TDyvfG0/edit?usp=sharing)
37 | + In-class: [Getting started with R and RStudio](../scripts/01-R-introduction.pdf)
38 | + [Intro to R and RStudio](https://docs.google.com/presentation/d/1jtPoAMnT2-56REz-pFZQWSSSzFVHXOI069vrQCA0r6k/edit?usp=sharing) auxiliary slides
39 | + Practice: script about [data and variables](../scripts/02-data-variables.pdf)
40 | - Jan 23-27:
41 | + In-class: [Histograms](https://docs.google.com/presentation/d/1D_QNv8HPBRQGqy3ofiJDuLgOpB-awMwwpMchX9n0My4/edit?usp=sharing) slides
42 | + App: [ch03-histograms](../apps/ch03-histograms)
43 | + Practice: script about [Histograms in R](../scripts/03-histograms.pdf)
44 | + In-class: [Measures of Center (Average and Median)](https://docs.google.com/presentation/d/15jjBpSkQmYs99S8A2yvGGR4lwusUcJgBXZYU88158pE/edit?usp=sharing)
45 | + Practice: script about [Average and median in R](../scripts/04-measures-center.pdf)
46 | + In-class: [Measures of Spread (RMS, Standard Deviation)](https://docs.google.com/presentation/d/1olNOkShLZTBwEywn1AsuX92PvimntXoKMn7eRDh5MRE/edit?usp=sharing)
47 | + Practice: script about [measures of spread in R](../scripts/05-measures-spread.pdf)
48 | - Jan 30 - Feb 03:
49 | + In-class: [Normal Curve](https://docs.google.com/presentation/d/1_6ZEhuTCDvxesw6H99nJxnJz7shMIU9Hzq4GzWzw0dE/edit?usp=sharing) slides
50 | + Practice: script about [normal curve in R](../scripts/06-normal-curve.pdf)
51 | + In-class: [Scatter Diagrams and Correlation](https://docs.google.com/presentation/d/1qLtoiX8CrpHL70lZ8LBQN0F-xHuwEnhpVNZalaBnSM8/edit?usp=sharing) slides
52 | + App: [ch08-corr-coeff-diagrams](../apps/ch08-corr-coeff-diagrams)
53 | + Practice: script about [scatter diagrams in R](../scripts/07-scatter-diagrams.pdf)
54 | - Feb 06-10:
55 | + In-class: [More about Correlation](https://docs.google.com/presentation/d/1TNmvkcGnhIpZ3N-XLEJwuOcG9tDd6KbdIDzU4K6wivE/edit?usp=sharing) slides
56 | + In-class: [A bit of history about origins of regression](https://docs.google.com/presentation/d/1VBdCiJn_QmfeTsCzP29RlL4ldjripPdrSXkUSYfq0Rc/edit?usp=sharing) auxiliary slides
57 | + In-class: [Intro to Regression Method](https://docs.google.com/presentation/d/10eQJ3DxVVuC00mQ5aEBNb0nWZh8oX-vJ5mCJRQH39VA/edit?usp=sharing) slides
58 | + App: [ch10-heights-data](../apps/ch10-heights-data)
59 | + Practice: script about [Regression Line with R](../scripts/09-regression-line.pdf)
60 | - Feb 13-17
61 | + In-class: [R.M.S. Error for Regression](https://docs.google.com/presentation/d/1KSws7X-9jr1YWtJwPUmdnooodMqBMzRLjDWhsgq04Iw/edit?usp=sharing) slides
62 | + App: [ch11-regression-residuals](../apps/ch11-regression-residuals)
63 | + Practice: script about [Predictions and Errors in Regression with R](../scripts/10-prediction-and-errors-in-regression.pdf)
64 | + In-class: [Regression Line](https://docs.google.com/presentation/d/1bEV8MWCZ6xE2zm5egZXq5wcXOGOnHDJiJvj2tTGMhyI/edit?usp=sharing) slides
65 | + App: [ch11-regression-strips](../apps/ch11-regression-strips)
66 | - Feb 20-24
67 | + Regression Line
68 | + __Midterm 1__ Friday Feb-24
69 | - Feb 27 - Mar 03
70 | + In-class: [Probability Rules (part 1)](https://docs.google.com/presentation/d/1cgU096Vr5Ep30rXoQ68940YbbCM7wvpznsC623Zx5N0/edit?usp=sharing)
71 | + In-class: [Probability Rules (part 2)](https://docs.google.com/presentation/d/1C-bEAHd3naLPxk_WDSrMuWHd9kMdVVo7vh2x9lWaFvc/edit?usp=sharing)
72 | + In-class: [Binomial Formula](https://docs.google.com/presentation/d/1M6Xk1xwAmdewO1K5lVIAOXz45LcIfvrZOgzQs9EXc1c/edit?usp=sharing)
73 | + Practice: script about [binomial probability in R](../scripts/11-binomial-formula.pdf)
74 | - Mar 06-10
75 | + In-class: [Law of Averages](https://docs.google.com/presentation/d/1WDS0RyPXBjo0kgYSC5AIR33Vr78lKbOURXqJ2TMXvtI/edit?usp=sharing)
76 | + Practice: script about simulating basic [chance process with R](../scripts/12-chance-processes.pdf)
77 | + App: [ch16-chance-error](../apps/ch16-chance-error)
78 | + In-class: [Expected Value and Standard Error](https://docs.google.com/presentation/d/1QCSwf7zN80253dLYUAkZ3C4h01M6rLTFJ33h1tFD9To/edit?usp=sharing)
79 | + App: [ch17-demere-games](../apps/ch17-demere-games)
80 | + App: [ch17-expected-value-std-error](../apps/ch17-expected-value-std-error)
81 | - Mar 13-17
82 | + In-class: [Probability Histograms and Normal Approximation](https://docs.google.com/presentation/d/1AZ61AYdl1mmT3Uy1XebT8qpTbbR7uqiP0y_n740Vp8E/edit?usp=sharing)
83 | + App: [ch18-roll-dice-sum](../apps/ch18-roll-dice-sum)
84 | + App: [ch18-roll-dice-product](../apps/ch18-roll-dice-product)
85 | + App: [ch18-coin-tossing](../apps/ch18-coin-tossing)
86 | - Mar 20-24
87 | + In-class: [Sample Surveys](https://docs.google.com/presentation/d/1n-zZKPrpCoNqhf1hnDlUNVx-XL_qxdZgwngeKWMULiM/edit?usp=sharing)
88 | + In-class: [Sample Designs](https://docs.google.com/presentation/d/1KWmjAxrSNM7hRjWPh9veTLl8_FLneJozhA-6OUSLqK8/edit?usp=sharing)
89 | + In-class: [Chance Errors in Sampling](https://docs.google.com/presentation/d/1jRFpoepvu7RWwl6fsxPD7wFkdhZk83dlLwzb9SNMXSE/edit?usp=sharing)
90 | + App: [ch20-sampling-men](../apps/ch20-sampling-men)
91 | - Apr 03-07
92 | + Review
93 | + __Midterm 2__ Friday Apr-07
94 | - Apr 10-14
95 | + In-class: [Accuracy of Percentages](https://docs.google.com/presentation/d/1Ia5dA9BuEHUTX0dxLRJ9RervShAHmtqk8Si8hXPak-0/edit?usp=sharing)
96 | + App: [ch21-accuracy-percentages](../apps/ch21-accuracy-percentages)
97 | + In-class: [Accuracy of Averages](https://docs.google.com/presentation/d/1FnUXMu_5qYST5Stou895O_vUjdAULxeyxhvrGImAVEA/edit?usp=sharing)
98 | + App: [ch23-accuracy-averages](../apps/ch23-accuracy-averages)
99 | - Apr 17-21
100 | + In-class: [Hypothesis Tests](https://docs.google.com/presentation/d/1FQN-qh-plq87aB1d2vOUoi3YVLl6LE28uYUhXS5RFcI/edit?usp=sharing)
101 | + In-class: [One sample z-test](https://docs.google.com/presentation/d/1HhVMfQ0n8iebx527qscSFtj3wHQAqFk2xSSEnitu91g/edit?usp=sharing)
102 | + In-class: [One sample t-test](https://docs.google.com/presentation/d/1GTWOiwk4Gkeh_nXnKKK47hCcE1sWMGT-s9Q313VTUFM/edit?usp=sharing)
103 | - Apr 24-28
104 | + In-class: [Two sample z-test](https://docs.google.com/presentation/d/19PpdMovtJSbydDAc1Mv1wh3Mu5YWT0dFCh5aPcdE6dU/edit?usp=sharing)
105 | - May 01-05
106 | + RRR Week
107 | - May 07-12
108 | + __Final: Stat 131A__ Tue May-09
109 | + __Final: Stat 20__ Wed May-10
110 |
--------------------------------------------------------------------------------
/other/Karl-Pearson-and-the-origins-of-modern-statistics.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/Karl-Pearson-and-the-origins-of-modern-statistics.pdf
--------------------------------------------------------------------------------
/other/Quetelet-and-the-emergence-of-the-behavioral-sciences.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/Quetelet-and-the-emergence-of-the-behavioral-sciences.pdf
--------------------------------------------------------------------------------
/other/The-strange-science-of-Francis-Galton.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/The-strange-science-of-Francis-Galton.pdf
--------------------------------------------------------------------------------
/other/formula-sheet-final.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-final.pdf
--------------------------------------------------------------------------------
/other/formula-sheet-midterm1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-midterm1.pdf
--------------------------------------------------------------------------------
/other/formula-sheet-midterm2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/formula-sheet-midterm2.pdf
--------------------------------------------------------------------------------
/other/standard-normal-table.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/standard-normal-table.pdf
--------------------------------------------------------------------------------
/other/t-table.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/t-table.pdf
--------------------------------------------------------------------------------
/other/z-table.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/other/z-table.pdf
--------------------------------------------------------------------------------
/scripts/01-R-introduction.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Getting started with R"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | header-includes: \usepackage{float}
6 | output: html_document
7 | urlcolor: blue
8 | ---
9 |
10 | > ### Learning Objectives
11 | >
12 | > - Complete installation of R and RStudio
13 | > - Get started with R as a scientific calculator
14 | > - First steps using RStudio
15 | > - Getting help in R
16 | > - Installing packages
17 | > - Using R script files
18 | > - Using Rmd files
19 | > - Get to know markdown syntax
20 |
21 | ------
22 |
23 | ## R and RStudio
24 |
25 | - Install __R__
26 | - R for Mac: [https://cran.r-project.org/bin/macosx/](https://cran.r-project.org/bin/macosx/)
27 | - R for windows: [https://cran.r-project.org/bin/windows/base/](https://cran.r-project.org/bin/windows/base/)
28 | - Install __RStudio__
29 | - RStudio download (desktop version): [https://www.rstudio.com/products/rstudio/download/](https://www.rstudio.com/products/rstudio/download/)
30 |
31 |
32 | ### Difference between R-GUI and RStudio
33 |
34 | The default installation of R comes with R-GUI, which is a simple graphical
35 | user interface. In contrast, RStudio is an _Integrated Development Environment_
36 | (IDE). This means that RStudio is much more than a simple GUI, providing a nice
37 | working environment and development framework. In this course, you will use R
38 | mainly for doing computations and plots, not really for programming purposes.
39 | And you are going to interact with R via RStudio, using the so-called __Rmd__
40 | files.
41 |
42 | -----
43 |
44 | ## R as a scientific calculator
45 |
46 | Open RStudio and locate the _console_ (or prompt) pane.
47 | Let's start typing basic things in the console, using R as a scientific calculator:
48 |
49 | ```r
50 | # addition
51 | 1 + 1
52 | 2 + 3
53 |
54 | # subtraction
55 | 4 - 2
56 | 5 - 7
57 |
58 | # multiplication
59 | 10 * 0
60 | 7 * 7
61 |
62 | # division
63 | 9 / 3
64 | 1 / 2
65 |
66 | # power
67 | 2 ^ 2
68 | 3 ^ 3
69 | ```
70 |
71 |
72 | ### Functions
73 |
74 | R has many functions. To use a function, type its name followed by parentheses.
75 | Inside the parentheses you pass an input. Most functions will produce some
76 | type of output:
77 |
78 | ```r
79 | # absolute value
80 | abs(10)
81 | abs(-4)
82 |
83 | # square root
84 | sqrt(9)
85 |
86 | # natural logarithm
87 | log(2)
88 | ```
89 |
90 |
91 | ### Comments in R
92 |
93 | All programming languages use a set of characters to indicate that a
94 | specific part or lines of code are __comments__, that is, things that are
95 | not to be executed. R uses the hash or pound symbol `#` to specify comments.
96 | Any code to the right of `#` will not be executed by R.
97 |
98 | ```r
99 | # this is a comment
100 | # this is another comment
101 | 2 * 9
102 |
103 | 4 + 5 # you can place comments like this
104 | ```
105 |
106 |
107 | ### Variables and Assignment
108 |
109 | R is more powerful than a calculator: you can do many more things with it
110 | than with practically any scientific calculator. One of the things you will be
111 | doing a lot in R is creating variables or objects to store values.
112 |
113 | For instance, you can create a variable `x` and give it the value of 1.
114 | This is done using what is known as the __assignment operator__ `<-`,
115 | also known in R as the _arrow_ operator:
116 |
117 | ```r
118 | x <- 1
119 | x
120 | ```
121 |
122 | This is a way to tell R: "create an object `x` and store in it the number 1".
123 | Alternatively, you can use the equals sign `=` as an assignment operator:
124 |
125 | ```r
126 | y = 2
127 | y
128 | ```
129 |
130 | With variables, you can perform the usual algebraic operations (addition, subtraction, multiplication, division, power, etc.):
131 |
132 | ```r
133 | x + y
134 | x - y
135 | x * y
136 | x / y
137 | x ^ y
138 | ```
139 |
140 |
141 | ### Case Sensitive
142 |
143 | R is case sensitive. This means that `abs()` is not the same
144 | as `Abs()` or `ABS()`. Only `abs()` is a valid function name.
145 |
146 | ```r
147 | # case sensitive
148 | x = 1
149 | X = 2
150 | x + x
151 | x + X
152 | X + X
153 | ```
154 |
155 |
156 | ### Some Examples
157 |
158 | Here are some examples that illustrate how to use R to define
159 | variables and perform basic calculations:
160 |
161 | ```r
162 | # convert Fahrenheit degrees to Celsius degrees
163 | fahrenheit = 50
164 | celsius = (fahrenheit - 32) * (5/9)
165 | celsius
166 |
167 |
168 | # compute the area of a rectangle
169 | rec_length = 10
170 | rec_height = 5
171 | rec_area = rec_length * rec_height
172 | rec_area
173 |
174 |
175 | # degrees to radians
176 | deg = 90
177 | rad = (deg * pi) / 180
178 | rad
179 | ```
180 |
181 | -----
182 |
183 | ## More about RStudio
184 |
185 | You will be working with RStudio a lot, and you will have time to learn
186 | many of the bells and whistles RStudio provides. Think about RStudio as
187 | your "workbench". Keep in mind that RStudio is NOT R. RStudio is an environment
188 | that makes it easier to work with R, while taking care of the little tasks that
189 | can be a hassle.
190 |
191 |
192 | ### A quick tour of RStudio
193 |
194 | - Understand the __pane layout__ (i.e. windows) of RStudio
195 | - Source
196 | - Console
197 | - Environment, History, etc
198 | - Files, Plots, Packages, Help, Viewer
199 | - Customize RStudio Appearance of source pane
200 | - font
201 | - size
202 | - background
203 |
204 |
205 | ### Using an R script file
206 |
207 | Most of the time you won't be working directly on the console.
208 | Instead, you will be typing your commands in some _source_ file.
209 | The basic type of source files are known as _R script files_.
210 | Open a new script file in the _source_ pane, and rewrite the
211 | previous commands.
212 |
213 | You can copy the commands in your source file and paste them in the
214 | console. But that's not very efficient. Find out how to run (execute)
215 | the commands (in your source file) and pass them to the console pane.
216 |
217 |
218 | ### Getting help
219 |
220 | Because we work with functions all the time, it's important to know certain
221 | details about how to use them: what input(s) they require, and what
222 | output they return.
223 |
224 | There are several ways to get help.
225 |
226 | If you know the name of a function you are interested in learning more about,
227 | you can use the function `help()` and pass it the name of the function you
228 | are looking for:
229 |
230 | ```r
231 | # documentation about the 'abs' function
232 | help(abs)
233 |
234 | # documentation about the 'mean' function
235 | help(mean)
236 | ```
237 |
238 | Alternatively, you can use a shortcut using the question mark `?` followed
239 | by the name of the function:
240 |
241 | ```r
242 | # documentation about the 'abs' function
243 | ?abs
244 |
245 | # documentation about the 'mean' function
246 | ?mean
247 | ```
248 |
249 | - How to read the manual documentation:
250 | - Title
251 | - Description
252 | - Usage of function
253 | - Arguments
254 | - Details
255 | - See Also
256 | - Examples!!!
257 |
258 | `help()` only works if you know the name of the function you are looking for.
259 | Sometimes, however, you don't know the name but you may know some keywords.
260 | To look for related functions associated with a keyword, use `help.search()` or
261 | simply `??`
262 |
263 | ```r
264 | # search for 'absolute'
265 | help.search("absolute")
266 |
267 | # alternatively you can also search like this:
268 | ??absolute
269 | ```
270 |
271 | Notice the use of quotes surrounding the input name inside `help.search()`.
272 |
273 |
274 | ### Installing Packages
275 |
276 | R comes with a large set of functions and packages. A package is a collection
277 | of functions that have been designed for a specific purpose. One of the great
278 | advantages of R is that many analysts, scientists, programmers, and users
279 | can create their own packages and make them available for everybody to use.
280 | R packages can be shared in different ways. The most common way to share a
281 | package is to submit it to what is known as __CRAN__, the
282 | _Comprehensive R Archive Network_.
283 |
284 | You can install a package using the `install.packages()` function.
285 | Just give it the name of a package, surrounded by quotes, and R will look for
286 | it in CRAN, and if it finds it, R will download it to your computer.
287 |
288 | ```r
289 | # installing
290 | install.packages("knitr")
291 | ```
292 |
293 | You can also install a bunch of packages at once:
294 |
295 | ```r
296 | install.packages(c("readr", "ggplot2"))
297 | ```
298 |
299 | The installation of a package needs to be done only once.
300 | After a package has been installed, you can start using its functions
301 | by _loading_ the package with the function `library()`
302 |
303 | ```r
304 | library(knitr)
305 | ```
306 |
307 |
308 | ### Your turn
309 |
310 | - Install packages `"stringr"`, `"RColorBrewer"`
311 | - Calculate: $3x^2 + 4x + 8$ when $x = 2$
312 | - Look for the manual (i.e. help) documentation of the function `exp`
313 | - Find out how to look for information about binary operators
314 | like `+` or `^`
315 | - There are several tabs in the pane `Files, Plots, Packages, Help, Viewer`.
316 |   Find out what the tab __Files__ is good for.
317 |
318 | -----
319 |
320 | ## Introduction to Rmd files
321 |
322 | Besides using R script files to write source code, you will be using another
323 | type of source file known as _R markdown_ files, or simply `Rmd` files.
324 | These files use a special syntax called
325 | [markdown](https://en.wikipedia.org/wiki/Markdown).
326 |
327 |
328 | ### Get to know the `Rmd` files
329 |
330 | In the menu bar of RStudio, click on __File__, then __New File__,
331 | and choose __R Markdown__. Select the default option "Document" (HTML output),
332 | and click __Ok__.
333 |
334 | __Rmd__ files are a special type of file, referred to as a _dynamic document_,
335 | that allows you to combine narrative (text) with R code. It is extremely
336 | important that you quickly become familiar with this resource. One reason is
337 | that you can use Rmd files to write your homework assignments and convert them
338 | to HTML, Word, or PDF files.
339 |
340 | Locate the button __Knit__ (the one with a knitting icon) and click on it
341 | so you can see how `Rmd` files are rendered and displayed as HTML documents.
342 |
343 |
344 | ### Yet Another Syntax to Learn
345 |
346 | R markdown (`Rmd`) files use [markdown](https://daringfireball.net/projects/markdown/)
347 | as the main syntax to write content that is not R code. Markdown is a very
348 | lightweight type of markup language, and it is relatively easy to learn.
349 |
350 |
351 | ### Your turn
352 |
353 | If you are new to Markdown, please take a look at the following tutorials:
354 |
355 | - [www.markdown-tutorial.com](http://www.markdown-tutorial.com)
356 | - [www.markdowntutorial.com](http://www.markdowntutorial.com)
357 |
358 | -----
359 |
360 | ### Rmd basics
361 |
362 | - YAML header:
363 | - title
364 | - author
365 | - date
366 | - output: `html_document`, `word_document`, `pdf_document`
367 | - Code Chunks:
368 | - syntax
369 | - chunk options
370 | - graphics
371 | - Math notation:
372 | - inline `$z^2 = x^2 + y^2$`
373 | - paragraph `$$z^2 = x^2 + y^2$$`
374 |
375 | Example of inline equation: $z^2 = x^2 + y^2$
376 |
377 | Example of equation in its own paragraph:
378 | $$
379 | z^2 = x^2 + y^2
380 | $$
381 |
382 | RStudio has a basic tutorial about R Markdown files:
383 | [Rstudio markdown tutorial](http://rmarkdown.rstudio.com/)
384 |
385 | Rmd files are able to render math symbols and expressions written using LaTeX
386 | notation. There are dozens of online resources to learn about math notation and
387 | equations in LaTeX. Here's some documentation from [www.sharelatex.com/learn](https://www.sharelatex.com/learn/)
388 |
389 | - [Mathematical expressions](https://www.sharelatex.com/learn/Mathematical_expressions)
390 | - [Subscripts and superscripts](https://www.sharelatex.com/learn/Subscripts_and_superscripts)
391 | - [Brackets and Parentheses](https://www.sharelatex.com/learn/Brackets_and_Parentheses)
392 | - [Fractions and Binomials](https://www.sharelatex.com/learn/Fractions_and_Binomials)
393 | - [Integrals, sums and limits](https://www.sharelatex.com/learn/Integrals,_sums_and_limits)
394 | - [List of Greek letters and math symbols](https://www.sharelatex.com/learn/List_of_Greek_letters_and_math_symbols)
395 | - [Operators](https://www.sharelatex.com/learn/Operators)
396 |
--------------------------------------------------------------------------------
/scripts/01-R-introduction.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/01-R-introduction.pdf
--------------------------------------------------------------------------------
/scripts/02-data-variables.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Data and Variables in R"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | header-includes: \usepackage{float}
6 | output: html_document
7 | urlcolor: blue
8 | ---
9 |
10 | > ### Learning Objectives
11 | >
12 | > - Basics of vectors
13 | > - Variables (as vectors and factors)
14 | > - Quantitative variables as numeric vectors
15 | > - Qualitative variables (as factors)
16 | > - Manipulating vectors
17 |
18 |
19 | ```{r setup, include=FALSE}
20 | knitr::opts_chunk$set(echo = TRUE)
21 | ```
22 |
23 | ## NBA Data
24 |
25 | In this Rmd script we'll consider some NBA data from the website
26 | _Basketball Reference_. More specifically, let's look at the Western Conference
27 | Standings (season 2015-2016) shown in the following screenshot:
28 |
29 | ```{r out.width='60%', echo = FALSE, fig.align='center'}
30 | knitr::include_graphics('images/western-conference-standings-2016.png')
31 | ```
32 |
33 | source: [http://www.basketball-reference.com/leagues/NBA_2016.html#all_confs_standings_E](http://www.basketball-reference.com/leagues/NBA_2016.html#all_confs_standings_E)
34 |
35 | The above table contains 15 rows with 8 columns. The first column contains the
36 | names of the teams in the Western Conference, and the rest of the columns are:
37 |
38 | - _W_: wins
39 | - _L_: losses
40 | - _W/L%_: win-loss percentage
41 | - _GB_: games behind (the top team)
42 | - _PS/G_: points per game
43 | - _PA/G_: opponent points per game
44 | - _SRS_: simple rating system
45 |
46 | From the statistical standpoint, we say that the table has 8 variables measured
47 | (or observed) on 15 individuals. In this case the "individuals" are the basketball
48 | teams.
49 |
50 |
51 | ## Basics of vectors
52 |
53 | In order to use R as the computational tool in this course, you need to learn
54 | how to input data. Before describing how to read in tables in R (we'll cover
55 | that later), we must talk about vectors.
56 |
57 | R vectors are the most basic structure to store data in R. Virtually all other
58 | data structures in R are based on or derived from vectors. Using a vector is also
59 | the most basic way to manually input data.
60 |
61 | You can create vectors in several ways. The most common option is with the
62 | function `c()` (combine). Simply pass a series of values separated by commas.
63 | Here is how to create a vector `wins` with the first five values from the column
64 | _W_ of the conference standings table:
65 |
66 | ```{r}
67 | wins = c(73, 67, 55, 53, 44)
68 | ```
69 |
70 | Likewise, we can create a vector `losses` like this:
71 |
72 | ```{r}
73 | losses = c(9, 15, 27, 29, 38)
74 | ```
75 |
76 | Having the vectors `wins` and `losses`, we can use them to create another
77 | vector `win_loss_perc` for the column _W/L%_ (win-loss percentage):
78 |
79 | ```{r}
80 | win_loss_perc = wins / (wins + losses)
81 | win_loss_perc
82 | ```
83 |
84 | You can think of vectors as variables. The previous vectors `wins`, `losses`,
85 | and `win_loss_perc` are what are known as __quantitative__ variables. This
86 | means that each value in those variables (the numbers) reflects a quantity.
87 |
88 | Not all variables are quantitative. For instance, the first column of the table
89 | does not contain numbers but names. The name of a basketball team is referred
90 | to as a __qualitative__ variable.
91 |
92 | In R you can create a vector of names using a character vector. Again, we use
93 | the `c()` function and we pass it names surrounded by either single or double
94 | quotes. Here's how to create a vector `teams` with the names of the first five
95 | teams in the standings table:
96 |
97 | ```{r}
98 | teams = c('GSW', 'SAS', 'OKC', 'LAC', 'POR')
99 | ```
100 |
101 | The vector `teams` is referred to in R as a __character vector__ because it
102 | is formed by characters.
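A quick way to confirm what type of vector you have is the built-in function `class()`. This is just an illustrative sketch with throwaway vectors (the two team names are placeholders, not the full standings):

```r
# class() reports the type of a vector
wins = c(73, 67, 55, 53, 44)
names_vec = c('GSW', 'SAS')

class(wins)        # "numeric"
class(names_vec)   # "character"
```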
103 |
104 |
105 | ## Manipulating Vectors: Subsetting
106 |
107 | In addition to creating variables, you should also learn how to do some basic
108 | manipulation of vectors. The most common type of manipulation is called
109 | _subsetting_, which refers to extracting elements of a vector (or another R object).
110 | To do so, you use what is known as __bracket notation__. This implies using
111 | (square) brackets `[ ]` to get access to the elements of a vector. Inside the
112 | brackets you can specify one or more numeric values that correspond to the
113 | position(s) of the vector element(s):
114 |
115 | ```r
116 | # first element of 'wins'
117 | wins[1]
118 |
119 | # third element of 'losses'
120 | losses[3]
121 |
122 | # last element of teams
123 | teams[5]
124 | ```
125 |
126 | Some common functions that you can use on vectors are:
127 |
128 | - `length()` gives the number of values
129 | - `sort()` sorts the values in increasing or decreasing ways
130 | - `rev()` reverses the values
131 |
132 | ```r
133 | length(teams)
134 | teams[length(teams)]
135 | sort(wins, decreasing = TRUE)
136 | rev(wins)
137 | ```
138 |
139 |
140 |
141 | ### Subsetting with Logical Indices
142 |
143 | In addition to using numbers inside the brackets, you can also do
144 | _logical subsetting_. This type of subsetting involves using a __logical__
145 | vector inside the brackets. A logical vector is a particular type of vector
146 | that takes the special values `TRUE` and `FALSE`, as well as `NA`
147 | (Not Available).
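The usual way to produce a logical vector is with a comparison operator such as `>`, `<`, `==`, or `!=`. As a small sketch, using the `losses` vector from earlier:

```r
losses = c(9, 15, 27, 29, 38)

# a comparison returns one TRUE/FALSE per element
losses > 20
# FALSE FALSE  TRUE  TRUE  TRUE
```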
148 |
149 | This type of subsetting is very powerful because it allows you to
150 | extract elements based on some logical condition.
151 | Here are some examples of logical subsetting:
152 |
153 | ```r
154 | # wins of Golden State Warriors
155 | wins[teams == 'GSW']
156 |
157 | # teams with wins > 40
158 | teams[wins > 40]
159 |
160 | # name of teams with losses between 10 and 29
161 | teams[losses >= 10 & losses <= 29]
162 | ```
163 |
164 |
165 | ## Factors and Qualitative Variables
166 |
167 | As mentioned before, vectors are the most essential type of data structure
168 | in R. Related to vectors, there is another important data structure in R called
169 | __factor__. Factors are data structures exclusively designed to handle
170 | qualitative or categorical data.
171 |
172 | The term _factor_, as used in R for handling categorical variables, comes from
173 | the terminology used in _Analysis of Variance_, commonly referred to as ANOVA.
174 | In this statistical method, a categorical variable is commonly referred to as
175 | _factor_ and its categories are known as _levels_.
176 |
177 | To create a factor you use the function `factor()`, which takes a
178 | vector as input. The vector can be either numeric, character or logical.
179 |
180 | ```{r}
181 | # numeric vector
182 | num_vector <- c(1, 2, 3, 1, 2, 3, 2)
183 |
184 | # creating a factor from num_vector
185 | first_factor <- factor(num_vector)
186 |
187 | first_factor
188 | ```
189 |
190 | You can take the `teams` vector and convert it as a factor:
191 |
192 | ```{r}
193 | teams = factor(teams)
194 | teams
195 | ```
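Two handy functions for inspecting a factor are `levels()`, which lists the distinct categories, and `table()`, which counts how many observations fall in each level. Here is a sketch with a made-up factor:

```r
# a factor with repeated categories
colors = factor(c('red', 'blue', 'red', 'green', 'red'))

levels(colors)   # "blue" "green" "red"
table(colors)    # blue: 1, green: 1, red: 3
```

Note that the levels are sorted alphabetically by default.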
196 |
197 |
198 |
199 | ## Sequences
200 |
201 | It is very common to generate sequences of numbers. For that R provides:
202 |
203 | - the colon operator `":"`
204 | - sequence function `seq()`
205 |
206 | ```r
207 | # colon operator
208 | 1:5
209 | 1:10
210 | -3:7
211 | 10:1
212 | ```
213 |
214 | ```r
215 | # sequence function
216 | seq(from = 1, to = 10)
217 | seq(from = 1, to = 10, by = 1)
218 | seq(from = 1, to = 10, by = 2)
219 | seq(from = -5, to = 5, by = 1)
220 | ```
221 |
222 |
223 | ### Repeated Vectors
224 |
225 | There is a function `rep()`. It takes a vector as the main input, and then it
226 | optionally takes various arguments: `times`, `length.out`, and `each`.
227 |
228 | ```{r}
229 | rep(1, times = 5) # repeat 1 five times
230 | rep(c(1, 2), times = 3)    # repeat the pair 1 2 three times
231 | rep(c(1, 2), each = 2)
232 | rep(c(1, 2), length.out = 5)
233 | ```
234 |
235 | Here are some more complex examples:
236 |
237 | ```r
238 | rep(c(3, 2, 1), times = 3, each = 2)  # 'each' is applied first, then 'times'
239 | ```
240 |
241 |
242 | ## From vectors to data frames
243 |
244 | Now that we've seen how to create some vectors and do some basic manipulation,
245 | we can describe how to combine them in a table in R. The standard tabular
246 | structure in R is a __data frame__. To manually create a data frame you use
247 | the function `data.frame()` and you pass it one or more vectors. Here's how
248 | to create a small data frame `dat` with the vectors `teams`, `wins`, `losses`,
249 | and `win_loss_perc`:
250 |
251 | ```{r}
252 | dat = data.frame(
253 | Teams = teams,
254 | Wins = wins,
255 | Losses = losses,
256 | WLperc = win_loss_perc
257 | )
258 |
259 | dat
260 | ```
261 |
262 | Manipulating data frames is more complex than manipulating vectors. However,
263 | manipulating the column of a data frame is essentially the same as manipulating
264 | a vector.
265 |
266 | There are a couple of ways to "select" a column of a data frame. One option
267 | consists of using the dollar `$` operator. This involves typing the name of
268 | the data frame, followed by the `$`, followed by the name of the column.
269 | For instance, to extract the values in column `Teams` simply type:
270 |
271 | ```{r}
272 | dat$Teams
273 | ```
274 |
275 | Moreover, you can use bracket notation on the extracted column like with any
276 | type of vector:
277 |
278 | ```{r}
279 | dat$Wins[1]
280 | dat$Wins[5]
281 | ```
282 |
283 | Likewise, you can do logical subsetting:
284 |
285 | ```r
286 | # wins of Golden State Warriors
287 | dat$Wins[dat$Teams == 'GSW']
288 |
289 | # teams with wins > 40
290 | dat$Teams[dat$Wins > 40]
291 |
292 | # name of teams with losses between 10 and 29
293 | dat$Teams[dat$Losses >= 10 & dat$Losses <= 29]
294 | ```
295 |
296 |
297 | ## Your Turn
298 |
299 | Refer to the table of Western Conference Standings shown at the beginning of
300 | this document. Your mission consists of creating a data frame `standings`.
301 | In order to create such a data frame, you will first have to create the following
302 | eight vectors:
303 |
304 | - `teams`
305 | - `wins`
306 | - `losses`
307 | - `win_loss_perc`
308 | - `games_behind`
309 | - `points_scored`
310 | - `points_against`
311 | - `rating`
312 |
313 | You can create the vector `games_behind` by taking the games won by the Golden
314 | State Warriors and subtracting the wins of each of the teams, that is:
315 |
316 | ```r
317 | wins[1] - wins
318 | ```
319 |
320 | Once you have the previous listed vectors, use the function `data.frame()`
321 | to build `standings`.
322 |
323 | Select the _Points Scored_ from the table `standings` and sort it both in
324 | increasing as well as decreasing order.
325 |
326 |
--------------------------------------------------------------------------------
/scripts/02-data-variables.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/02-data-variables.pdf
--------------------------------------------------------------------------------
/scripts/03-histograms.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/03-histograms.pdf
--------------------------------------------------------------------------------
/scripts/04-measures-center.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Measures of Center"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | output: html_document
6 | urlcolor: blue
7 | ---
8 |
9 | > ### Learning Objectives
10 | >
11 | > - Compute the average
12 | > - Become familiar with the function `mean()`
13 | > - Interpret the average as the balancing point
14 |
15 | ```{r setup, include=FALSE}
16 | knitr::opts_chunk$set(echo = TRUE)
17 | ```
18 |
19 | As we mentioned in the previous script, the first part of the course has to do
20 | with __Descriptive Statistics__. The main idea is to make a "large" or
21 | "complicated" dataset more compact and easier to understand by using three
22 | major tools:
23 |
24 | - summary and frequency tables
25 | - charts and graphics
26 | - key numeric summaries
27 |
28 | In this script we will focus on various numeric summaries that are typically
29 | used to condense information of quantitative variables.
30 |
31 | One common way to classify numeric summaries is in 1) measures of center, and
32 | 2) measures of spread or variability.
33 | The idea of both types of measures is to obtain one or more numeric values that
34 | reflect a "central" value, and the amount of "spread".
35 |
36 | - Measures of Center
37 | + average or mean
38 | + median
39 | - Measures of Spread
40 | + range
41 | + interquartile range
42 | + standard deviation (and variance)
43 |
44 |
45 |
46 | ## The Average
47 |
48 | Perhaps the most common type of measure of center is the average or mean.
49 | Consider a list of numbers formed by: 0, 1, 2, 3, 5, and 7. The average is
50 | calculated as the sum of all values divided by the number of values:
51 |
52 | $$
53 | average = \frac{0 + 1 + 2 + 3 + 5 + 7}{6} = 3
54 | $$
55 |
56 | You can use R to compute the previous average:
57 |
58 | ```{r}
59 | (0 + 1 + 2 + 3 + 5 + 7) / 6
60 | ```
61 |
62 | Algebraically, we typically denote a set of values by $x_1, x_2, \dots, x_n$,
63 | where $n$ represents the total number of values. Using this
64 | notation, the formula of the average is expressed as:
65 |
66 | $$
67 | average = \frac{x_1 + x_2 + \dots + x_n}{n}
68 | $$
69 |
70 | Using summation notation, the average can be compactly expressed as:
71 |
72 | $$
73 | average = \sum_{i=1}^{n} \frac{x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
74 | $$
75 |
76 | Summation notation uses the uppercase Greek letter $\Sigma$ (sigma) as an
77 | abbreviation for the phrase "the sum of". So, in place of
78 | $x_1 + x_2 + \dots + x_n$, we can write $\sum_{i=1}^{n} x_i$, read as "the sum
79 | of the observations of the variable $x$."
80 |
81 | In R, you can create a vector `x` to store the previous numbers:
82 |
83 | ```{r}
84 | x = c(0, 1, 2, 3, 5, 7)
85 | ```
86 |
87 | Then, you can use the function `sum()` to add all the values in `x`, and
88 | compute the average as:
89 |
90 | ```{r}
91 | sum(x) / length(x)
92 | ```
93 |
94 | An alternative way to compute the average in R is using the `mean()` function:
95 |
96 | ```{r}
97 | mean(x)
98 | ```
99 |
100 |
101 | ### The Average is the balancing point
102 |
103 | Usually, the average of a set of $n$ values $x_1, x_2, \dots, x_n$ is expressed as
104 | $\bar{x}$ (pronounced _x-bar_).
105 |
106 | To understand how the average is a type of central or mid-value, we need to
107 | talk about __deviations__. A deviation is the difference between an observed
108 | value $x_i$ and another value of reference $ref$, that is, $(x_i - ref)$.
109 |
110 | Taking the average value $\bar{x}$ as a reference value, we can calculate the
111 | deviations of all observations from the average: $(x_i - \bar{x})$
112 |
113 | Given a reference value $ref$, we can also compute the sum of all deviations
114 | around such value:
115 |
116 | $$
117 | \sum_{i=1}^{n} (x_i - ref)
118 | $$
119 |
120 | It turns out that the average is the ONLY reference value such that the sum
121 | of deviations around it becomes zero:
122 |
123 | $$
124 | \sum_{i=1}^{n} (x_i - \bar{x}) = 0
125 | $$
126 |
127 | Let's verify this in R:
128 |
129 | ```{r}
130 | avg = mean(x)
131 | deviations = x - avg
132 | deviations
133 | ```
134 |
135 | The sum of the deviations around the mean should be zero:
136 |
137 | ```{r}
138 | sum(deviations)
139 | ```
140 |
141 | This is the reason why we say that the average is one type of center or mid-value.
142 | In simpler terms, you can think of the average as the balance point of a
143 | distribution. The average is that point that cancels out the sum of deviations
144 | around it.
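To see that this property is special to the average, try a different reference value, such as the median; the deviations no longer cancel out:

```{r}
x = c(0, 1, 2, 3, 5, 7)

# deviations around the median do NOT sum to zero
# (the result is 3, not 0)
sum(x - median(x))
```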
145 |
146 |
147 |
148 | ### Your turn
149 |
150 | We know that the average of `x` is `r mean(x)`. What happens to this average if:
151 |
152 | - you add a constant $b$ to all values in `x`?
153 | - you multiply the values in `x` times a constant $a$?
154 |
155 | For instance, let's add 2 to all values in `x`:
156 |
157 | ```{r}
158 | mean(x + 2)
159 | ```
160 |
161 | Now, let's multiply by 2 all values in `x`:
162 |
163 | ```{r}
164 | mean(x * 2)
165 | ```
166 |
167 | Spend some time in R to examine what happens to the average of $x + k$ and
168 | $k \times x$ with several choices of $k$, e.g. -2, 5, 100.
169 |
170 | Now, let's see what happens to the average when you multiply all values in
171 | `x` by some constant $a$ and then add a constant $b$:
172 |
173 | ```{r}
174 | mean(x)
175 | a = 2
176 | b = 3
177 | mean(a*x + b)
178 | ```
179 |
180 | Again, spend some time in R trying different values for `a` and `b`.
181 | What's your conclusion?
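In case you want to check your conclusion against the algebra, the result follows directly from the definition of the average:

$$
\frac{1}{n} \sum_{i=1}^{n} (a x_i + b) = a \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) + b = a \bar{x} + b
$$

For instance, with $a = 2$ and $b = 3$, the average of `a*x + b` is $2 \bar{x} + 3$.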
182 |
183 |
184 | ## The Median
185 |
186 | Another common type of measure of center is the __median__. The median is the
187 | literal middle value of an ordered distribution. By _middle value_ we mean that
188 | half of observations are below the median, and the other half of observations
189 | are above it.
190 |
191 | The easiest way to calculate the median in R is with the function of the
192 | same name, `median()`. Consider again the numbers in the vector `x`; the median of
193 | this set of values is:
194 |
195 | ```{r}
196 | x = c(0, 1, 2, 3, 5, 7)
197 |
198 | median(x)
199 | ```
200 |
201 |
202 | How the median is computed depends on whether the number of values is even
203 | or odd. If you have a variable with an even number of values, then the median
204 | is the average of the two middle values. If you have a variable with an odd
205 | number of values, then the median is the middle value.
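You can verify both cases in R:

```{r}
# even number of values: average of the two middle values (2 and 3)
median(c(0, 1, 2, 3, 5, 7))

# odd number of values: the middle value
median(c(0, 1, 2, 3, 5))
```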
206 |
207 |
208 |
209 | ## More numeric summaries
210 |
211 | Another interesting function in R that you can use to obtain descriptive
212 | information about a variable is `summary()`. When you use this function on a
213 | numeric vector (i.e. quantitative variable), the returned output includes:
214 |
215 | - `Min.`: minimum
216 | - `1st Qu.`: first quartile
217 | - `Median`: median
218 | - `Mean`: average
219 | - `3rd Qu.`: third quartile
220 | - `Max.`: maximum
221 |
222 | ```{r}
223 | x = c(0, 1, 2, 3, 5, 7)
224 |
225 | summary(x)
226 | ```
227 |
228 |
229 | ## Average -vs- Median
230 |
231 | Consider a new vector `x` that contains 25 numbers: five 1's, five 2's, five 3's,
232 | five 4's, and five 5's:
233 |
234 | ```{r}
235 | x = rep(1:5, each = 5)
236 | ```
237 |
238 | As you can tell, all values in `x` occur with the same frequency. And if you
239 | get a histogram, R will plot all bars with the same height:
240 |
241 | ```{r out.width='50%', fig.align='center'}
242 | hist(x, breaks = c(0, 1, 2, 3, 4, 5), las = 1, col = 'gray80')
243 | ```
244 |
245 | In this data, the average and the median are the same. In fact, this happens
246 | whenever you have a perfectly symmetric distribution:
247 |
248 | ```{r}
249 | mean(x)
250 | median(x)
251 | ```
252 |
253 |
254 | Now let's add one more observation to `x` with a value of 10, and obtain the
255 | average and the median:
256 |
257 | ```{r}
258 | y = c(x, 10)
259 | mean(y)
260 | median(y)
261 | ```
262 |
263 | Note that the average increased from `r mean(x)` to `r round(mean(y), 2)`,
264 | while the median remained unchanged.
265 |
266 | Let's make it more extreme: instead of adding a value of 10, let's add a
267 | value of 100 to `x`. The average and the median are:
268 |
269 | ```{r}
270 | z = c(x, 100)
271 | mean(z)
272 | median(z)
273 | ```
274 |
275 | You can look at the distributions of `x`, `y`, and `z` using the default plots
276 | produced by `hist()`:
277 |
278 | ```{r out.width='90%', fig.height=3, echo = FALSE, fig.align='center'}
279 | op = par(mfrow = c(1, 3))
280 | hist(x, las = 1, col = 'gray80')
281 | hist(y, las = 1, col = 'gray80')
282 | hist(z, las = 1, col = 'gray80')
283 | par(op)
284 | ```
285 |
286 | This is a toy example that illustrates one difference between the median and
287 | the average: the median is more resistant (or robust) to extreme values,
288 | while the average is not. Very small and very large values pull the average toward them.
289 |
290 |
291 | ### Example
292 |
293 | Here's one more example that shows you how to use R to solve a typical textbook
294 | exercise. The average and median of the first 99 values of a data set of 198
295 | values are all equal to 120. If the average and median of the final 99 values
296 | are all equal to 100, what can you say about the average of the entire data
297 | set? What can you say about the median?
298 |
299 | You can solve this type of question analytically, or you can use R. Here's how.
300 | A simple way to model the problem is with a data set of 198 values formed by
301 | two sets of numbers: the first 99 values are all equal to 120, and the final 99 values
302 | are all equal to 100. You can create two R vectors to build the two sets of
303 | 99 values. This is achieved with the function `rep()` that allows
304 | you to __repeat__ one or more numeric values given a number of times:
305 |
306 | ```{r}
307 | # first 99 values equal to 120
308 | first_values = rep(120, times = 99)
309 |
310 | # final 99 values equal to 100
311 | final_values = rep(100, times = 99)
312 |
313 | # all values
314 | all_values = c(first_values, final_values)
315 | ```
316 |
317 | Having defined `first_values` and `final_values`, we build the entire list of
318 | 198 values by combining them in the vector `all_values`. The next step involves
319 | finding the average and the median:
320 |
321 | ```{r}
322 | # average
323 | mean(all_values)
324 |
325 | # median
326 | median(all_values)
327 | ```
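For the modeled data, you can also verify both answers analytically. The average of the entire data set is the (equally weighted) combination of the two group averages:

$$
average = \frac{99 \times 120 + 99 \times 100}{198} = \frac{120 + 100}{2} = 110
$$

For the median, the 99th and 100th values of the ordered data are 100 and 120, so the median is also $\frac{100 + 120}{2} = 110$.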
328 |
329 |
330 |
331 |
--------------------------------------------------------------------------------
/scripts/04-measures-center.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/04-measures-center.pdf
--------------------------------------------------------------------------------
/scripts/05-measures-spread.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Measures of Spread"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | output: html_document
6 | urlcolor: blue
7 | ---
8 |
9 | > ### Learning Objectives
10 | >
11 | > - Becoming familiar with various measures of spread
12 | > - Intro to the functions `range()`, `IQR()`, and `sd()`
13 | > - Understand the concept of r.m.s. size of a list of numbers
14 | > - Be aware of the difference between SD and SD+
15 |
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = TRUE)
19 | ```
20 |
21 | ## Introduction
22 |
23 | Quantitative variables can be summarized using two groups of measures:
24 | 1) center, and 2) spread. Just like there are various measures of center
25 | (e.g. average, median, mode), we also have several measures of spread or
26 | variability:
27 |
28 | - range
29 | - interquartile range
30 | - standard deviation (and variance)
31 |
32 |
33 | ## Range
34 |
35 | The most basic type of measure of spread is the __range__. The range is obtained
36 | as the difference between the maximum value and the minimum value.
37 |
38 | For example, let's consider the values 0, 5, -8, 7, and -3 used in the textbook
39 | (page 66). To find the range, you need to determine the smallest and largest
40 | values, which in this case are -8 and 7, respectively. And then obtain the
41 | difference:
42 |
43 | $$
44 | range = 7 - (-8) = 15
45 | $$
46 |
47 | For illustration purposes, let's implement this minimalist example in R. First
48 | we create a vector `x` with the five values. You can use the functions `max()`
49 | and `min()` to get the largest and smallest values in `x`:
50 |
51 | ```{r}
52 | x = c(0, 5, -8, 7, -3)
53 | maximum = max(x)
54 | minimum = min(x)
55 |
56 | # range
57 | maximum - minimum
58 | ```
59 |
60 | Actually, there is a `range()` function in R, which gives you the maximum and
61 | minimum value (but not the subtraction):
62 |
63 | ```{r}
64 | # range: max value, and min value
65 | range(x)
66 | ```
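By the way, if you want the range as a single number, a compact idiom is to take the difference of the two values returned by `range()`, using the function `diff()`:

```{r}
x = c(0, 5, -8, 7, -3)

# max minus min, in one expression
diff(range(x))
```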
67 |
68 | The range is one type of measure of variability. It tells you the _length_
69 | of the scatter in the data. The issue with the range is that extreme values
70 | may have a considerable effect on it. For example, if you add a value of 20 to
71 | `x` the new range becomes:
72 |
73 | ```{r}
74 | y = c(x, 20)
75 |
76 | # range
77 | max(y) - min(y)
78 | ```
79 |
80 | The presence of outliers will affect the magnitude of the range.
81 |
82 |
83 | ## Interquartile Range (IQR)
84 |
85 | To overcome the limitations of the range we can use a different type of range
86 | called the __interquartile range__ or _IQR_. This is a range based not on the
87 | minimum and maximum values but on the first and third quartiles.
88 |
89 | One way to compute quartiles in R is with the function `quantile()`. There are
90 | slightly different formulas to compute quartiles. To find the quartiles---as
91 | discussed in most introductory statistics books---you need to use the argument
92 | `type = 2` inside the `quantile()` function:
93 |
94 | ```{r}
95 | x = c(0, 5, -8, 7, -3)
96 |
97 | # 1st quartile
98 | Q1 = quantile(x, probs = 0.25, type = 2)
99 |
100 | # 3rd quartile
101 | Q3 = quantile(x, probs = 0.75, type = 2)
102 |
103 | # IQR
104 | Q3 - Q1
105 | ```
106 |
107 | You can also use the dedicated function `IQR()` to compute the interquartile
108 | range:
109 |
110 | ```{r}
111 | IQR(x, type = 2)
112 | ```
113 |
114 | Compared to the classic range, the IQR is more resistant to outliers because
115 | it does not consider the entire set of values, just those between the first
116 | and third quartile. If we add an extreme negative value of -50, and an extreme
117 | positive value of 40 to `x`, the IQR should not be affected:
118 |
119 | ```{r}
120 | y = c(x, -50, 40)
121 |
122 | # IQR
123 | IQR(y, type = 2)
124 | ```
125 |
126 |
127 | ## The Root Mean Square (RMS)
128 |
129 | Another measure of spread is the Standard Deviation (SD). However, in order to
130 | talk about the SD, I will follow the same approach as the FPP book and first introduce the __Root Mean Square__ or RMS.
131 |
132 | The values in our toy example are 0, 5, -8, 7, and -3. To find a central value
133 | we can use either the average or the median:
134 |
135 | ```{r results='hide'}
136 | x = c(0, 5, -8, 7, -3)
137 |
138 | mean(x)
139 | median(x)
140 | ```
141 |
142 | What about a measure of _size_? In other words, how would you find a measure
143 | of how small or how big the values in `x` are? Is it possible to obtain a
144 | quantity that tells you something about the representative _magnitude_ of values
145 | in `x`?
146 |
147 | To answer this question about a typical magnitude of values we need to ignore
148 | the signs. One way to do this is by looking at the absolute values, and then
149 | compute the average:
150 |
151 | ```{r}
152 | abs(x)
153 |
154 | mean(abs(x))
155 | ```
156 |
157 | For convenience (e.g. easier algebraic manipulation and nicer mathematical properties), statisticians prefer to square the values instead of using the
158 | absolute values, and then compute the average of the squares:
159 |
160 | ```{r}
161 | # square value
162 | x^2
163 |
164 | # average of square values
165 | sum(x^2) / length(x)
166 | ```
167 |
168 | The issue with using square values is that now you end up working with
169 | square units, and with a larger number that has little to do with a typical
170 | magnitude of the original values. To tackle this problem, we take the square
171 | root:
172 |
173 | ```{r}
174 | # root-mean-square (r.m.s)
175 | sqrt(sum(x^2) / length(x))
176 |
177 | # equivalent to
178 | sqrt(mean(x^2))
179 | ```
180 |
181 | The value `r round(sqrt(mean(x^2)), 2)` is referred to as the _r.m.s. size_ of
182 | the numbers in `x`. The RMS size provides a numeric summary for the magnitude
183 | of the data. It is not really the average magnitude, but you can think of it
184 | as such.
185 |
186 |
187 | ## Standard Deviation (SD)
188 |
189 | Now that we have introduced the concept of r.m.s. size of a list of numbers,
190 | we can talk about a third measure of spread known as the
191 | __Standard Deviation__ (SD). Simply put, the Standard Deviation is a measure
192 | of spread that quantifies the amount of variation around the average.
193 |
194 | A keyword is the term __deviation__. In the previous script tutorial---about measures of center---I introduced the concept of _deviations_.
195 | If we denote a set of $n$ values with $x_1, x_2, \dots, x_n$, and a reference
196 | value by $ref$, a deviation is the difference between an observed
197 | value $x_i$ and the value of reference $ref$, that is, $(x_i - ref)$.
198 |
199 | A special type of deviation is when the reference value becomes the average.
200 | If $avg$ represents the average of $x_1, x_2, \dots, x_n$, we can calculate the
201 | deviations of all observations from the average: $(x_i - avg)$.
202 |
203 | The Standard Deviation is based on these deviations. To be more precise, it
204 | is based on the R.M.S. size of deviations from the average:
205 |
206 | $$
207 | SD = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - avg)^2 }
208 | $$
209 |
210 | The SD says how far away the numbers $x_1, x_2, \dots, x_n$ are from their average.
211 | In this sense, you can think of the SD as the typical magnitude of scatter
212 | around the average.
213 |
214 |
215 | ### The `sd()` function
216 |
217 | All statistical packages come with a function that allows you to calculate
218 | the Standard Deviation. In R, there is the function `sd()`. However, the way
219 | `sd()` works is by using a slightly different formula:
220 |
221 | $$
222 | SD^{+} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - avg)^2 }
223 | $$
224 |
225 | Note that $SD^{+}$ divides by $n-1$ instead of $n$. When the number of values
226 | $n$ is big, $\sqrt{n-1}$ is very close to $\sqrt{n}$. However, for relatively
227 | small values of $n$, the difference between $\sqrt{n-1}$ and $\sqrt{n}$ can
228 | be considerable.
229 |
230 | If you want to use `sd()` to obtain $SD$, you need to multiply the output by a
231 | correction factor of $\sqrt{\frac{n-1}{n}}$:
232 |
233 | ```{r}
234 | x = c(0, 5, -8, 7, -3)
235 | n = length(x)
236 |
237 | # SD
238 | sqrt((n-1)/n) * sd(x)
239 |
240 | # SD+
241 | sd(x)
242 | ```
243 |
244 |
245 |
--------------------------------------------------------------------------------
/scripts/05-measures-spread.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/05-measures-spread.pdf
--------------------------------------------------------------------------------
/scripts/06-normal-curve.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "The Normal Curve"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | output: html_document
6 | urlcolor: blue
7 | ---
8 |
9 | > ### Learning Objectives
10 | >
11 | > - Becoming familiar with the normal curve
12 | > - Intro to the functions `dnorm()`, `pnorm()`, and `qnorm()`
13 | > - How to find areas under the normal curve using R
14 | > - Converting values to standard units
15 |
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = TRUE)
19 | ```
20 |
21 |
22 | ## Introduction
23 |
24 | Let's look at the distributions of some variables in the data of NBA players:
25 |
26 | ```{r}
27 | # assembling the URL of the CSV file
28 | repo = 'https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/'
29 | datafile = 'master/data/nba_players.csv'
30 | url = paste0(repo, datafile)
31 | # read in data set
32 | nba = read.csv(url)
33 | ```
34 |
35 | More specifically, let's take a peek at the histograms of variables `height`,
36 | `weight`, `age`, `points2_percent`
37 |
38 | ```{r echo = FALSE, out.width='95%', fig.align='center'}
39 | variables = c('height', 'weight', 'age', 'points2_percent')
40 | op = par(mfrow = c(2, 2))
41 | for (i in variables) {
42 | hist(nba[ ,i], xlab = i,
43 | col = 'gray80', las = 1,
44 | main = paste('Histogram of', i))
45 | }
46 | par(op)
47 | ```
48 |
49 | - `height` seems to have a slightly left skewed distribution.
50 | - `weight` looks roughly symmetric.
51 | - `age` has a right skewed distribution.
52 | - `points2_percent` appears to be fairly symmetric.
53 |
54 | These distributions are examples of some of the possible patterns that
55 | you will find when describing data in real life. If you are lucky, you may
56 | even get to see a perfectly symmetric distribution one day.
57 |
58 | Among the wide range of distribution shapes that we encounter when looking
59 | at data, one special pattern has received most of the attention: the so-called
60 | _symmetric bell-shaped_ or mound-shaped distribution, like that of
61 | `points2_percent` and `weight`. It is true that these two histograms are far
62 | from perfect symmetry, but we can put them within the _fairly_ bell-shaped
63 | category.
64 |
65 |
66 | ## Normal Curve
67 |
68 | It turns out that there is one mathematical function that fits (density)
69 | histograms having a symmetric bell-shaped pattern: the famous __normal curve__
70 | given by the following equation
71 |
72 | $$
73 | y = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} x^2}
74 | $$
75 |
76 | This equation, also known as the Laplace-Gaussian curve, was first discovered by
77 | Abraham de Moivre (circa 1720) while working on the first problems about
78 | probability. However, his work around the normal equation went unnoticed for
79 | many years. By the time historians realized he had been the first person to
80 | come up with the normal equation, most people had attributed authorship to
81 | either French scholar Pierre-Simon Laplace or German mathematician
82 | Carl Friedrich Gauss.
83 |
84 | In the past, before the 1880s, the curve was referred to as the _Error curve_,
85 | because of its application around the errors from measurements in astronomy.
86 | The name _normal_ appeared around the late 1870s and early 1880s, when British
87 | biometricians like Francis Galton, and later on his disciple Karl Pearson,
88 | together with Ronald Fisher, popularized the word _normal_. Galton never
89 | explained why he used the term "normal" although it seems that he was implying
90 | the sense of conforming to a norm (i.e. a standard, model, pattern, type).
91 |
92 |
93 | ### Plotting the Normal Curve in R
94 |
95 | You can use R to obtain a graph of the normal curve. One approach is to generate
96 | values for the x-axis, and then use the equation of the normal curve to obtain
97 | values for the y-axis:
98 |
99 | ```{r out.width='60%', fig.align='center'}
100 | x = seq(from = -3, to = 3, by = 0.01)
101 | y = (1/sqrt(2 * pi)) * exp(-(x^2)/2)
102 |
103 | plot(x, y, type = "l", lwd = 3, col = "blue")
104 | ```
105 |
106 | First we generate a vector `x` with some values for the x-axis ranging from
107 | -3 to 3. Then we use `x` to find the heights of the `y` variable. Finally,
108 | we use the values in `x` and `y` as coordinates of the `plot()`. The argument
109 | `type = 'l'` is used to graph a line instead of dots. The argument `lwd` allows
110 | you to define the width of the line. And `col` lets you define a color.
111 |
112 |
113 | ## Normal Distribution Functions
114 |
115 | Instead of working with the equation `y = (1/sqrt(2 * pi)) * exp(-(x^2)/2)`,
116 | R has a family of four functions dedicated to the normal curve:
117 |
118 | - `dnorm()` density function
119 | - `pnorm()` distribution function
120 | - `qnorm()` quantile function
121 | - `rnorm()` random number generator function
122 |
123 |
124 | ### Heights of the curve with `dnorm()`
125 |
126 | The function `dnorm()` is the __density__ function. This is actually the function
127 | that lets you find the height of the curve (i.e. $y$ values). Instead of
128 | manually coding the normal equation, you can use `dnorm()` and get the
129 | previously obtained graph like this:
130 |
131 | ```{r out.width='60%', fig.align='center'}
132 | x = seq(from = -3, to = 3, by = 0.01)
133 | y = dnorm(x)
134 |
135 | plot(x, y, type = "l", lwd = 3, col = "blue")
136 | ```
137 |
138 |
139 | ### Areas under the curve with `pnorm()`
140 |
141 | The function `pnorm()` is the distribution function. By default, `pnorm()`
142 | returns the area under the curve to the __left__ of a specified `x` value. For
143 | instance, the area to the left of 0 is 0.5 or 50%:
144 |
145 | ```{r}
146 | pnorm(0)
147 | ```
148 |
149 | Try `pnorm()` with these values
150 |
151 | ```{r eval = FALSE}
152 | pnorm(-2)
153 | pnorm(-1)
154 | pnorm(1)
155 | pnorm(2)
156 | ```
157 |
158 | You can also use `pnorm()` to find areas under the normal curve to the __right__
159 | of a specific `x` value. This is done by using the argument `lower.tail = FALSE`:
160 |
161 | ```{r}
162 | # area to the right of 1
163 | pnorm(1, lower.tail = FALSE)
164 | ```
165 |
166 | Try finding the areas to the right of:
167 |
168 | ```{r eval = FALSE}
169 | pnorm(-2.5, lower.tail = FALSE)
170 | pnorm(-2, lower.tail = FALSE)
171 | pnorm(0.5, lower.tail = FALSE)
172 | pnorm(1.5, lower.tail = FALSE)
173 | ```
174 |
175 |
176 | Sometimes you need to find the area between two $z$ values. For instance, the
177 | area between -1 and 1 (which is about 68%). Finding this type of area involves
178 | subtracting the smaller area to the left of -1 from the larger area to the
179 | left of 1:
180 |
181 | ```{r}
182 | # area between -1 and 1
183 | pnorm(1) - pnorm(-1)
184 | ```
185 |
186 | What about the area between -2 and 2?
187 |
188 | ```{r}
189 | # area between -2 and 2
190 | pnorm(2) - pnorm(-2)
191 | ```
192 |
193 |
194 |
195 | ### Z values of a given area with `qnorm()`
196 |
197 | The function `qnorm()` is the quantile function. You can think of this function
198 | as the inverse of `pnorm()`. That is, for a given area under the curve, use
199 | `qnorm()` to find what is the corresponding `z` value (i.e. value on the x-axis):
200 |
201 | ```{r}
202 | # z-value such that the area to its left is 0.5
203 | qnorm(0.5)
204 |
205 | # z-value such that the area to its left is 0.3
206 | qnorm(0.3)
207 | ```
208 |
209 | Likewise, you can use the argument `lower.tail = FALSE` to find values given
210 | a right-tail area:
211 |
212 | ```{r}
213 | # z-value such that the area to its right is 0.5
214 | qnorm(0.5, lower.tail = FALSE)
215 |
216 | # z-value such that the area to its right is 0.3
217 | qnorm(0.3, lower.tail = FALSE)
218 | ```
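Because `qnorm()` is the inverse of `pnorm()`, composing the two functions recovers the original input:

```{r}
# starting from an area, qnorm() then pnorm() returns the same area
pnorm(qnorm(0.3))

# starting from a z-value, pnorm() then qnorm() returns the same z-value
qnorm(pnorm(1.5))
```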
219 |
220 |
221 |
222 | ## Standard Units
223 |
224 | In real life, most variables will be measured in some scale: `height` measured
225 | in inches, `weight` measured in pounds, `age` measured in years,
226 | `points2_percent` measured in percentage. To be able to use the normal curve
227 | as an approximation for symmetric bell-shaped distributions, you will need
228 | to convert the original units into __standard units__ (SU).
229 |
230 | Recall that the conversion formula from $x$ to standard units is:
231 |
232 | $$
233 | SU = \frac{x - avg}{SD}
234 | $$
235 |
236 | Let's see how you can convert `weight` values to SU using R. First we need
237 | to find the average and standard deviation of `weight`:
238 |
239 | ```{r}
240 | # average weight
241 | avg_weight = mean(nba$weight)
242 | avg_weight
243 |
244 | # SD weight
245 | # (remember to use correction factor)
246 | n = nrow(nba)
247 | sd_weight = sqrt((n-1)/n) * sd(nba$weight)
248 | sd_weight
249 | ```
250 |
251 | To convert the weights of the players to standard units, subtract the average
252 | and then divide by the SD:
253 |
254 | ```{r}
255 | # weight in SU
256 | su_weight = (nba$weight - avg_weight) / sd_weight
257 |
258 | # weights in SU of first 5 players
259 | su_weight[1:5]
260 | ```
261 |
262 |
263 | What does the histogram of `su_weight` look like?
264 |
265 | ```{r out.width='60%', fig.align='center'}
266 | # density histogram
267 | hist(su_weight, las = 1, col = 'gray80', probability = TRUE,
268 | ylim = c(0, 0.5), xlim = c(-3.5, 3.5),
269 | main = 'Histogram of Weight in SU', xlab = 'standard units')
270 | ```
271 |
272 |
273 | An alternative picture of the distribution of `su_weight` can be obtained by
274 | plotting a kernel density curve:
275 |
276 | ```{r out.width='60%', fig.align='center'}
277 | dens_weight = density(su_weight)
278 | plot(dens_weight, axes = FALSE, ylim = c(0, 0.5), xlim = c(-3.5, 3.5),
279 | main = 'Density Curve', xlab = 'standard units', lwd = 2, col = 'blue')
280 | # x-axis
281 | axis(side = 1)
282 | # y-axis
283 | axis(side = 2, las = 1)
284 | ```
285 |
286 | Looking at both the histogram and the kernel density curve, the shape of the
287 | distribution is symmetric but it does not have a peak around zero.
288 | You can say that it has more of a plateau or flat peak.
289 |
290 |
291 | ## Using Normal Approximation
292 |
293 | Although `weight` does not have a central peak, we can try to see how well
294 | the normal curve approximates its distribution. From the attributes of the
295 | normal curve, we know that 50% of players should have a weight below
296 | `avg_weight`. We can directly check what is the proportion of players below
297 | `avg_weight`:
298 |
299 | ```{r}
300 | # proportion of players below average weight
301 | sum(nba$weight <= avg_weight) / n
302 | ```
303 |
304 | This is consistent with `weight` (and `su_weight`) having a fairly symmetric shape.
305 |
306 | From the empirical 68-95-99.7 rule, we know that about 68% of players should
307 | have weights between `r round(avg_weight, 2)` plus-minus
308 | `r round(sd_weight, 2)`, that is, between `r round(avg_weight - sd_weight, 2)`
309 | and `r round(avg_weight + sd_weight, 2)`
310 |
311 | ```{r}
312 | weight_minus = avg_weight - sd_weight
313 | weight_plus = avg_weight + sd_weight
314 |
315 | # proportion of players within 1 SD from average weight
316 | sum(nba$weight <= weight_plus & nba$weight >= weight_minus) / n
317 | ```
318 |
319 | As you can tell, the proportion of players within 1 SD of the average is not
320 | 68% but about 65%. However, a difference of 3 percentage points is not that big.
321 |
322 |
323 | ### Assuming Normality ...
324 |
325 | Let's pretend that `weight` does have a symmetric bell-shaped distribution,
326 | and that we are interested in finding the proportion of players with weights
327 | below 200 pounds.
328 |
329 | You can use `pnorm()` to find this proportion without having to convert
330 | to standard units. All you need to do is specify the `mean` and `sd` arguments
331 | with the corresponding average and SD values, respectively:
332 |
333 | ```{r}
334 | # proportion of players with weight below 200 pounds
335 | pnorm(200, mean = avg_weight, sd = sd_weight)
336 | ```
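Passing `mean` and `sd` to `pnorm()` is equivalent to converting 200 pounds to standard units yourself, and then using the standard normal curve:

```{r}
# same proportion, obtained via standard units
pnorm((200 - avg_weight) / sd_weight)
```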
337 |
338 | To find the proportion of players with weights above 230 pounds, include the
339 | argument `lower.tail = FALSE`:
340 |
341 | ```{r}
342 | # proportion of players with weight above 230 pounds
343 | pnorm(230, mean = avg_weight, sd = sd_weight, lower.tail = FALSE)
344 | ```
345 |
346 | You can also use `qnorm()` to find what would be the corresponding weight
347 | such that 60% of players are below it:
348 |
349 | ```{r}
350 | qnorm(0.6, mean = avg_weight, sd = sd_weight)
351 | ```
352 |
353 | Or what is the weight such that 35% of players are above it:
354 |
355 | ```{r}
356 | qnorm(0.35, mean = avg_weight, sd = sd_weight, lower.tail = FALSE)
357 | ```
358 |
--------------------------------------------------------------------------------
/scripts/06-normal-curve.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/06-normal-curve.pdf
--------------------------------------------------------------------------------
/scripts/07-scatter-diagrams.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Scatter Diagrams"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | output: html_document
6 | urlcolor: blue
7 | ---
8 |
9 | > ### Learning Objectives
10 | >
11 | > - How to use `plot()` to create scatter diagrams
12 | > - Adding points with `points()`
13 | > - Adding lines with `abline()`
14 | > - How to use `ggplot()` to create scatter diagrams
15 |
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = TRUE)
19 | ```
20 |
21 | ## Introduction
22 |
23 | The easiest way to plot scatter diagrams in R is with the `plot()` function.
24 | I should say that `plot()` produces different kinds of plots depending on the
25 | type of input(s) that you pass to it.
26 |
27 | If you pass two numeric variables (i.e. two R vectors) `x` and `y`, `plot()`
28 | will produce a scatter diagram. For example, consider the `height` and `weight`
29 | variables of the following toy data table:
30 |
31 | ```{r echo = FALSE}
32 | library(xtable)
33 | dat = data.frame(
34 | name = c('Luke', 'Leia', 'Obi-Wan', 'Yoda', 'Chewbacca'),
35 | sex = c('male', 'female', 'male', 'male', 'male'),
36 | height = c(172, 150, 182, 66, 228),
37 | weight = c(77, 49, 44, 78, 112)
38 | )
39 | ```
40 |
41 | ```{r, echo=FALSE, results='asis', message=FALSE}
42 | xtb <- xtable(dat, digits = 2)
43 | print(xtb, comment = FALSE, type = 'latex',
44 | include.rownames = FALSE)
45 | ```
46 |
47 | To make a scatter diagram with `height` and `weight`, you can create two
48 | vectors and pass them to `plot()`:
49 |
50 | ```{r out.width='50%', fig.align='center', fig.width=3, fig.height=3.5}
51 | height = c(172, 150, 182, 66, 228) # in centimeters
52 | weight = c(77, 49, 44, 78, 112) # in kilograms
53 |
54 | # default scatter diagram
55 | plot(height, weight)
56 | ```
57 |
58 | If you pass a factor to `plot()` it will produce a bar-chart:
59 |
60 | ```{r out.width='50%', fig.align='center', fig.width=2.5, fig.height=3.5}
61 | # qualitative variable (as an R factor)
62 | sex = factor(c('male', 'female', 'male', 'male', 'male'))
63 |
64 | # default bar-chart
65 | plot(sex)
66 | ```
67 |
68 |
69 | Note that `plot()` displays a very simple, and kind of ugly, scatter diagram.
70 | This is not an accident. In fact, the basic plots in R follow a "quick and dirty"
71 | approach. They are not publication quality, but that is OK. The default display
72 | of `plot()` was not designed to produce pretty graphics, but rather to produce
73 | visualizations that quickly allow you to explore the data, identify patterns,
74 | help you ask new research questions, and then move on with more visualizations
75 | or to the next analytical stages.
76 |
77 | Although `plot()` produces a basic graph, you can use several arguments, or
78 | graphical parameters, to obtain a nicer chart. To find more information about
79 | the available graphical parameters for `plot()`, take a look at the documentation
80 | provided by `help(plot)`.
81 |
82 | The following code uses various graphical parameters to display a more visually
83 | appealing scatter diagram:
84 |
85 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5}
86 | # nicer scatter diagram
87 | plot(height, weight,
88 | las = 1, # orientation of y-axis tick marks
89 | pch = 19, # filled dots
90 | col = '#598CDD', # color of dots
91 | xlab = 'Height (cm)', # x-axis label
92 | ylab = 'Weight (kg)', # y-axis label
93 | main = 'Height -vs- Weight scatter diagram')
94 | ```
95 |
96 |
97 | ## Adding points and lines
98 |
99 | Often, you may want to add more points and/or line(s) to a given plot. When
100 | you use `plot()`, you add points with `points()`, and lines with `abline()`.
101 |
102 | For example, say you want to add the point of averages. First, get the
103 | averages:
104 |
105 | ```{r}
106 | avg_height = mean(height)
107 | avg_weight = mean(weight)
108 | ```
109 |
110 | Once you have the coordinates of the point of averages, you can `plot()` again
111 | the scatter diagram, adding the point of averages with `points()`:
112 |
113 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5}
114 | # scatter diagram
115 | plot(height, weight,
116 | las = 1, # orientation of y-axis tick marks
117 | pch = 19, # filled dots
118 | col = '#598CDD', # color of dots
119 | xlab = 'Height (cm)', # x-axis label
120 | ylab = 'Weight (kg)', # y-axis label
121 | main = 'Height -vs- Weight scatter diagram')
122 | # point of averages
123 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato")
124 | ```
125 |
126 |
127 | Another common task involves adding one or more lines to a scatter diagram
128 | produced by `plot()`. One option to achieve this task is via the `abline()`
129 | function. Here's an example showing the previous scatter diagram, with two
130 | guide lines corresponding to the point of averages:
131 |
132 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5}
133 | # scatter diagram
134 | plot(height, weight,
135 | las = 1, # orientation of y-axis tick marks
136 | pch = 19, # filled dots
137 | col = '#598CDD', # color of dots
138 | xlab = 'Height (cm)', # x-axis label
139 | ylab = 'Weight (kg)', # y-axis label
140 | main = 'Height -vs- Weight scatter diagram')
141 | # guide lines for point of avgs
142 | abline(h = avg_weight, v = avg_height, col = "tomato")
143 | # point of averages
144 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato")
145 | ```
146 |
147 | The argument `h` is used to specify the y-value for _horizontal_ lines;
148 | the argument `v` is used to specify the x-value for _vertical_ lines.
149 |
150 | If what you want is to specify a line with intercept `a` and slope `b`, then
151 | specify these arguments inside `abline()`:
152 |
153 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4.5}
154 | # scatter diagram
155 | plot(height, weight,
156 | las = 1, # orientation of y-axis tick marks
157 | pch = 19, # filled dots
158 | col = '#598CDD', # color of dots
159 | xlab = 'Height (cm)', # x-axis label
160 | ylab = 'Weight (kg)', # y-axis label
161 | main = 'Height -vs- Weight scatter diagram')
162 | # guide lines for point of avgs
163 | abline(h = avg_weight, v = avg_height, col = "tomato")
164 | # line with intercept and slope
165 | abline(a = 40, b = 0.3, col = "gray50", lty = 2, lwd = 2)
166 | # point of averages
167 | points(avg_height, avg_weight, pch = 19, cex = 2, col = "tomato")
168 | ```
169 |
170 |
171 |
172 | ## Scatter diagrams with `ggplot2`
173 |
174 | Another approach to create scatter diagrams in R is to use functions from the
175 | package `"ggplot2"`. This package provides a different philosophy to define
176 | graphs, and it also produces plots with visual attributes carefully chosen
177 | to provide prettier plots.
178 |
179 | You should have the package `"ggplot2"` already installed, since you were
180 | supposed to use it for HW02. Assuming that this is the case, you need to load
181 | `"ggplot2"` with the function `library()` in order to start using its functions:
182 |
183 | ```{r warning=FALSE, message=FALSE}
184 | # load ggplot2
185 | library(ggplot2)
186 | ```
187 |
188 | One of the major differences between basic plots---like those produced by `plot()`---and graphics made with `ggplot()` is that the latter requires the data
189 | to be in the form of a data frame:
190 |
191 | ```{r}
192 | dat = data.frame(
193 | name = c('Luke', 'Leia', 'Obi-Wan', 'Yoda', 'Chewbacca'),
194 | sex = c('male', 'female', 'male', 'male', 'male'),
195 | height = c(172, 150, 182, 66, 228),
196 | weight = c(77, 49, 44, 78, 112)
197 | )
198 | ```
199 |
200 | To create a scatter diagram with `"ggplot2"`, type the following commands:
201 |
202 | ```{r out.width='40%', fig.align='center', fig.width=3, fig.height=3}
203 | ggplot(data = dat, aes(x = height, y = weight)) +
204 | geom_point()
205 | ```
206 |
207 | - The main input of `ggplot()` is `data` which takes the name of the data
208 | frame containing the variables.
209 | - The `aes()` function---inside `ggplot()`---allows you to specify which
210 | variables will be used for the `x` and `y` positions.
211 | - The `+` operator is used to add a _layer_, in this case, the layer corresponds
212 | to `geom_point()`
213 | - The function `geom_point()` specifies the type of geometric object to be
214 | displayed: points (since we want a scatter diagram with dots).
215 |
216 | As you can tell, the default chart produced by `ggplot()` is nicer than the
217 | one produced with `plot()`. You can customize the previous graph to add more
218 | details:
219 |
220 | ```{r out.width='50%', fig.align='center', fig.width=4, fig.height=4}
221 | ggplot(data = dat, aes(x = height, y = weight)) +
222 | geom_point(size = 3) +
223 | theme_bw() +
224 | ggtitle("Height -vs- Weight scatter diagram")
225 | ```
226 |
227 | Here's another example of a scatter diagram that includes labels for each dot:
228 |
229 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4}
230 | ggplot(data = dat, aes(x = height, y = weight)) +
231 | geom_point(size = 3) +
232 | geom_text(aes(label = name), hjust=0, vjust=0) +
233 | xlim(0, 300) +
234 | theme_bw() +
235 | ggtitle("Height -vs- Weight scatter diagram")
236 | ```
237 |
238 | Adding specific points with `ggplot()` is a bit trickier. This is because
239 | you need to provide data to `ggplot()` in the form of a data frame. In order
240 | to plot the point of averages with `ggplot()`, we need to create a data frame
241 | for such a point:
242 |
243 | ```{r}
244 | # data frame for the point of averages
245 | avgs = data.frame(height = avg_height, weight = avg_weight)
246 | avgs
247 | ```
248 |
249 | One way to add the point of averages is to use `geom_point()` twice: once for
250 | the heights and weights of the individuals, and a second time for the
251 | point of averages:
252 |
253 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4}
254 | ggplot(data = dat, aes(x = height, y = weight)) +
255 | geom_point(size = 3) +
256 | geom_point(data = avgs, aes(x = height, y = weight),
257 | col = "tomato", size = 4) +
258 | geom_text(aes(label = name), hjust=0, vjust=0) +
259 | xlim(0, 300) +
260 | theme_bw() +
261 | ggtitle("Height -vs- Weight scatter diagram")
262 | ```
263 |
264 | Finally, here's how to add guide lines for the point of averages:
265 |
266 | ```{r out.width='60%', fig.align='center', fig.width=5, fig.height=4}
267 | ggplot(data = dat, aes(x = height, y = weight)) +
268 | geom_point(size = 3) +
269 | geom_point(data = avgs, aes(x = height, y = weight),
270 | col = "tomato", size = 4) +
271 | geom_vline(xintercept = avg_height, col = 'tomato') +
272 | geom_hline(yintercept = avg_weight, col = 'tomato') +
273 | geom_text(aes(label = name), hjust=0, vjust=0) +
274 | xlim(0, 300) +
275 | theme_bw() +
276 | ggtitle("Height -vs- Weight scatter diagram")
277 | ```
278 |
--------------------------------------------------------------------------------
/scripts/07-scatter-diagrams.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/07-scatter-diagrams.pdf
--------------------------------------------------------------------------------
/scripts/08-correlation.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Correlation Coefficient"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | output: html_document
6 | urlcolor: blue
7 | ---
8 |
9 | > ### Learning Objectives
10 | >
11 | > - Using scatter diagrams to visualize association of two variables
12 | > - Using R to "manually" compute the correlation coefficient
13 | > - Getting to know the function `cor()`
14 | > - Understanding how changes of scale affect the correlation
15 |
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = TRUE)
19 | ```
20 |
21 | ## Introduction
22 |
23 | In the previous script we talked about how to plot scatter diagrams in R using
24 | two different approaches: 1) the basic `plot()` function, and 2) the more
25 | advanced graphics package `"ggplot2"`.
26 | Knowing how to create scatter diagrams will help us introduce the ideas that
27 | have to do with the analysis of two quantitative variables.
28 |
29 | Describing and summarizing a single (quantitative) variable is usually the
30 | first step of any data analysis. This should allow you to get to know the data
31 | by looking at the distributions of the variables, and reducing the numerical
32 | information in the data to a set of measures of center and spread.
33 |
34 | After performing a univariate analysis, the next step will usually consist of
35 | exploring how two variables may be associated: determining the type of association,
36 | how strong the association is (if any), and how to summarize it.
37 |
38 |
39 | ## Anscombe Data Set
40 |
41 | In this tutorial we are going to use a special data set known as the _Anscombe_
42 | data or _Anscombe's Quartet_. This data was created by Francis Anscombe in
43 | the early 1970s to illustrate statistical similarities and differences between
44 | four pairs of $x$-$y$ values. This is one of the many data sets that come
45 | with R, and it is available in the object `anscombe`:
46 |
47 | ```{r}
48 | # Anscombe's Quartet
49 | anscombe
50 | ```
51 |
52 | The data frame `anscombe` contains 8 variables: 4 `x`'s and 4 `y`'s. The way you should handle these variables is: `x1` with `y1`, `x2` with `y2`, and so on.
53 |
54 |
55 | ### Histograms
56 |
57 | Let's begin a univariate analysis by looking at the histograms of the `x`
58 | variables:
59 |
60 | ```{r xhistograms, eval = FALSE}
61 | # histograms of x-variables in 2x2 layout
62 | op = par(mfrow = c(2, 2))
63 | hist(anscombe$x1, col = 'gray80', las = 1)
64 | hist(anscombe$x2, col = 'gray80', las = 1)
65 | hist(anscombe$x3, col = 'gray80', las = 1)
66 | hist(anscombe$x4, col = 'gray80', las = 1)
67 | par(op)
68 | ```
69 |
70 | ```{r xhistograms, echo = FALSE, fig.height=6}
71 | ```
72 |
73 | Note that `x1`, `x2`, and `x3` have the exact same histogram. If you look
74 | at the data frame, this is explained by the fact that these variables have the
75 | same values. In contrast, `x4` has almost all of its values equal to 8,
76 | except for one value of 19.
77 |
78 | Now let's look at the histograms of the `y` variables:
79 |
80 | ```{r yhistograms, eval = FALSE}
81 | # histograms of y-variables in 2x2 layout
82 | op = par(mfrow = c(2, 2))
83 | hist(anscombe$y1, col = 'gray80', las = 1)
84 | hist(anscombe$y2, col = 'gray80', las = 1)
85 | hist(anscombe$y3, col = 'gray80', las = 1)
86 | hist(anscombe$y4, col = 'gray80', las = 1)
87 | par(op)
88 | ```
89 |
90 | ```{r yhistograms, echo = FALSE, fig.height=6}
91 | ```
92 |
93 | ### Measures of Center and Spread
94 |
95 | To get various summary statistics, you can use the function `summary()`:
96 |
97 | ```{r}
98 | # basic summary of x-variables
99 | summary(anscombe[ ,1:4])
100 |
101 | # SD+ of x-variables
102 | apply(anscombe[, 1:4], MARGIN = 2, FUN = sd)
103 | ```
104 |
105 | Again, note that the `summary()` output for `x1`, `x2`, and `x3` is the same.
106 | As for the standard deviation ($SD^+$), all `x`-variables have identical values.
107 | To calculate all the standard deviations at once, we are using the function
108 | `apply()`. This function allows you to _apply_ a function, e.g. `sd()`, to the
109 | columns (`MARGIN = 2`) of the input data `anscombe[, 1:4]`.
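
If you find `apply()` confusing, the same column-wise SDs can be obtained with `sapply()`, which loops over the columns of a data frame (an equivalent sketch, not the only way to do it):

```{r}
# equivalent: apply sd() to each column of the data frame
sapply(anscombe[, 1:4], sd)
```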
110 |
111 | Now let's get the summary indicators and standard deviation for the `y` variables:
112 |
113 | ```{r}
114 | # basic summary of y-variables
115 | summary(anscombe[ ,5:8])
116 |
117 | # SD+ of y-variables
118 | apply(anscombe[, 5:8], MARGIN = 2, FUN = sd)
119 | ```
120 |
121 | Can you notice anything special? Here's a hint: look at the averages and SDs.
122 | All four `y` variables have pretty much the same averages and SDs. But they
123 | have different ranges, quartiles, and medians. And if you take a peek at their
124 | histograms, their distributions also have different shapes.
125 |
126 |
127 | ## Scatter Diagrams
128 |
129 | The real interest in the Anscombe data set has to do with studying the
130 | association between each pair of $x-y$ values. The best way to start exploring
131 | pairwise associations is by looking at the scatter diagrams of each
132 | pair of points. How would you describe the shapes and patterns in each plot?
133 |
134 | ```{r scatterplots, eval = FALSE}
135 | # scatter diagrams in 2x2 layout
136 | op = par(mfrow = c(2, 2), mar = c(4.5, 4, 1, 1))
137 | plot(anscombe$x1, anscombe$y1, pch = 20)
138 | plot(anscombe$x2, anscombe$y2, pch = 20)
139 | plot(anscombe$x3, anscombe$y3, pch = 20)
140 | plot(anscombe$x4, anscombe$y4, pch = 20)
141 | par(op)
142 | ```
143 |
144 | ```{r scatterplots, fig.height=5.5, echo = FALSE}
145 | ```
146 |
147 | - The first set `x1` and `y1` shows some degree of linear association. Although
148 | the dots do not lie on a line, we can say that they follow a linear pattern.
149 |
150 | - The second set clearly has a non-linear pattern; instead, the dots follow
151 | some type of curve (perhaps quadratic) or a polynomial of degree greater than 1.
152 |
153 | - The third set is almost perfectly linear except for the observation
154 | corresponding to $x = 13$ which falls outside the pattern of the rest of $y$
155 | values.
156 |
157 | - The fourth set is similar to the third one in the sense that there is one
158 | observation (an outlier?) that does not follow the pattern of the other values.
159 | Most dots follow a vertical line at $x=8$ except for the dot at $x=19$.
160 |
161 |
162 | ## Correlation Coefficient
163 |
164 | In addition to the visual inspection of the scatter diagrams, statisticians
165 | use a summary measure to quantify the degree of _linear association_ between
166 | two quantitative variables: the __coefficient of correlation__.
167 |
168 | One way to obtain the correlation coefficient of two variables $x$ and $y$ is
169 | as the average of the product of $x$ and $y$ in standard units.
170 |
171 | Let's consider `x1` and `y1` from the `anscombe` data set, and use R to "manually"
172 | calculate the correlation coefficient. This involves obtaining the average and
173 | the standard deviation $SD$, and then converting values to standard units:
174 |
175 | ```{r}
176 | # number of observations
177 | n = nrow(anscombe)
178 |
179 | # x1 in SU
180 | x1_avg = mean(anscombe$x1)
181 | x1_sd = sqrt((n-1)/n) * sd(anscombe$x1)
182 | x1su = (anscombe$x1 - x1_avg) / x1_sd
183 |
184 | # y1 in SU
185 | y1_avg = mean(anscombe$y1)
186 | y1_sd = sqrt((n-1)/n) * sd(anscombe$y1)
187 | y1su = (anscombe$y1 - y1_avg) / y1_sd
188 |
189 | # correlation: average of products
190 | mean(x1su * y1su)
191 | ```
192 |
193 | Here's some good news. You don't really need to "manually" calculate the
194 | correlation coefficient. R actually has a function to compute the correlation
195 | of two variables: `cor()`
196 |
197 | ```{r}
198 | # correlation coefficient
199 | cor(anscombe$x1, anscombe$y1)
200 | ```
201 |
202 | Now let's get the correlation coefficients for all four pairs of variables:
203 |
204 | ```{r}
205 | cor(anscombe$x1, anscombe$y1)
206 | cor(anscombe$x2, anscombe$y2)
207 | cor(anscombe$x3, anscombe$y3)
208 | cor(anscombe$x4, anscombe$y4)
209 | ```
210 |
211 | Any surprises? As you can tell, all four pairs of $x,y$ variables have basically
212 | the same correlation of `r round(cor(anscombe$x1, anscombe$y1), 3)`.
213 | But not all of them have scatter diagrams in which the points are clustered around
214 | a line.
215 |
216 | The take home message is that the correlation coefficient can be misleading in
217 | the presence of outliers or non-linear association.
218 |
219 |
220 | ## Properties of the Correlation Coefficient
221 |
222 | One of the properties of the correlation coefficient is that it is a symmetric
223 | measure. By this we mean that the order of the variables is not important.
224 | You can interchange between $x$ and $y$, and the correlation between them
225 | is unchanged:
226 |
227 | $$
228 | cor(x,y) = cor(y,x)
229 | $$
230 |
231 | To illustrate this property, let's create two variables:
232 |
233 | ```{r}
234 | # two variables
235 | x = c(1, 3, 4, 5, 7, 6)
236 | y = c(5, 9, 7, 8, 9, 10)
237 | ```
238 |
239 | ```{r scatterdiags, eval = FALSE}
240 | op = par(mfrow = c(1,2))
241 | plot(x, y, pch = 20, col = "blue", las = 1, cex = 1.5)
242 | plot(y, x, pch = 20, col = "blue", las = 1, cex = 1.5)
243 | par(op)
244 | ```
245 |
246 | ```{r scatterdiags, out.width='80%', fig.align='center', fig.width = 8, fig.height=4}
247 | ```
248 |
249 | The scatter diagram changes depending on what variable is on each axis.
250 | However, the correlation coefficient in both cases is the same:
251 |
252 | ```{r}
253 | # symmetric
254 | cor(x, y)
255 | cor(y, x)
256 | ```
257 |
258 |
259 | ### Change of Scale
260 |
261 | The other properties of the correlation coefficient have to do with what the
262 | FPP book calls _change of scale_. To be more precise, the considered changes
263 | of scale are __linear__ (i.e. linear transformations).
264 | Typical operations that result in a linear change of scale are:
265 |
266 | - Adding a scalar: $x + 3, y$
267 | - Multiplying times a positive scalar: $2x, y$
268 | - Multiplying times a negative scalar: $-2x, y$
269 | - Adding and multiplying: $2x + 3, y$
270 |
271 | ```{r change-scale, eval = FALSE}
272 | # scatter diagrams in 2x2 layout
273 | op = par(mfrow = c(2, 2), mar = c(4.5, 4, 1, 1))
274 | plot(x + 3, y, pch = 20, col = "orange", las = 1, cex = 1.5)
275 | plot(2 * x, y, pch = 20, col = "green3", las = 1, cex = 1.5)
276 | plot((-2) * x, y, pch = 20, col = "violet", las = 1, cex = 1.5)
277 | plot(2 * x + 3, y, pch = 20, col = "red", las = 1, cex = 1.5)
278 | par(op)
279 | ```
280 |
281 | ```{r change-scale, echo = FALSE, fig.height=6}
282 | ```
283 |
284 | ```{r correlations, eval = FALSE}
285 | cor(x, y)
286 | cor(x + 3, y)
287 | cor(2 * x, y)
288 | cor(-2 * x, y)
289 | cor(2 * x + 3, y)
290 | ```
291 |
292 | ```{r correlations, echo = FALSE}
293 | ```
294 |
295 | What can you conclude from the changes of scale? In which case is the correlation
296 | coefficient affected by such changes?
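
One way to double-check your conclusions is to compare the correlations directly with `all.equal()` (a small verification sketch, using the `x` and `y` vectors defined above):

```{r}
# adding a constant or multiplying by a positive scalar: no effect
all.equal(cor(x, y), cor(x + 3, y))
all.equal(cor(x, y), cor(2 * x, y))

# multiplying by a negative scalar: the sign of the correlation flips
all.equal(cor(x, y), -cor(-2 * x, y))
```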
297 |
--------------------------------------------------------------------------------
/scripts/08-correlation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/08-correlation.pdf
--------------------------------------------------------------------------------
/scripts/09-regression-line.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/09-regression-line.pdf
--------------------------------------------------------------------------------
/scripts/10-prediction-and-errors-in-regression.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Predictions and Errors in Regression"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | output: html_document
6 | fontsize: 11pt
7 | urlcolor: blue
8 | ---
9 |
10 | > ### Learning Objectives
11 | >
12 | > - Calculating predicted values with the regression method
13 | > - Looking at the regression residuals
14 | > - Calculating r.m.s. error for regression
15 |
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = TRUE)
19 | ```
20 |
21 |
22 | ## Introduction
23 |
24 | In the previous script, you learned about the function `lm()` to obtain a simple linear regression model. Specifically, we looked at the regression `coefficients`: the intercept and the slope. You also learned how to plot a scatter diagram with the regression line, via the `abline()` function, as well as how to "manually" calculate the intercept and slope with the formulas:
25 |
26 | $$
27 | slope = r \times \frac{SD_y}{SD_x}
28 | $$
29 |
30 | In turn, Chapter 12 presents the formula of the intercept as:
31 |
32 | $$
33 | intercept = avg_y - slope \times avg_x
34 | $$
35 |
36 |
37 | ## Regression with Height Data Set
38 |
39 | To continue our discussion, we'll keep using the data set in the CSV file `pearson.csv` (in the GitHub repository):
40 |
41 | ```{r}
42 | # assembling the URL of the CSV file
43 | # (otherwise it won't fit within the margins of this document)
44 | repo = 'https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/'
45 | datafile = 'master/data/pearson.csv'
46 | url = paste0(repo, datafile)
47 |
48 | # read in data set
49 | dat = read.csv(url)
50 | ```
51 |
52 | The data frame `dat` contains `r nrow(dat)` rows, and `r ncol(dat)` columns:
53 |
54 | - `Father`: height of the father (in inches)
55 | - `Son`: height of the son (in inches)
56 |
57 | Here's a reminder on how to use the function `lm()` to regress `Son` on `Father`:
58 |
59 | ```{r}
60 | # run regression analysis
61 | reg = lm(Son ~ Father, data = dat)
62 | reg
63 | ```
64 |
65 | You can compare the coefficients given by `lm()` with your own calculated
66 | $b_1$ and $b_0$ according to the previous formulas. First let's get the main
67 | ingredients:
68 |
69 | ```{r}
70 | # number of values (to be used for correcting SD+)
71 | n = nrow(dat)
72 |
73 | # averages
74 | avg_x = mean(dat$Father)
75 | avg_y = mean(dat$Son)
76 |
77 | # SD (corrected SD+)
78 | sd_x = sqrt((n-1)/n) * sd(dat$Father)
79 | sd_y = sqrt((n-1)/n) * sd(dat$Son)
80 |
81 | # correlation coefficient
82 | r = cor(dat$Father, dat$Son)
83 | ```
84 |
85 | Now let's compute the slope and intercept, and compare them with
86 | `reg$coefficients`
87 |
88 | ```{r}
89 | # slope
90 | b1 = r * (sd_y / sd_x)
91 | b1
92 |
93 | # intercept
94 | b0 = avg_y - (b1 * avg_x)
95 | b0
96 |
97 | # compared with coeffs
98 | reg$coefficients
99 | ```
100 |
101 |
102 | ## Predicting Values
103 |
104 | As I mentioned in the last tutorial, regression tools are
105 | mainly used for prediction purposes. This means that we can use the estimated
106 | regression line $\mathtt{Son} \approx b_0 + b_1 \mathtt{Father}$, to predict
107 | the height of Son given a particular Father's height.
108 |
109 | For example, if a father has a height of 71 inches, what is the predicted
110 | son's height?
111 |
112 | __Option a)__ One way to answer this question is with the regression method described in chapter 10 of FPP. The first step consists of converting $x$ to standard units, then multiplying by $r$ to get the predicted $\hat{y}$ in standard units, and finally rescaling the predicted value back to the original units.
113 |
114 | ```{r}
115 | # height of father in standard units
116 | height = 71
117 | height_su = (height - avg_x) / sd_x
118 | height_su
119 | ```
120 |
121 | ```{r}
122 | # predicted Son's height in standard units
123 | prediction_su = r * height_su
124 | prediction_su
125 | ```
126 |
127 | ```{r}
128 | # rescaled to original units
129 | prediction = prediction_su * sd_y + avg_y
130 | prediction
131 | ```
132 |
133 |
134 | __Option b)__ Another way to find the predicted son's height when the height of the father is 71 is by using the equation of the regression line:
135 |
136 | ```{r}
137 | # predict height of son with a 71 in. tall father
138 | b0 + b1 * 71
139 | ```
140 |
141 | __Option c)__ A third option is with the `predict()` function. The first
142 | argument must be an `"lm"` object; the second argument must be a data frame
143 | containing the values for `Father`:
144 |
145 | ```{r}
146 | # new data (must be a data frame)
147 | newdata = data.frame(Father = 71)
148 |
149 | # predict son's height
150 | predict(reg, newdata)
151 | ```
152 |
153 | If you want to know the predicted values based on several `Father`'s heights,
154 | then do something like this:
155 |
156 | ```{r}
157 | more_data = data.frame(Father = c(65, 66.7, 67, 68.5, 70.5, 71.3))
158 |
159 | predict(reg, more_data)
160 | ```
161 |
162 |
163 | ## R.M.S. Error for Regression
164 |
165 | The predictions given by the regression line will tend to be off. There is
166 | usually some difference between the observed values $y$ and the predicted
167 | values $\hat{y}$. This difference is called __residual__. The residuals are
168 | part of the `"lm"` object `reg`.
169 | You can take a peek at such residuals with `head()`
170 |
171 | ```{r}
172 | # first six residuals
173 | head(reg$residuals)
174 | ```
175 |
176 | By how much will the predicted values be off?
177 | To find the answer, you need to calculate the _Root Mean Square_ (RMS) error
178 | for regression. In other words, you need to take the residuals
179 | (i.e. difference between actual values and predicted values), and get the
180 | square root of the average of their squares.
181 |
182 | ```{r}
183 | # r.m.s. error for regression
184 | rms = sqrt(mean(reg$residuals^2))
185 | rms
186 | ```
187 |
188 | The r.m.s. value tells you the typical size of the residuals. This means that
189 | the predicted heights of sons will typically be off by about `r round(rms, 2)`
190 | inches.
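
FPP also gives a shortcut formula for this quantity: the r.m.s. error equals $\sqrt{1 - r^2} \times SD_y$. You can verify that both approaches agree (using the `r` and `sd_y` values computed earlier in this script):

```{r}
# shortcut formula: sqrt(1 - r^2) times the SD of y
sqrt(1 - r^2) * sd_y
```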
191 |
192 |
193 | ## Are residuals homoscedastic?
194 |
195 | As you know, the main assumption in a simple regression analysis is that $X$
196 | and $Y$ are approximately linearly related. This means that we can
197 | use a line as a good summary for the cloud of points. For a line to be able to do
198 | a good summarizing job, the amount of spread around the regression line should
199 | be fairly constant. This requirement has a very
200 | specific---and rather ugly---name: __homoscedasticity__, which simply means
201 | "same scatter". Visually, homoscedasticity comes in the form of the so-called
202 | football-shaped cloud of points; or, in a more geometric sense, a cloud of points
203 | with a chiefly elliptical shape.
204 |
205 | The `"lm"` object `reg` contains the vector of residuals (see `reg$residuals`).
206 | The residuals from the regression line must average out to 0. To confirm this,
207 | let's get their average:
208 |
209 | ```{r}
210 | mean(reg$residuals)
211 | ```
212 |
213 | You can take a look at the _residual plot_ by running this command:
214 |
215 | ```{r out.width='60%', fig.align='center', fig.width=6, fig.height=5}
216 | # residuals plot
217 | plot(reg, which = 1)
218 | ```
219 |
220 | which is equivalent to this other command:
221 |
222 | ```{r eval = FALSE}
223 | # equivalently
224 | plot(reg$fitted.values, reg$residuals)
225 | ```
226 |
227 | This residual plot is not exactly the same as the one the book describes (pages 187-188).
228 | To plot the residuals like the book does, you would need to use the `Father`
229 | variable on the x-axis:
230 |
231 | ```{r out.width='60%', fig.align='center', fig.width=6, fig.height=5}
232 | # residuals plot (as in FPP)
233 | plot(dat$Father, reg$residuals)
234 | abline(h = 0, lty = 2) # horizontal dashed line
235 | ```
236 |
237 | The difference is only in the scale of the horizontal axis. But the important
238 | part in both plots is the shape of the cloud.
239 | As you look across the residual plot, there is no systematic tendency for the
240 | points to drift up or down. The red line displayed by `plot(reg, which = 1)`,
241 | is a regression line for the residuals. When residuals are homoscedastic, this
242 | line is basically a horizontal line. This is what you want to see when
243 | inspecting the residual plot. Why? Because it supports the appropriate use of
244 | the regression line.
245 |
246 |
247 | ## Summary output
248 |
249 | `reg` is an object of class `"lm"`---linear model. For this type of R object,
250 | you can use the `summary()` function to get additional information and diagnostics:
251 |
252 | ```{r}
253 | # summarized linear model
254 | sum_reg = summary(reg)
255 | sum_reg
256 | ```
257 |
258 | The information displayed by `summary()` is the typical output that most
259 | statistical programs provide about a simple linear regression model. There
260 | are four major parts:
261 |
262 | - `Call`: the command used when invoking `lm()`.
263 | - `Residuals`: summary indicators of the residuals.
264 | - `Coefficients`: table of regression coefficients.
265 | - Additional statistics: more diagnostic tools.
266 |
267 | In the same way that `lm()` produces `"lm"` objects, applying `summary()` to an
268 | `"lm"` object produces a `"summary.lm"` object. This type of object also contains
269 | more information than what is displayed by default. To see the list of all the
270 | components in `sum_reg`, you can again use the function `names()`:
271 |
272 | ```{r}
273 | names(sum_reg)
274 | ```
275 |
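As with the `"lm"` object, any of the listed components can be extracted with the `$` operator. For example, here is a small sketch (using the `sum_reg` object created above) that pulls out a few commonly used pieces:

```r
# R-squared (coefficient of determination)
sum_reg$r.squared

# residual standard error
sum_reg$sigma

# table of regression coefficients (as a matrix)
sum_reg$coefficients
```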
276 |
--------------------------------------------------------------------------------
/scripts/10-prediction-and-errors-in-regression.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/10-prediction-and-errors-in-regression.pdf
--------------------------------------------------------------------------------
/scripts/11-binomial-formula.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/11-binomial-formula.pdf
--------------------------------------------------------------------------------
/scripts/12-chance-process.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Chance Processes and Variability"
3 | subtitle: "Intro to Stats, Spring 2017"
4 | author: "Prof. Gaston Sanchez"
5 | output: html_document
6 | fontsize: 11pt
7 | urlcolor: blue
8 | ---
9 |
10 | > ### Learning Objectives
11 | >
12 | > - How to use R to simulate chance processes
13 | > - Getting to know the function `sample()`
14 | > - Simulate flipping a coin
15 | > - Simulate rolling a die
16 | > - Simulate drawing tickets from a box
17 |
18 |
19 | ```{r setup, include=FALSE}
20 | knitr::opts_chunk$set(echo = TRUE)
21 | ```
22 |
23 | ## Introduction
24 |
25 | In this tutorial we will see how to use R to simulate basic chance processes
26 | like tossing a coin, rolling a die, or drawing tickets from a box. The aim is
27 | to give you some tools that allow you to better understand and visualize
28 | fundamental concepts such as the law of large numbers, the law of averages,
29 | and the central limit theorem.
30 |
31 |
32 | ## Coins, Dice, and Boxes with Tickets
33 |
34 | Chance processes, also referred to as chance experiments, have to do with
35 | actions in which the resulting outcome turns out to be different in each
36 | occurrence.
37 |
38 | Typical examples of basic chance processes are tossing one or more coins,
39 | rolling one or more dice, selecting one or more cards from a deck of cards,
40 | and in general, things that can be framed in terms of drawing tickets out of
41 | a box (or any other type of container: bag, urn, etc.).
42 |
43 | You can use your computer, and R in particular, to simulate chance processes.
44 | In order to do that, the first step consists of learning how to create a
45 | virtual coin, or die, or box with tickets.
46 |
47 |
48 | ### Creating a coin
49 |
50 | The simplest way to create a coin with two sides, `"heads"` and `"tails"`, is
51 | with an R vector via the _combine_ function `c()`
52 |
53 | ```{r}
54 | coin = c("heads", "tails")
55 | ```
56 |
57 | You can also create a _numeric_ coin that shows `0` and `1` instead of
58 | `"heads"` and `"tails"`:
59 |
60 | ```{r}
61 | coin = c(0, 1)
62 | ```
63 |
64 |
65 | ### Creating a die
66 |
67 | What about simulating a die in R? Pretty much the same way you create a coin:
68 | simply define a vector with numbers representing the number of spots in a die.
69 |
70 | ```{r}
71 | die = c(1, 2, 3, 4, 5, 6)
72 |
73 | # equivalent
74 | die = 1:6
75 | ```
76 |
77 |
78 | ### Creating a box with tickets
79 |
80 | Likewise, you can create a general box with tickets. For instance, say you have
81 | a box with tickets labeled 1, 2, 3 and 4; this can be implemented in R as:
82 |
83 | ```{r}
84 | tickets = c(1, 2, 3, 4)
85 | ```
86 |
87 |
88 |
89 | ## Drawing tickets with `sample()`
90 |
91 | Once you have an object that represents the _box with tickets_, the next step
92 | involves learning how to draw tickets from the box. One way to simulate drawing
93 | tickets from a box in R is with the function `sample()` which lets you draw
94 | random samples, with or without replacement, from an input vector.
95 |
96 | For example, consider a "box" with tickets 1, 2, 3.
97 | To draw one ticket, use `sample()` like this:
98 |
99 | ```{r}
100 | # box with tickets
101 | tickets = c(1, 2, 3)
102 |
103 | # draw one ticket
104 | sample(tickets, size = 1)
105 | ```
106 |
107 | By default, `sample()` draws each ticket with the same probability. In other
108 | words, each ticket is assigned the same probability of being chosen. Another
109 | default behavior of `sample()` is to take a sample of the specified `size`
110 | without replacement. If `size = 1`, it does not really matter whether sampling
111 | is done with or without replacement.
112 |
113 | To draw two tickets WITHOUT replacement, use `sample()` like this:
114 |
115 | ```{r}
116 | # draw 2 tickets without replacement
117 | sample(tickets, size = 2)
118 | ```
119 |
120 | To draw two tickets WITH replacement, use `sample()` and specify its argument
121 | `replace = TRUE`, like this:
122 |
123 | ```{r}
124 | # draw 2 tickets with replacement
125 | sample(tickets, size = 2, replace = TRUE)
126 | ```
127 |
128 | The way `sample()` works is by taking a random sample from the input vector.
129 | This means that every time you invoke `sample()` you will likely get a different
130 | output.
131 |
132 | In order to make the examples replicable (so you can get the same output as me),
133 | you need to specify what is called a __random seed__. This is done with the
134 | function `set.seed()`. By setting a _seed_, every time
135 | you use one of the random generator functions, like `sample()`, you will get
136 | the same values.
137 |
138 | ```{r}
139 | # set random seed
140 | set.seed(1257)
141 |
142 | # draw 4 tickets with replacement
143 | sample(tickets, size = 4, replace = TRUE)
144 | ```
145 |
146 | Try the code above. You should get the exact same sample.
147 |
148 | Last but not least, `sample()` comes with the argument `prob` which allows you
149 | to provide specific probabilities for each element in the input vector.
150 |
151 | By default, `prob = NULL`, which means that every element has the same
152 | probability of being drawn. In the example of tossing a coin, the command
153 | `sample(coin)` is equivalent to `sample(coin, prob = c(0.5, 0.5))`. In the
154 | latter case we explicitly specify a probability of 50% chance of heads, and
155 | 50% chance of tails:
156 |
157 | ```{r}
158 | # tossing a fair coin
159 | coin = c("heads", "tails")
160 |
161 | sample(coin)
162 | sample(coin, prob = c(0.5, 0.5))
163 | ```
164 |
165 | However, you can provide different probabilities for each of the elements in
166 | the input vector. For instance, to simulate a __loaded__ coin with chance of
167 | heads 20%, and chance of tails 80%, set `prob = c(0.2, 0.8)` like so:
168 |
169 | ```{r}
170 | # tossing a loaded coin (20% heads, 80% tails)
171 | sample(coin, size = 5, replace = TRUE, prob = c(0.2, 0.8))
172 | ```
173 |
174 |
175 | -----
176 |
177 |
178 | ## Simulating tossing a coin
179 |
180 | Now that we've talked about `sample()`, let's use R to implement code that
181 | simulates tossing a fair coin one or more times.
182 |
183 | __Recap.__ To toss a coin using R, we first need an object that plays the role
184 | of a coin. A simple way to create a `coin` is using a vector with two elements:
185 | `"heads"` and `"tails"`. Then, to simulate tossing a coin one or more times,
186 | we use the `sample()` function.
187 | Here's how to simulate a coin toss using `sample()` to take a random sample of
188 | size 1 from `coin`:
189 |
190 | ```{r coin-vector}
191 | # coin object
192 | coin <- c("heads", "tails")
193 |
194 | # one toss
195 | sample(coin, size = 1)
196 | ```
197 |
198 | To simulate multiple tosses, just change the `size` argument, and specify
199 | sampling with replacement (`replace = TRUE`):
200 |
201 | ```{r various-tosses}
202 | # 3 tosses
203 | sample(coin, size = 3, replace = TRUE)
204 |
205 | # 6 tosses
206 | sample(coin, size = 6, replace = TRUE)
207 | ```
208 |
209 |
210 | ### Coin Simulations
211 |
212 | Now that we have all the elements to toss a coin with R, let's simulate flipping
213 | a coin 100 times, and use the function `table()` to count the resulting number
214 | of `"heads"` and `"tails"`:
215 |
216 | ```{r}
217 | # number of flips
218 | num_flips = 100
219 |
220 | # flips simulation
221 | coin = c('heads', 'tails')
222 | flips = sample(coin, size = num_flips, replace = TRUE)
223 |
224 | # number of heads and tails
225 | freqs = table(flips)
226 | freqs
227 | ```
228 |
229 | In my case, I got `r freqs[1]` heads and `r freqs[2]` tails. Your results will
230 | probably be different than mine. Some of you will get more `"heads"`, some of
231 | you will get more `"tails"`, and some will get exactly 50 `"heads"` and 50
232 | `"tails"`.
233 |
234 | Run another series of 100 flips, and find the frequency of `"heads"` and `"tails"`:
235 |
236 | ```{r}
237 | flips = sample(coin, size = num_flips, replace = TRUE)
238 | freqs = table(flips)
239 | freqs
240 | ```
241 |
242 | Let's make things a little bit more complex but also more interesting. The idea
243 | is to repeat 100 flips 1000 times. To carry out this simulation, we are going
244 | to use a programming structure called a `for` loop. This is one way to tell
245 | the computer to repeat the same action a given number of times.
246 | Don't worry about this. Just execute the following lines of code:
247 |
248 | ```{r}
249 | # total number of repetitions
250 | times = 1000
251 |
252 | # "empty" vectors to store number of heads and tails in each repetition
253 | heads = rep(0, times)
254 | tails = rep(0, times)
255 |
256 | # 100 flips of a coin, repeated 1000 times
257 | for (i in 1:times) {
258 | flips = sample(coin, size = 100, replace = TRUE)
259 | freqs = table(flips)
260 | heads[i] = freqs[1]
261 | tails[i] = freqs[2]
262 | }
263 | ```
264 |
265 | What the code above is doing is simulating 100 flips of a coin, not once,
266 | not twice, but 1000 times. In each repetition, we count how many `"heads"`
267 | and how many `"tails"`, and store those counts in the vectors `heads` and
268 | `tails`, respectively.
269 |
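As a quick sanity check (using the `heads` and `tails` vectors filled in by the loop above), the average count across the 1000 repetitions should be close to 50 for both:

```r
# average number of heads and tails across the 1000 repetitions;
# both should be close to 50 for a fair coin
mean(heads)
mean(tails)
```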
270 | Each vector, `heads` and `tails`, contains 1000 values. Moreover, we can make
271 | a bar chart to see the empirical relative frequencies:
272 |
273 | ```{r fig.align='center', out.width='75%', fig.height=4.5}
274 | barplot(table(heads)/1000, las = 1, cex.names = 0.5, border = NA,
275 | main = "Frequency of number of heads in 100 flips")
276 | ```
277 |
278 | ```{r fig.align='center', out.width='75%', fig.height=4.5}
279 | barplot(table(tails)/1000, las = 1, cex.names = 0.5, border = NA,
280 | main = "Frequency of number of tails in 100 flips")
281 | ```
282 |
283 |
284 |
285 | ## Frequencies
286 |
287 | Typical probability problems that have to do with coin tossing require
288 | computing the total proportion of `"heads"` and `"tails"`:
289 |
290 | ```{r five-tosses}
291 | # five tosses
292 | five <- sample(coin, size = 5, replace = TRUE)
293 |
294 | # proportion of heads
295 | sum(five == "heads") / 5
296 |
297 | # proportion of tails
298 | sum(five == "tails") / 5
299 | ```
300 |
301 | It is also customary to compute the relative frequencies of `"heads"` and
302 | `"tails"` in a series of tosses:
303 |
304 | ```{r relative-freqs}
305 | # relative frequencies of heads
306 | cumsum(five == "heads") / 1:length(five)
307 |
308 | # relative frequencies of tails
309 | cumsum(five == "tails") / 1:length(five)
310 | ```
311 |
312 | Likewise, it is common to look at how the relative frequencies of heads or
313 | tails change over a series of tosses:
314 |
315 | ```{r plot-freqs}
316 | set.seed(5938)
317 | hundreds <- sample(coin, size = 500, replace = TRUE)
318 | head_freqs = cumsum(hundreds == "heads") / 1:500
319 |
320 | plot(1:500, head_freqs, type = "l", ylim = c(0, 1), las = 1,
321 | col = "#3989f8", lwd = 2,
322 | xlab = 'number of tosses',
323 | ylab = 'frequency of heads')
324 | # reference line at 0.5
325 | abline(h = 0.5, col = 'gray50', lwd = 1.5, lty = 2)
326 | ```
327 |
328 | So far we have written code in R that simulates tossing a coin one or more
329 | times. We have included commands to compute the proportion of heads and tails,
330 | as well as the relative frequencies of heads (or tails) in a series of tosses.
331 | In addition, we have produced a plot of the relative frequencies and seen
332 | how, as the number of tosses increases, the frequency of heads (and tails)
333 | approaches 0.5.
334 |
335 |
336 | -----
337 |
338 | ## Simulating rolling a die
339 |
340 | Now that you know how to simulate flipping a coin one or more times, you can
341 | do the same to simulate rolling a die:
342 |
343 | ```{r}
344 | die = 1:6
345 |
346 | # rolling a die once
347 | sample(die, size = 1)
348 |
349 | # rolling a pair of dice
350 | sample(die, size = 2, replace = TRUE)
351 |
352 | # rolling a die 5 times
353 | sample(die, size = 5, replace = TRUE)
354 | ```
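In the same spirit as the coin simulations, you can roll the die many times and look at the relative frequency of each face. Here is a small sketch that reuses the `die` vector defined above; with a fair die, each proportion should be close to 1/6:

```r
# roll the die 1000 times and tabulate the outcomes
rolls = sample(die, size = 1000, replace = TRUE)

# relative frequency of each face (each close to 1/6)
table(rolls) / 1000
```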
355 |
356 |
357 |
--------------------------------------------------------------------------------
/scripts/12-chance-process.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/12-chance-process.pdf
--------------------------------------------------------------------------------
/scripts/Makefile:
--------------------------------------------------------------------------------
1 | # input files
2 | RMDS = $(wildcard *.Rmd)
3 |
4 | # output files
5 | PDFS = $(patsubst %.Rmd, %.pdf, $(RMDS))
6 | HTMLS = $(patsubst %.Rmd, %.html, $(RMDS))
7 |
8 |
9 | .PHONY: all htmls clean
10 |
11 |
12 | all: $(PDFS)
13 |
14 |
15 | htmls: $(HTMLS)
16 |
17 |
18 | %.pdf: %.Rmd
19 | Rscript -e "library(rmarkdown); render('$<', output_format = 'pdf_document')"
20 |
21 |
22 | %.html: %.Rmd
23 | Rscript -e "library(rmarkdown); render('$<', output_format = 'html_document')"
24 |
25 |
26 | clean:
27 | rm -rf *.pdf *.html
28 |
--------------------------------------------------------------------------------
/scripts/README.md:
--------------------------------------------------------------------------------
1 | # Intro Stats Scripts
2 |
3 | This folder contains the Rmd scripts used in lecture, as well as out of class, for the introductory Probability and Statistics courses at UC Berkeley.
4 |
5 |
6 |
--------------------------------------------------------------------------------
/scripts/images/karl-pearson.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/images/karl-pearson.jpg
--------------------------------------------------------------------------------
/scripts/images/western-conference-standings-2016.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/scripts/images/western-conference-standings-2016.png
--------------------------------------------------------------------------------
/syllabus/README.md:
--------------------------------------------------------------------------------
1 | ## Syllabus
2 |
3 | > - [Stat 20](syllabus-stat20.md)
4 | > - [Stat 131A](syllabus-stat131A.md)
5 |
6 | 
--------------------------------------------------------------------------------
/syllabus/mrs-mutner-rules.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ucb-introstat/introstat-spring-2017/e86aa87a0808f1752944037a166d01ab9cab525e/syllabus/mrs-mutner-rules.jpg
--------------------------------------------------------------------------------
/syllabus/syllabus-stat131A.md:
--------------------------------------------------------------------------------
1 | ## Course Syllabus Stat 131A
2 |
3 | Stat 131A: Introduction to Probability and Statistics for Life Scientists, Spring 2017
4 |
5 | - __Instructor:__ Gaston Sanchez, gaston.stat[at]gmail.com
6 | - __Class Time:__ MWF 2-3pm in 50 Birge
7 | - __Session Dates:__ 01/18/17 - 05/05/17
8 | - __Code #:__ 23461
9 | - __Units:__ 4 (more info [here](http://classes.berkeley.edu/content/2017-spring-stat-131A-001-lec-001))
10 | - __Office Hours:__ TuTh 11:30am-12:30pm in 309 Evans (or by appointment)
11 | - __Text:__ Statistics, 4th edition (by Freedman, Pisani, and Purves)
12 | - __Final:__ Tue, May-09, 11:30am-2:30pm
13 | - __GSIs:__ Office hours of the GSIs will be posted on the bCourses page. You can go to the office hours of __any__ GSI, not just your own.
14 |
15 | | Discussion | Date | Room | GSI |
16 | |------------|-----------|------------------|--------------|
17 | | 101 | MW 3-4pm | 250 Sutardja Dai | Shuhui Huang |
18 | | 102 | MW 3-4pm | B51 Hildebrand | Andy Mao |
19 | | 103 | MW 4-5pm | 9 Evans | Shuhui Huang |
20 | | 104 | MW 5-6pm | 70 Evans | Andy Mao |
21 |
22 |
23 | ### Description
24 |
25 | __Statistics 131A__ is a course designed primarily as an introductory course for statistical thinking. You do need to be comfortable with math at the level of intermediate algebra (e.g., the equation of a straight line, plotting points, taking powers and roots, percentages).
26 |
27 | The emphasis of the course is critical thinking about quantitative evidence. Topics include reasoning and fallacies, descriptive statistics, association, correlation, regression, elements of probability, set theory, chance variability, random variables, expectation, standard error, sampling, hypothesis tests, confidence intervals, experiments and observational studies.
28 |
29 |
30 | ### Methods of Instruction
31 |
32 | Using a combination of lecture and student participation, each class session will focus on learning the fundamentals. The required textbook is the classic __Statistics__ 4th edition by Freedman, Pisani and Purves.
33 |
34 | I firmly believe that one cannot do statistical computations without the help of good statistical software. In this course, you will be asked to do various assignments and practical work using the [statistical software R](https://www.r-project.org/). The main idea is to use R as a supporting tool to help you apply the key concepts of the textbook.
35 |
36 |
37 | ### Homework Assignments
38 |
39 | - Homework assignments will be assigned almost every week (about 13 HW).
40 | - You will submit your homework via bCourses electronically (as a word, text, pdf, or html file).
41 | - I will drop your lowest HW score.
42 | - Don't wait until the last hour to do an assignment. Plan ahead and pace yourself.
43 | - Don't wait until the last minute to submit your assignment.
44 | - __No late assignments__ will be accepted, for any reason, including, but not limited to, theft or any extraordinary circumstances (e.g. illness, exhaustion, mourning, loss of internet connection, bCourses is down, broken computer).
45 | - Note that answers to non-review questions are in the back of the book so you can check that your answer is correct.
46 | - Solutions to the review exercises will be posted on bCourses.
47 | - If you collaborate with other students when working on a HW assignment, please include the names of those students in your submission.
48 | - You must write your own answers (using your own words). Copy and plagiarism will not be tolerated (see _Academic Honesty_ policy).
49 |
50 |
51 | ### Discussion
52 |
53 | - Discussion is an important part of the class and is meant to supplement lecture.
54 | - Your GSI will review and expand on concepts introduced in lecture and encourage you to problem solve in groups.
55 | - There will be about 4 or 5 short quizzes given in discussion to test your understanding.
56 | - Your quiz scores __will NOT__ be part of your grade.
57 | - Students must attend the discussion group they are officially registered in.
58 |
59 |
60 | ### Exams
61 |
62 | - There will be two 50-minute in-class midterms, and one 3-hour final exam.
63 | - The tentative dates of the midterms are Friday Feb-24, and Friday Apr-07.
64 | - The final exam is currently scheduled for Tuesday, May 9th from 11:30am-2:30pm. (classroom to be announced).
65 | - If you do not take the final, you will NOT pass the class.
66 | - There will be __no early or makeup exams__.
67 | - ~~To ask for regrading, you must answer a test using pen. Tests answered with pencil will not be accepted for regrading.~~
68 | - We will use _Gradescope_ to grade your tests (so you can use pen or pencil).
69 | - When asking for regrading, please clearly state the reasons that make you think you deserve a higher score.
70 | - You will have one full week after grades are published on Gradescope to request a regrade.
71 | - After the regrade deadline, no requests will be considered.
72 |
73 |
74 | ### Grading Structure
75 |
76 | - 20% homework (lowest 1 dropped)
77 | - 25% midterm 1
78 | - 25% midterm 2
79 | - 30% final
80 |
81 | No individual letter grades will be given for the midterms or the final. You will get a letter grade for the course based on your overall score. Final grades will be assigned on a 30/30/30/10 (A/B/C/D-F) scale.
82 |
83 |
84 | ### Calculator Policy
85 |
86 | - You will need one that adds, subtracts, multiplies, divides, takes square roots, raises numbers to a power, and preferably also computes factorials. Statistical calculators are unnecessary.
87 | - However, no graphing calculators, phone calculators or tablet calculators are allowed.
88 | - If you do not bring a calculator to a midterm or the final, do your computations by hand (you won't be allowed to borrow someone else's calculator).
89 |
90 |
91 | ### Academic Honesty
92 |
93 | I (Gaston Sanchez) expect you to do your own work and to uphold the standards of intellectual integrity. Collaborating on homework is fine and I encourage you to work together---but copying is not, nor is having somebody else submit assignments for you. Cheating will not be tolerated. Anyone found cheating will receive an F and will be reported to the [Center for the Student Conduct](http://sa.berkeley.edu/conduct). If you are having trouble with an assignment or studying for an exam, or if you are uncertain about permissible and impermissible conduct or collaboration, please come see me with your questions.
94 |
95 |
96 | ### Email Policy
97 |
98 | - You should try to use email as a tool to set up a one-on-one meeting with me if office hours conflict with your schedule.
99 | - Use the subject line __Stat 131 Meeting Request__.
100 | - Your message should include at least two times when you would like to meet and a brief (one-two sentence) description of the reason for the meeting.
101 | - Do NOT expect me to reply right away (I may not reply on time).
102 | - If you have an emergency, talk to me later during class or office hours.
103 | - I strongly encourage you to ask questions about the syllabus, covered material, and assignments during class time or lab discussions.
104 | - I prefer to have conversations in person rather than via email, thus allowing us to get to know each other better and fostering a more collegial learning atmosphere.
105 |
106 |
107 | ### Accommodation Policy
108 |
109 | Students needing accommodations for any physical, psychological, or learning disability, should speak with me during the first two weeks of the semester, either after class or during office hours and see [http://dsp.berkeley.edu](http://dsp.berkeley.edu) to learn about Berkeley’s policy. If you are a DSP student, please contact me at least three weeks prior to a midterm or final so that we can work out acceptable accommodations via the DSP Office.
110 |
111 | If you are an athlete or Cal band member, please check your calendar and come see me during office hours as soon as possible within the first two weeks of the semester. Please try your best to be present at each of the midterms, as I cannot guarantee accommodation for a late exam.
112 |
113 |
114 | ### Safe, Supportive, and Inclusive Environment
115 |
116 | Whenever a faculty member, staff member, post-doc, or GSI is responsible for
117 | the supervision of a student, a personal relationship between them of a
118 | romantic or sexual nature, even if consensual, is against university policy.
119 | Any such relationship jeopardizes the integrity of the educational process.
120 |
121 | Although faculty and staff can act as excellent resources for students, you
122 | should be aware that they are required to report any violations of this campus
123 | policy. If you wish to have a confidential discussion on matters related to this
124 | policy, you may contact the _Confidential Care Advocates_ on campus for support
125 | related to counseling or sensitive issues. Appointments can be
126 | made by calling (510) 642-1988.
127 |
128 | The classroom, lab, and work place should be safe and inclusive environments
129 | for everyone. The _Office for the Prevention of Harassment and Discrimination_
130 | (OPHD) is responsible for ensuring the University provides an environment for
131 | faculty, staff and students that is free from discrimination and harassment on
132 | the basis of categories including race, color, national origin, age, sex, gender,
133 | gender identity, and sexual orientation. Questions or concerns?
134 | Call (510) 643-7985, email ask_ophd@berkeley.edu, or go to
135 | [http://survivorsupport.berkeley.edu/](http://survivorsupport.berkeley.edu/).
136 |
137 |
138 | ### Incomplete Policy
139 |
140 | Under emergency/special circumstances, students may petition me to receive an Incomplete grade. By University policy, for a student to get an Incomplete requires (i) that the student was performing passing-level work until the time that (ii) something happened that---through no fault of the student---prevented the student from completing the coursework. If you take the final, you completed the course, even if you took it while ill, exhausted, mourning, etc. The time to talk to me about incomplete grades is BEFORE you take the final, when the situation that prevents you from finishing presents itself. Please clearly state your reasoning in your comments to me.
141 |
142 | It is your responsibility to develop good time management skills, good studying habits, know your limits, and learn to ask for professional help.
143 | Life happens. Social, family, cultural, academic, and individual circumstances can affect your performance (both positively and negatively). If you find yourself in a situation that raises concerns about passing the course, please come see me as soon as possible. Above all, please do not wait until the end of the semester to share your concerns about passing the course, because it will be too late by then.
144 |
145 |
146 | ### Letters of Recommendation
147 |
148 | Unless I have known you at least one year, and we have developed a good collegial relationship, I do not provide letters of recommendation.
149 |
150 |
151 | ### Additional Course Policies
152 |
153 | - Be sure to pay attention to deadlines.
154 | - In consideration to everybody in the classroom, please turn off your cell phone during class and lab time.
155 |
156 |
157 |
158 | ### Fine Print
159 |
160 | The course deadlines, assignments, exam times and material are subject to change at the whim of the professor.
161 |
162 |
--------------------------------------------------------------------------------
/syllabus/syllabus-stat20.md:
--------------------------------------------------------------------------------
1 | ## Course Syllabus Stat 20
2 |
3 | Stat 20: Introduction to Probability and Statistics, Spring 2017
4 |
5 | - __Instructor:__ Gaston Sanchez, gaston.stat[at]gmail.com
6 | - __Class Time:__ MWF 12-1pm in 2050 VLSB
7 | - __Session Dates:__ 01/18/17 - 05/05/17
8 | - __Code #:__ 23407
9 | - __Units:__ 4 (more info [here](http://classes.berkeley.edu/content/2017-spring-stat-20-001-lec-001))
10 | - __Office Hours:__ TuTh 11:30am-12:30pm in 309 Evans (or by appointment)
11 | - __Text:__ Statistics, 4th edition (by Freedman, Pisani, and Purves)
12 | - __Final:__ Wed, May-10, 3:00-6:00pm
13 | - __GSIs:__ Office hours of the GSIs will be posted on the bCourses page. You can go to the office hours of __any__ GSI, not just your own.
14 |
15 | | Discussion | Date | Room | GSI |
16 | |------------|--------------|--------------|-----------------|
17 | | 101 | TuTh 9-10A | 332 Evans | Yoni Ackerman |
18 | | 102 | TuTh 9-10A | 334 Evans | Yizhou Zhao |
19 | | 103 | TuTh 10-11A | 332 Evans | Yoni Ackerman |
20 | | 104 | TuTh 10-11A | 334 Evans | Yizhou Zhao |
21 | | 105 | TuTh 11-12P | 332 Evans | Mingjia Chen |
22 | | 106 | TuTh 11-12P | 334 Evans | Jill Berkin |
23 | | 107 | TuTh 12-1P | 332 Evans | Mingjia Chen |
24 | | 108 | TuTh 1-2P | 332 Evans | Yanli Fan |
25 | | 109 | TuTh 2-3P | 334 Evans | Yanli Fan |
26 | | 110 | TuTh 2-3P | 340 Evans | Rohit Bahirwani |
27 | | 111 | TuTh 3-4P | 334 Evans | Shalika Gupta |
28 | | 112 | TuTh 3-4P | 340 Evans | Rohit Bahirwani |
29 | | 113 | TuTh 4-5P | 334 Evans | Calvin Chi |
30 | | 114 | TuTh 5-6P | 334 Evans | Calvin Chi |
31 | | 115 | TuTh 5-6P | 205 Dwinelle | Jill Berkin |
32 | | 116 | TuTh 5-6P | 187 Dwinelle | Shalika Gupta |
33 |
34 |
35 | ### Description
36 |
37 | __Statistics 20__ is a course designed primarily as an introductory course for statistical thinking. You do need to be comfortable with math at the level of intermediate algebra (e.g., the equation of a straight line, plotting points, taking powers and roots, percentages).
38 |
39 | The emphasis of the course is critical thinking about quantitative evidence. Topics include reasoning and fallacies, descriptive statistics, association, correlation, regression, elements of probability, set theory, chance variability, random variables, expectation, standard error, sampling, hypothesis tests, confidence intervals, experiments and observational studies.
40 |
41 |
42 | ### Methods of Instruction
43 |
44 | Using a combination of lecture and student participation, each class session will focus on learning the fundamentals. The required textbook is the classic __Statistics__ 4th edition by Freedman, Pisani and Purves.
45 |
46 | I firmly believe that one cannot do statistical computations without the help of good statistical software. In this course, you will be asked to do various assignments and practical work using the [statistical software R](https://www.r-project.org/). The main idea is to use R as a supporting tool to help you apply the key concepts of the textbook.
47 |
48 |
49 | ### Homework Assignments
50 |
51 | - Homework assignments will be assigned almost every week (about 13 HW).
52 | - You will submit your homework via bCourses electronically (as a word, text, pdf, or html file).
53 | - I will drop your lowest HW score.
54 | - Don't wait until the last hour to do an assignment. Plan ahead and pace yourself.
55 | - Don't wait until the last minute to submit your assignment.
56 | - __No late assignments__ will be accepted, for any reason, including, but not limited to, theft or any extraordinary circumstances (e.g. illness, exhaustion, mourning, loss of internet connection, bCourses is down, broken computer).
57 | - Note that answers to non-review questions are in the back of the book so you can check that your answer is correct.
58 | - Solutions to the review exercises will be posted on bCourses.
59 | - If you collaborate with other students when working on a HW assignment, please include the names of those students in your submission.
60 | - You must write your own answers (using your own words). Copy and plagiarism will not be tolerated (see _Academic Honesty_ policy).
61 |
62 |
63 | ### Discussion
64 |
65 | - Discussion is an important part of the class and is meant to supplement lecture.
66 | - Your GSI will review and expand on concepts introduced in lecture and encourage you to problem solve in groups.
67 | - There will be about 4 or 5 short quizzes given in discussion to test your understanding.
68 | - Your quiz scores __will NOT__ be part of your grade.
69 | - Students must attend the discussion group they are officially registered in.
70 |
71 |
72 | ### Exams
73 |
74 | - There will be two 50-minute in-class midterms, and one 3-hour final exam.
75 | - The tentative dates of the midterms are Friday Feb-24, and Friday Apr-07.
76 | - The final exam is currently scheduled for Wednesday, May 10th, from 3:00-6:00 pm (classroom to be announced).
77 | - If you do not take the final, you will NOT pass the class.
78 | - There will be __no early or makeup exams__.
79 | - ~~To ask for regrading, you must answer a test using pen. Tests answered with pencil will not be accepted for regrading.~~
80 | - We will use _Gradescope_ to grade your tests (so you can use pen or pencil).
81 | - When asking for regrading, please clearly state the reasons that make you think you deserve a higher score.
82 | - You will have one full week after grades are published on Gradescope to request a regrade.
83 | - After the regrade deadline, no requests will be considered.
84 |
85 |
86 |
87 | ### Grading Structure
88 |
89 | - 20% homework (lowest 1 dropped)
90 | - 25% midterm 1
91 | - 25% midterm 2
92 | - 30% final
93 |
94 | No individual letter grades will be given for the midterms or the final. You will get a letter grade for the course based on your overall score. Final grades will be assigned on a 30/30/30/10 (A/B/C/D-F) scale.
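The weighting above is a straightforward weighted average. As a quick illustration in R (the software used in this course), the overall score could be computed as follows; all the scores shown are hypothetical, made-up values:

```r
# Hypothetical scores on a 0-100 scale (illustrative values only)
hw <- c(90, 85, 70, 95, 88, 92, 80, 75, 100, 86, 91, 89, 84)

# Drop the single lowest HW score, then average the rest
hw_avg <- mean(sort(hw)[-1])

midterm1 <- 82
midterm2 <- 88
final <- 90

# Course weights: 20% HW, 25% each midterm, 30% final
overall <- 0.20 * hw_avg + 0.25 * midterm1 + 0.25 * midterm2 + 0.30 * final
round(overall, 2)  # about 87.08
```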
95 |
96 |
97 | ### Calculator Policy
98 |
99 | - You will need a calculator that adds, subtracts, multiplies, divides, takes square roots, raises numbers to a power, and preferably also computes factorials. Statistical calculators are unnecessary.
100 | - However, no graphing calculators, phone calculators, or tablet calculators are allowed.
101 | - If you do not bring a calculator to a midterm or the final, you will have to do your computations by hand (you won't be allowed to borrow someone else's calculator).
102 |
103 |
104 | ### Academic Honesty
105 |
106 | I (Gaston Sanchez) expect you to do your own work and to uphold the standards of intellectual integrity. Collaborating on homework is fine and I encourage you to work together---but copying is not, nor is having somebody else submit assignments for you. Cheating will not be tolerated. Anyone found cheating will receive an F and will be reported to the [Center for the Student Conduct](http://sa.berkeley.edu/conduct). If you are having trouble with an assignment or studying for an exam, or if you are uncertain about permissible and impermissible conduct or collaboration, please come see me with your questions.
107 |
108 |
109 | ### Email Policy
110 |
111 | - You should try to use email as a tool to set up a one-on-one meeting with me if office hours conflict with your schedule.
112 | - Use the subject line __Stat 20 Meeting Request__.
113 | - Your message should include at least two times when you would like to meet and a brief (one- to two-sentence) description of the reason for the meeting.
114 | - Do NOT expect me to reply right away (I may not be able to reply promptly).
115 | - If you have an emergency, talk to me later during class or office hours.
116 | - I strongly encourage you to ask questions about the syllabus, covered material, and assignments during class time or lab discussions.
117 | - I prefer to have conversations in person rather than via email, thus allowing us to get to know each other better and fostering a more collegial learning atmosphere.
118 |
119 |
120 | ### Accommodation Policy
121 |
122 | Students needing accommodations for any physical, psychological, or learning disability should speak with me during the first two weeks of the semester, either after class or during office hours, and see [http://dsp.berkeley.edu](http://dsp.berkeley.edu) to learn about Berkeley’s policy. If you are a DSP student, please contact me at least three weeks prior to a midterm or the final so that we can work out acceptable accommodations via the DSP Office.
123 |
124 | If you are an athlete or Cal Band member, please check your calendar and come see me during office hours as soon as possible within the first two weeks of the semester. Please try your best to be present at each of the midterms, as I cannot guarantee accommodation for a late exam.
125 |
126 |
127 | ### Safe, Supportive, and Inclusive Environment
128 |
129 | Whenever a faculty member, staff member, post-doc, or GSI is responsible for
130 | the supervision of a student, a personal relationship between them of a
131 | romantic or sexual nature, even if consensual, is against university policy.
132 | Any such relationship jeopardizes the integrity of the educational process.
133 |
134 | Although faculty and staff can act as excellent resources for students, you
135 | should be aware that they are required to report any violations of this campus
136 | policy. If you wish to have a confidential discussion on matters related to this
137 | policy, you may contact the _Confidential Care Advocates_ on campus for support
138 | related to counseling or sensitive issues. Appointments can be
139 | made by calling (510) 642-1988.
140 |
141 | The classroom, lab, and work place should be safe and inclusive environments
142 | for everyone. The _Office for the Prevention of Harassment and Discrimination_
143 | (OPHD) is responsible for ensuring the University provides an environment for
144 | faculty, staff and students that is free from discrimination and harassment on
145 | the basis of categories including race, color, national origin, age, sex, gender,
146 | gender identity, and sexual orientation. Questions or concerns?
147 | Call (510) 643-7985, email ask_ophd@berkeley.edu, or go to
148 | [http://survivorsupport.berkeley.edu/](http://survivorsupport.berkeley.edu/).
149 |
150 |
151 | ### Incomplete Policy
152 |
153 | Under emergency/special circumstances, students may petition me to receive an Incomplete grade. By University policy, for a student to get an Incomplete requires (i) that the student was performing passing-level work until the time that (ii) something happened that---through no fault of the student---prevented the student from completing the coursework. If you take the final, you completed the course, even if you took it while ill, exhausted, mourning, etc. The time to talk to me about incomplete grades is BEFORE you take the final, when the situation that prevents you from finishing presents itself. Please clearly state your reasoning in your comments to me.
154 |
155 | It is your responsibility to develop good time management skills and good study habits, to know your limits, and to learn to ask for professional help.
156 | Life happens. Social, family, cultural, academic, and individual circumstances can affect your performance (both positively and negatively). If you find yourself in a situation that raises concerns about passing the course, please come see me as soon as possible. Above all, please do not wait until the end of the semester to share your concerns about passing the course, because by then it will be too late.
157 |
158 |
159 | ### Letters of Recommendation
160 |
161 | Unless I have known you for at least one year and we have developed a good collegial relationship, I do not provide letters of recommendation.
162 |
163 |
164 | ### Additional Course Policies
165 |
166 | - Be sure to pay attention to deadlines.
167 | - Out of consideration for everybody in the classroom, please turn off your cell phone during class and lab time.
168 |
169 |
170 |
171 | ### Fine Print
172 |
173 | The course deadlines, assignments, exam times and material are subject to change at the whim of the professor.
174 |
175 |
--------------------------------------------------------------------------------