├── .RData
├── Fig 1.png
├── Fig 2.png
├── v59i10.pdf
├── 02_Working_with_data.docx
├── KRG_elegant_logo_for_light_BG.png
├── 09_Nonparametric_tests_files
└── figure-html
│ ├── Box plot-1.png
│ └── Scatter plot-1.png
├── 05_Sampling_distributions_files
└── figure-html
│ ├── The t distribution-1.png
│ ├── Repeat study 10 times-1.png
│ ├── Repeat study 100 times-1.png
│ ├── Repeat study 1000 times-1.png
│ ├── The chi-squared distribution-1.png
│ ├── Drawing random samples 10000 times-1.png
│ └── Two-thousand values from a discrete uniform distribution-1.png
├── 06_A_parametric_comparison_of_means_files
└── figure-html
│ ├── unnamed-chunk-1-1.png
│ ├── unnamed-chunk-2-1.png
│ ├── Out two-tailed hypothesis-1.png
│ ├── Indicating the critical t values-1.png
│ └── Box plot of ages for each answer group-1.png
├── 08_Assumptions_for_parametric_tests_files
└── figure-html
│ ├── QQ plot of crp-1.png
│ ├── Plotting the data-1.png
│ ├── A QQ plot of the normal data-1.png
│ └── Creating and plotting normal data-1.png
├── README.md
├── __Template.Rmd
├── .Rhistory
├── Cumulative distributions functions.Rmd
├── ProjectIIData.csv
├── 08_Assumptions_for_parametric_tests.Rmd
├── 10_Analyzing_categorical_data.Rmd
├── 09_Nonparametric_tests.Rmd
├── 09_Nonparametric_tests_full.Rmd
├── 07_Correlation_and_linear_models.Rmd
├── 03_Summary_statistics.Rmd
├── 11_Logistic_regression.Rmd
├── ProjectData.csv
├── 04 Visualizing data.Rmd
├── 05_Sampling_distributions.Rmd
├── 06_A_parametric_comparison_of_means.Rmd
├── 04_Inbuilt_R_plots.Rmd
└── 02_Working_with_data.Rmd
/.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/.RData
--------------------------------------------------------------------------------
/Fig 1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/Fig 1.png
--------------------------------------------------------------------------------
/Fig 2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/Fig 2.png
--------------------------------------------------------------------------------
/v59i10.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/v59i10.pdf
--------------------------------------------------------------------------------
/02_Working_with_data.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/02_Working_with_data.docx
--------------------------------------------------------------------------------
/KRG_elegant_logo_for_light_BG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/KRG_elegant_logo_for_light_BG.png
--------------------------------------------------------------------------------
/09_Nonparametric_tests_files/figure-html/Box plot-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/09_Nonparametric_tests_files/figure-html/Box plot-1.png
--------------------------------------------------------------------------------
/09_Nonparametric_tests_files/figure-html/Scatter plot-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/09_Nonparametric_tests_files/figure-html/Scatter plot-1.png
--------------------------------------------------------------------------------
/05_Sampling_distributions_files/figure-html/The t distribution-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/05_Sampling_distributions_files/figure-html/The t distribution-1.png
--------------------------------------------------------------------------------
/05_Sampling_distributions_files/figure-html/Repeat study 10 times-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/05_Sampling_distributions_files/figure-html/Repeat study 10 times-1.png
--------------------------------------------------------------------------------
/05_Sampling_distributions_files/figure-html/Repeat study 100 times-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/05_Sampling_distributions_files/figure-html/Repeat study 100 times-1.png
--------------------------------------------------------------------------------
/05_Sampling_distributions_files/figure-html/Repeat study 1000 times-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/05_Sampling_distributions_files/figure-html/Repeat study 1000 times-1.png
--------------------------------------------------------------------------------
/06_A_parametric_comparison_of_means_files/figure-html/unnamed-chunk-1-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/06_A_parametric_comparison_of_means_files/figure-html/unnamed-chunk-1-1.png
--------------------------------------------------------------------------------
/06_A_parametric_comparison_of_means_files/figure-html/unnamed-chunk-2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/06_A_parametric_comparison_of_means_files/figure-html/unnamed-chunk-2-1.png
--------------------------------------------------------------------------------
/08_Assumptions_for_parametric_tests_files/figure-html/QQ plot of crp-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/08_Assumptions_for_parametric_tests_files/figure-html/QQ plot of crp-1.png
--------------------------------------------------------------------------------
/05_Sampling_distributions_files/figure-html/The chi-squared distribution-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/05_Sampling_distributions_files/figure-html/The chi-squared distribution-1.png
--------------------------------------------------------------------------------
/08_Assumptions_for_parametric_tests_files/figure-html/Plotting the data-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/08_Assumptions_for_parametric_tests_files/figure-html/Plotting the data-1.png
--------------------------------------------------------------------------------
/05_Sampling_distributions_files/figure-html/Drawing random samples 10000 times-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/05_Sampling_distributions_files/figure-html/Drawing random samples 10000 times-1.png
--------------------------------------------------------------------------------
/06_A_parametric_comparison_of_means_files/figure-html/Out two-tailed hypothesis-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/06_A_parametric_comparison_of_means_files/figure-html/Out two-tailed hypothesis-1.png
--------------------------------------------------------------------------------
/08_Assumptions_for_parametric_tests_files/figure-html/A QQ plot of the normal data-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/08_Assumptions_for_parametric_tests_files/figure-html/A QQ plot of the normal data-1.png
--------------------------------------------------------------------------------
/06_A_parametric_comparison_of_means_files/figure-html/Indicating the critical t values-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/06_A_parametric_comparison_of_means_files/figure-html/Indicating the critical t values-1.png
--------------------------------------------------------------------------------
/08_Assumptions_for_parametric_tests_files/figure-html/Creating and plotting normal data-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/08_Assumptions_for_parametric_tests_files/figure-html/Creating and plotting normal data-1.png
--------------------------------------------------------------------------------
/06_A_parametric_comparison_of_means_files/figure-html/Box plot of ages for each answer group-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/06_A_parametric_comparison_of_means_files/figure-html/Box plot of ages for each answer group-1.png
--------------------------------------------------------------------------------
/05_Sampling_distributions_files/figure-html/Two-thousand values from a discrete uniform distribution-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Medical-statistics-using-R/master/05_Sampling_distributions_files/figure-html/Two-thousand values from a discrete uniform distribution-1.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Medical-statistics-using-R
2 | This repository contains the RMD files and other resources for my _on-campus_ course on __medical statistics using the R language for statistical analysis__.
3 | The course takes place in the Division of General Surgery Seminar Room, on J-floor in the Old Main Building of Groote Schuur Hospital. Tutorials start at 17H00 every Tuesday and Thursday evening, the latter being a repeat of the Tuesday tutorial for those who cannot attend the first session.
4 |
--------------------------------------------------------------------------------
/__Template.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "0"
3 | author: "Dr Juan H Klopper"
4 | date: "February 21, 2019"
5 | output:
6 | html_document:
7 | toc: true
8 | number_sections: false
9 | ---
10 |
11 |
16 |
17 | 
18 |
19 | ```{r setup, include=FALSE}
20 | knitr::opts_chunk$set(echo = TRUE)
21 | #setwd(getwd())
22 | ```
23 |
24 | ## Preamble
25 |
26 |
--------------------------------------------------------------------------------
/.Rhistory:
--------------------------------------------------------------------------------
1 | knitr::opts_chunk$set(echo = TRUE)
2 | setwd(getwd())
3 | curve(dt(x, df = 30),
4 | from = -3, to = 3,
5 | lwd = 3,
6 | ylab = "Distribution",
7 | xlab = "_t_ statistic")
8 | ind <- c(1, 2, 3, 5, 10)
9 | for (i in ind) curve(dt(x, df = i), -3, 3, add = TRUE)
10 | curve(dt(x, df = 30),
11 | from = -3, to = 3,
12 | lwd = 3,
13 | ylab = "Distribution",
14 | xlab = "t statistic")
15 | ind <- c(1, 2, 3, 5, 10)
16 | for (i in ind) curve(dt(x, df = i), -3, 3, add = TRUE)
17 | curve(dchisq(x,
18 | df = 1,
19 | from = 0,
20 | to = 10,
21 | lwd = 3))
22 | curve(dchisq(x,
23 | df = 1),
24 | from = 0,
25 | to = 10,
26 | lwd = 3))
27 | curve(dchisq(x,
28 | df = 1),
29 | from = 0,
30 | to = 10,
31 | lwd = 3)
32 | curve(dchisq(x,
33 | df = 1),
34 | from = 0,
35 | to = 10,
36 | lwd = 3)
37 | ind <- c(1, 2, 3, 5, 10)
38 | for (i in ind) curve(dchisq(x,
39 | df = i),
40 | -3,
41 | 3,
42 | add = TRUE)
43 | curve(dchisq(x,
44 | df = 1),
45 | from = 0,
46 | to = 10,
47 | lwd = 3)
48 | ind <- c(1, 2, 3, 5, 10)
49 | for (i in ind) curve(dchisq(x,
50 | df = i),
51 | 0,
52 | 10,
53 | add = TRUE)
54 | curve(dchisq(x,
55 | df = 1),
56 | from = 0,
57 | to = 10,
58 | lwd = 3,
59 | ylab = "Distribution",
60 | xlab = "Statistic")
61 | ind <- c(1, 2, 3, 5, 10)
62 | for (i in ind) curve(dchisq(x,
63 | df = i),
64 | 0,
65 | 10,
66 | add = TRUE)
67 | curve(dchisq(x,
68 | df = 1),
69 | from = 0,
70 | to = 10,
71 | lwd = 3,
72 | ylab = "Distribution",
73 | xlab = "Statistic")
74 | ind <- c(2, 3, 5, 10)
75 | for (i in ind) curve(dchisq(x,
76 | df = i),
77 | 0,
78 | 10,
79 | add = TRUE)
80 | install.packages("pastecs")
81 | install.packages("car")
82 | knitr::opts_chunk$set(echo = TRUE)
83 | setwd(getwd())
84 | set.seed(123) # Seeding the pseudo random number generator to ensure reproducible values
85 | hb <- rnorm(100, # On hundred values from a normal distrbution
86 | mean = 15, # Mean of 15
87 | sd = 3) # Standard deviationof three
88 | hist(hb, prob = TRUE, main = "Histogram of hemoglobin values", las = 1, xlab = "Hemoglobin")
89 | lines(density(hb)) # A density plot line
90 | qqnorm(hb, main = "QQ plot of hemoglobin values")
91 | qqline(hb)
92 | set.seed(123)
93 | crp = rgamma(100, 2, 2)
94 | hist(crp,
95 | prob = TRUE,
96 | main = "Histogram of c-reactive protein values",
97 | las = 1,
98 | xlab = "CRP",
99 | ylim = c(0, 0.9))
100 | lines(density(crp))
101 | qqnorm(crp, main = "QQ plot of CRP values")
102 | qqline(crp)
103 | summary(hb)
104 | mean(hb)
105 | library(pastecs)
106 | round(stat.desc(hb, basic = FALSE, norm = TRUE), digits = 3)
107 | shapiro.test(hb)
108 | shapiro.test(crp)
109 | df = read.csv(file = "ProjectIIData.csv", header = TRUE)
110 | head(df)
111 | library(car)
112 | leveneTest(df$CRP, df$Group, center = mean)
113 |
--------------------------------------------------------------------------------
/Cumulative distributions functions.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Cumulative distribution functions"
3 | output: html_document
4 | ---
5 |
6 | ```{r setup, include=FALSE}
7 | knitr::opts_chunk$set(echo = TRUE)
8 | setwd(getwd())
9 | ```
10 |
11 | from https://chemicalstatistician.wordpress.com/2013/06/25/exploratory-data-analysis-2-ways-of-plotting-empirical-cumulative-distribution-functions-in-r/
12 |
13 | ## Data
14 |
15 | ```{r The air quality dataset}
16 | data("airquality")
17 | ```
18 |
19 | ## The empirical cumulative distribution function
20 |
21 | The empirical cumulative distribution function (eCDF) is a statistical function that Assigns each data point value for a numerical variable a probability of $\frac{1}{n}$. The data point values are then sorted in ascending order. For any given data point value, the assigned probabilities for all the values less than the given value is summed. This step-function is denoted in two ways as shown in equation (1).
22 |
23 | $$\begin{align} &\hat{F}_n \left( x \right) \\ &\hat{P}_n \left( X \le x \right) \end{align}
24 | \tag{1}$$
25 |
26 | Since we are stepping through each data point vale, $x_i$, we can assign a value of $0$ or $1$ to each, as shown in equation (2). This is obvious from context, as we are simply counting the probabilities of all the values less than the given data point value. This step-function is known as the _indicator function_.
27 |
28 | $$I \left( x_i \le x \right) = \begin{cases} 1, \text{when } x_i \le x \\ 0, \text{when } x_i > x \end{cases} \tag{2}$$
29 |
30 | Finally, we have the eCDF in equation (3).
31 |
32 | $$\hat{P}_n \left( X \le x \right) = \frac{\sum_{i = 1}^n I \left( x_i \le x \right)}{n} \tag{3}$$
33 |
34 | In essence, we have just complicated the fact that we are simply counting how many values are less than the given data point value and divide that by the smaple size!
35 |
36 | ## Creating random data point values
37 |
38 | Below, we create $100$ data point values form a normal distribution with a mean of $100$ and a standard deviation of $10$.
39 |
40 | ```{r Creating random values form a normal distribution}
41 | set.seed(123) # Seed the pseudo random number generator for reproducible results
42 | normal_vals <- rnorm(100,
43 | mean = 100,
44 | sd = 10)
45 | ```
46 |
47 | We can view a simple histogram of these values.
48 |
49 | ```{r Histogram of normally distributed random values}
50 | hist(normal_vals,
51 | main = "Normal distribution",
52 | xlab = "Values",
53 | ylab = "Counts",
54 | ylim = c(0, 30),
55 | labels = TRUE,
56 | col = "orange")
57 | ```
58 |
59 | ## Creating an eCDF plot
60 |
61 | First we need to use equation (3).
62 |
63 | ```{r Creating the values for the eCDF}
64 | normal_vals_ecdf <- ecdf(normal_vals)
65 | ```
66 |
67 | These values are then in actual fact, the quantile value for each data point value.
68 |
69 | ```{r eCDF plot}
70 | plot(normal_vals_ecdf,
71 | main = "eCDF",
72 | xlab = "Values",
73 | ylab = "Quantile")
74 | ```
75 |
76 |
--------------------------------------------------------------------------------
/ProjectIIData.csv:
--------------------------------------------------------------------------------
1 | CRP,Triglyceride,Cholesterol,Improvement,Group
2 | 11.7,230,7.7,Yes,A
3 | 9.1,346,5.1,Yes,C
4 | 9.2,274,6.1,Yes,C
5 | 6.1,395,3,No,C
6 | 11,274,4,Yes,C
7 | 10.3,305,5.1,Yes,A
8 | 11.2,340,5.8,No,A
9 | 9.2,271,7.8,No,C
10 | 9.3,262,6.3,Yes,A
11 | 10.5,325,4.4,No,A
12 | 12.2,315,9,No,A
13 | 12.2,299,4.7,Yes,C
14 | 8.6,308,6.1,No,C
15 | 7.7,312,3.9,No,C
16 | 11.9,263,2.8,No,C
17 | 9,302,8.2,No,A
18 | 11.3,278,5.7,No,A
19 | 9,229,6.4,No,C
20 | 8.3,325,6.4,No,A
21 | 8.2,220,2.5,Yes,A
22 | 10.3,274,3.2,Yes,A
23 | 9,260,7,Yes,C
24 | 8.2,288,3.2,No,C
25 | 10.4,274,8,Yes,C
26 | 13,363,6.4,Yes,C
27 | 9.4,304,4.9,Yes,A
28 | 9.4,389,2.6,Yes,A
29 | 9.3,243,3.5,Yes,C
30 | 6.7,316,3.8,No,A
31 | 10.8,238,2.3,Yes,A
32 | 9.3,347,2.1,Yes,A
33 | 13.2,276,11.7,No,C
34 | 9.8,275,3.9,No,C
35 | 7.4,269,5.9,Yes,C
36 | 12.4,236,3.2,No,C
37 | 6.8,308,3.8,No,A
38 | 6.8,434,5.1,Yes,A
39 | 10.5,284,9.4,No,C
40 | 8.3,306,7.5,No,A
41 | 6.6,367,3,No,A
42 | 11.7,398,2.1,No,A
43 | 10.2,424,5.1,No,C
44 | 12.2,291,6.8,No,C
45 | 9,399,2.7,No,C
46 | 9.3,226,2.3,Yes,C
47 | 9.9,324,3.6,Yes,A
48 | 8,321,3.7,Yes,A
49 | 9.6,355,6.3,No,C
50 | 9.2,377,5,Yes,A
51 | 10.9,229,4.1,Yes,A
52 | 7.1,350,3.5,Yes,A
53 | 12.4,402,4.8,Yes,C
54 | 7.8,246,2.6,Yes,C
55 | 8.6,322,3.3,No,C
56 | 8.7,345,8.1,Yes,C
57 | 8.2,283,4.6,Yes,A
58 | 9.1,281,2.9,No,A
59 | 9.5,287,8.4,No,C
60 | 6,272,4.4,Yes,A
61 | 12,301,5.4,No,A
62 | 12.5,354,11.5,No,A
63 | 11,290,9.4,Yes,C
64 | 8.2,192,7.3,No,C
65 | 6.6,311,5.6,No,C
66 | 10.2,398,4.4,No,C
67 | 10.3,278,6.3,No,A
68 | 9.8,273,3.5,No,A
69 | 9.4,315,7.5,No,C
70 | 12.1,312,4.3,No,A
71 | 11.5,443,11.4,Yes,A
72 | 11.1,263,7.5,Yes,A
73 | 14.6,316,3.3,Yes,C
74 | 9.8,249,2.3,No,C
75 | 10.1,367,2.3,Yes,C
76 | 10.2,368,3.2,Yes,C
77 | 11.3,303,11.2,Yes,A
78 | 8.9,336,8.8,Yes,A
79 | 10.4,228,2.8,Yes,C
80 | 10.7,415,3.2,No,A
81 | 7.7,189,3.3,Yes,A
82 | 14.8,230,8.7,Yes,A
83 | 10.9,253,3.9,No,C
84 | 12.4,215,2.6,No,C
85 | 8.5,252,7.1,Yes,C
86 | 11.3,293,7.3,No,C
87 | 9,358,8.8,No,A
88 | 11.6,284,2.2,Yes,A
89 | 11,355,7.9,No,C
90 | 8.3,409,7.7,No,A
91 | 8.9,343,7,No,A
92 | 6.2,298,3.5,No,A
93 | 9.3,312,3.8,No,C
94 | 11.4,271,3,No,C
95 | 10.4,215,6,No,C
96 | 10.5,309,3.2,Yes,C
97 | 8.7,275,2.3,Yes,A
98 | 9.2,305,8.5,Yes,A
99 | 12.8,359,12.7,No,C
100 | 14,310,13.4,Yes,A
101 | 13.9,307,6.3,Yes,A
102 | 7.8,315,5.6,Yes,A
103 | 11.6,255,9.8,Yes,C
104 | 7.9,217,4.7,Yes,C
105 | 10.9,331,3.7,No,C
106 | 8.8,306,5.1,Yes,C
107 | 11.6,312,6.5,Yes,A
108 | 8.5,308,6.5,No,A
109 | 5.2,290,2.1,No,C
110 | 4.3,370,2.2,Yes,A
111 | 9.3,332,6,No,A
112 | 10.1,338,5.3,No,A
113 | 11.3,231,2.5,Yes,C
114 | 11.1,280,2.8,No,C
115 | 7.7,398,4.3,No,C
116 | 9.7,265,7.8,No,C
117 | 10.5,229,9.9,No,A
118 | 12,384,6.8,No,A
119 | 10.4,303,8.2,No,C
120 | 12.2,352,2.6,No,A
121 | 9.3,383,6.7,Yes,A
122 | 10.4,292,6.7,Yes,A
123 | 11.7,391,4.8,Yes,C
124 | 9.5,270,3.1,No,C
125 | 8,326,4.4,Yes,C
126 | 9.7,317,3.3,Yes,C
127 | 6.4,339,2.8,Yes,A
128 | 5.7,268,4.4,Yes,A
129 | 8.3,308,3,Yes,C
130 | 9.4,293,4.4,No,A
131 | 7.3,291,3.9,Yes,A
132 | 15.1,179,11.3,Yes,A
133 | 9.1,356,8.5,No,C
134 | 9.5,257,2.3,No,C
135 | 7.7,374,2.5,Yes,C
136 | 8.3,443,5.7,No,C
137 | 7.4,258,7.3,No,A
138 | 9.7,320,2.5,Yes,A
139 | 8.9,338,8.8,No,C
140 | 10.4,308,5.2,No,A
141 | 14.6,353,13.1,No,A
142 | 8.8,302,8,No,A
143 | 8.3,304,3,No,C
144 | 14.3,274,10.5,No,C
145 | 8.9,343,5.8,No,C
146 | 8.5,346,4.4,Yes,C
147 | 11.3,317,7.3,Yes,A
148 | 12.1,354,7.8,Yes,A
149 | 8,353,4.4,No,C
150 | 12.6,320,11.3,Yes,A
151 | 9.8,310,8.6,Yes,A
152 | 11.9,300,3.1,Yes,A
153 | 8.4,318,4.4,Yes,C
154 | 10.8,332,3,Yes,C
155 | 10.4,284,7.4,No,C
156 | 9.8,229,5.1,Yes,C
157 | 10.2,335,4.9,Yes,A
158 | 10.4,262,10,No,A
159 | 10.5,287,6.5,No,C
160 | 10,263,6.8,Yes,A
161 | 8.8,321,3.4,No,A
162 | 10,210,5.7,No,A
163 | 8.1,297,4.1,Yes,C
164 | 8.4,199,3,No,C
165 | 10.2,288,4.1,No,C
166 | 13.1,316,6.5,No,C
167 | 8.9,293,3.6,No,A
168 | 8.2,356,3.2,No,A
169 | 9.8,387,7.2,No,C
170 | 7.4,291,2.4,No,A
171 | 8.7,299,2.1,Yes,A
172 | 10,394,7.6,Yes,A
173 | 8,361,3.2,Yes,C
174 | 12.3,288,10.1,No,C
175 | 8.9,278,2.7,Yes,C
176 | 6.9,347,3.2,Yes,C
177 | 9,307,2.8,Yes,A
178 | 10.4,338,3.2,Yes,A
179 | 13.2,319,9.3,Yes,C
180 | 8.2,317,2.9,No,A
181 | 9.9,249,3,Yes,A
182 | 10.3,380,7.8,Yes,A
183 | 13,280,4.2,No,C
184 | 10.8,313,3.8,No,C
185 | 9.1,316,6.6,Yes,C
186 | 13.7,272,4.1,No,C
187 | 8.3,367,5.9,No,A
188 | 9.3,364,5.8,Yes,A
189 | 11,258,6.4,No,C
190 | 8.9,309,8.3,No,A
191 | 12.2,393,8.5,No,A
192 | 14.7,257,8.8,No,A
193 | 10.5,292,3.1,No,C
194 | 15,332,15,No,C
195 | 8.5,344,3,No,C
196 | 12.5,331,10.2,Yes,C
197 | 10,309,2.3,Yes,A
198 | 10,300,2.9,Yes,A
199 | 12.3,273,7.3,No,C
200 | 12,290,5.3,Yes,A
201 | 12.3,313,10.8,Yes,A
202 | 9.7,320,2.3,Yes,A
203 | 9,240,2.5,Yes,C
204 | 10.1,358,9.2,Yes,C
205 | 9.8,397,5.8,No,C
206 | 11.2,314,3.1,Yes,C
207 | 8.9,250,2.1,Yes,A
208 | 7.5,370,7.1,No,A
209 | 10.8,343,4.7,No,C
210 | 9.1,358,3.7,Yes,A
211 | 11.8,328,3.8,No,A
212 | 10.1,288,5.7,No,A
213 | 6.5,292,2.6,Yes,C
214 | 12.6,292,12.5,No,C
215 | 9.5,375,4.1,No,C
216 | 10.5,338,3.4,No,C
217 | 9.3,232,4.3,No,A
218 | 10.8,318,3.1,No,A
219 | 13.5,288,8.9,No,C
220 | 9.4,317,7.3,No,A
221 | 15.4,351,8.3,Yes,A
222 | 9.6,325,5.3,Yes,A
223 | 6.3,448,4.3,Yes,C
224 | 8.3,232,2.7,No,C
225 | 8.8,408,5.1,Yes,C
226 | 11.1,285,5.5,Yes,C
227 | 8.4,343,4,Yes,A
228 | 8.1,316,2.5,Yes,A
229 | 10.2,355,4.2,Yes,C
230 | 12.9,258,3.8,No,A
231 | 8.2,269,3.9,Yes,A
232 | 11.9,373,11.5,Yes,A
233 | 9.9,237,7.4,No,C
234 | 12.4,356,4.9,No,C
235 | 8,232,3.3,Yes,C
236 | 7.4,327,3.9,No,C
237 | 6.6,310,4.8,No,A
238 | 11.1,266,2.2,Yes,A
239 | 12.2,354,10.8,No,C
240 | 8.8,218,5.4,No,A
241 | 3.7,356,3.5,No,A
242 | 13.8,253,6,No,A
243 | 11.6,232,3,No,C
244 | 8,285,3.2,No,C
245 | 11.5,380,8.1,No,C
246 | 5.6,308,3.5,Yes,C
247 | 9.9,305,9,Yes,A
248 | 8.5,350,7.2,Yes,A
249 | 8.6,276,7.9,No,C
250 | 9.6,278,2.6,Yes,A
251 | 7.4,300,2.3,Yes,A
252 | 10.7,301,9.4,Yes,A
253 | 11.7,278,4,Yes,C
254 | 6.5,243,6,Yes,C
255 | 7.4,267,3,No,C
256 | 10.1,297,7,Yes,C
257 | 12.7,350,8.1,Yes,A
258 | 7.4,337,3.4,No,A
259 | 8.8,261,4.4,No,C
260 | 13.7,294,12.6,Yes,A
261 | 11.5,364,11,No,A
262 | 11.2,381,6.5,No,A
263 | 10.6,265,8.8,Yes,C
264 | 10,322,9.2,No,C
265 | 8.6,279,5.2,No,C
266 | 6.8,330,2.9,No,C
267 | 14.2,304,5.2,No,A
268 | 9.6,290,2.5,No,A
269 | 14.8,234,9.7,No,C
270 | 11,312,2.5,No,A
271 | 6.6,273,2.1,Yes,A
272 | 5.2,336,2.9,Yes,A
273 | 11.6,242,6.9,Yes,C
274 | 8.2,355,2.8,No,C
275 | 10.1,378,10,Yes,C
276 | 10.2,334,3.6,Yes,C
277 | 7.4,327,4.6,Yes,A
278 | 7.7,292,3.6,Yes,A
279 | 9.6,338,4.6,Yes,C
280 | 9.3,331,3.6,No,A
281 | 13.4,412,11.7,Yes,A
282 | 7.1,389,6.8,Yes,A
283 | 6.7,264,4.8,No,C
284 | 11.4,260,9.5,No,C
285 | 7.2,269,3.3,Yes,C
286 | 10.6,343,4.4,No,C
287 | 11.4,244,9.6,No,A
288 | 9.6,306,5.6,Yes,A
289 | 9.2,292,4,No,C
290 | 10.3,249,10.2,No,A
291 | 9.4,270,7.9,No,A
292 | 10.5,291,8,No,A
293 | 4.6,309,3.7,No,C
294 | 10.9,363,5.4,No,C
295 | 6.9,299,4.4,No,C
296 | 10.4,438,7.6,Yes,C
297 | 11.8,383,9.6,Yes,A
298 | 10.6,239,4.1,Yes,A
299 | 9.4,278,6.9,No,C
300 | 10,355,8.7,Yes,A
301 | 11.7,325,4,Yes,A
302 |
--------------------------------------------------------------------------------
/08_Assumptions_for_parametric_tests.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Assumptions for parametric tests"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | ```{r, echo=FALSE}
17 | knitr::opts_chunk$set(echo = TRUE)
18 | setwd(getwd())
19 | ```
20 |
21 | 
22 |
23 | ## Introduction
24 |
25 | In the last chapter, we learned that the use of parametric tests for numerical variable comparisons (such as Student's _t_ test) is ubiquitous. In _most_ cases, these tests are also robust, detecting differences despite the distribution of the data point values used. Robustness is not guaranteed, however, and, in fact, assumptions underlie the use of these tests. This means that we cannot always use them; the assumptions must be met first. Checking these assumptions can be tedious, so much so that some researchers ignore them and simply always use parametric tests. This is wrong. It is important to be aware of the assumptions and to test them before the final decision to use a parametric test is made.
26 |
27 | What are these assumptions, then? Below follows a short description of four of the most important assumptions. We will see that when some of them are not met, we use alternative forms of parametric tests and, in some cases, we move on to nonparametric tests.
28 |
29 | ## Assumptions for the use of parametric tests
30 |
31 | ### Normal distribution
32 |
33 | Perhaps most important is the fact that the data point values for the variable must be normally distributed in the underlying population. In regression analysis and in general linear models, it is the errors (residuals) that need to be normally distributed.
34 |
35 | Unfortunately, we do not have access to the data for the whole population. Therefore, we make use of tests to investigate the sample data and make inferences with respect to the distribution in the population.
36 |
37 | If this assumption is not met, it is important to use nonparametric alternatives.
38 |
39 | ### Homogeneity of variance
40 |
41 | This refers to the need for a similarity in the variance throughout the data. This means that the variable in the populations from which the samples were taken has a similar variance in each of these populations. In the case of regression, the variance of the residuals should remain constant across the values of the predictor variables.
42 |
43 | We have already met Levene's test, which can be used to test this assumption. We use the unequal variance _t_ test when comparing two groups with unequal variances. Nonparametric tests can also be considered.
44 |
45 | ### Variable data type
46 |
47 | It is obvious that the data point values should be for a numerical variable and measured at this level. We should not use _t_ tests and analysis of variance for ordinal categorical variables even if expressed as numbers.
48 |
49 | ### Independence
50 |
51 | Data point values for variables for different groups should be independent of each other. In regression analysis, the errors should likewise be independent.
52 |
53 | Now that we have a better idea of the assumptions that the use of parametric tests demands, we take a closer look at the first two assumptions. Violating the first assumption must lead to the use of non-parametric tests and violating the second should lead to the use of modified _t_ tests.
54 |
55 | ## The assumption of normality
56 |
57 | This assumption is arguably the most important. It can be checked visually or numerically. Histograms and quantile-quantile (QQ) plots serve as visual checks, and various statistical tests, such as the Shapiro-Wilk test, serve as numerical tests.
58 |
59 | ### Visual tests
60 |
61 | In the code below, 100 data point values for a variable named `hb` are created. The values are taken from a normal distribution with a mean of 15 and a standard deviation of 3. A histogram with the default bin size is created so as to visualize the frequency distribution of the data. A kernel density estimate is provided using the `lines()` command.
62 |
63 | ```{r Creating and plotting normal data}
64 | set.seed(123) # Seeding the pseudo random number generator to ensure reproducible values
65 | hb <- rnorm(100, # One hundred values from a normal distribution
66 | mean = 15, # Mean of 15
67 | sd = 3) # Standard deviation of three
68 | hist(hb, prob = TRUE, main = "Histogram of hemoglobin values", las = 1, xlab = "Hemoglobin")
69 | lines(density(hb)) # A density plot line
70 | ```
71 |
72 | From the plot above, it seems obvious that the data are normally distributed. The QQ plot below plots the sample quantile of each data point value against its theoretical quantile. A line is added for clarity. The more closely the data point values follow the line, the more likely it is that our assumption has been met, i.e. the data come from a population in which the variable is normally distributed.
73 |
74 |
75 | ```{r A QQ plot of the normal data}
76 | qqnorm(hb, main = "QQ plot of hemoglobin values")
77 | qqline(hb)
78 | ```
79 |
80 | The next computer variable is named `crp` and its values are drawn from a gamma distribution. Once again, $100$ data point values are created, followed by the accompanying histogram and QQ plot.
81 |
82 | ```{r One hundred random values from a gamma distribution}
83 | set.seed(123)
84 | crp = rgamma(100, 2, 2)
85 | ```
86 |
87 | ```{r Plotting the data}
88 | hist(crp,
89 | prob = TRUE,
90 | main = "Histogram of c-reactive protein values",
91 | las = 1,
92 | xlab = "CRP",
93 | ylim = c(0, 0.9))
94 | lines(density(crp))
95 | ```
96 |
97 | It is clear that the data point values are not normally distributed. In the QQ plot below, the markers do __NOT__ follow the straight line.
98 |
99 | ```{r QQ plot of crp}
100 | qqnorm(crp, main = "QQ plot of CRP values")
101 | qqline(crp)
102 | ```
103 |
104 | As expected, the visual indication is that the assumption of normality is not met.
105 |
106 | ### Numerical tests
107 |
108 | Simply describing the data point values of a variable can give a good understanding of the underlying distribution. The `summary()` function returns the basic descriptive statistics, including the minimum, maximum, and the quartile values.
109 |
110 | ```{r Summary statistics of hb}
111 | summary(hb)
112 | ```
113 |
114 | If the mean of the data point values is approximately the same as the median value, it can be an indication of the normality of the data.
115 |
116 | ```{r Average and median of hb}
117 | mean(hb)
118 | median(hb)
119 | ```
120 |
121 | The Shapiro-Wilk test is often used to test for normality. The null hypothesis of this test states that the data point values from our samples are for a variable which is normally distributed in the population from which they were taken. The alternative hypothesis states that the underlying population distribution is not normal. Below, the `hb` and `crp` variables are passed as arguments to the `shapiro.test()` command.
122 |
123 | ```{r Shapiro Wilk tests for the two variables}
124 | shapiro.test(hb)
125 | shapiro.test(crp)
126 | ```
127 |
128 | The `crp` list of values is certainly not from a population in which the variable is normally distributed.
129 |
130 | ## The assumption of homogeneity of variance
131 |
132 | The Levene test is used to test for homogeneity of variance. The null hypothesis states equality of variances. In order to conduct Levene's test, the _Companion to Applied Regression_ (`car`) package is required. We used this function in the chapter on the parametric comparison of means.
133 |
134 | The `leveneTest()` command requires the use of a `data.frame` object. The code below imports a `csv` file and prints the first six rows to the screen.
135 |
136 | ```{r Importing a data file}
137 | df = read.csv(file = "ProjectIIData.csv", header = TRUE)
138 | head(df)
139 | ```
140 |
141 | Note the two variables that we will use, namely `CRP` and `Group`. The first variable is ratio-type numerical and the second is nominal categorical (a `factor` in `R`). The `leveneTest()` function requires specification of a factor by which to group the numerical variable under consideration. The third argument used in the function is `center =` and can take the value `median` (default) or `mean`. We are interested in the mean. Using the median creates the Brown-Forsythe test for variances.
142 |
143 | The code below imports the `car` package and runs the Levene test.
144 |
145 | ```{r Levene test}
146 | library(car)
147 | leveneTest(df$CRP, df$Group, center = mean)
148 | ```
149 |
150 | The _p_ value is greater than the usually chosen $\alpha$ value of $0.05$. The null hypothesis is not rejected and the variances are accepted as being equal, allowing for the use of Student's _t_ test.
151 |
152 | ## Conclusion
153 |
154 | Testing the assumptions for the use of parametric tests can seem laborious, but is an essential requirement in data analysis.
155 |
156 |
157 | Written by Dr Juan H Klopper
158 | http://www.juanklopper.com
159 | YouTube https://www.youtube.com/jhklopper
160 | Twitter @docjuank
--------------------------------------------------------------------------------
/10_Analyzing_categorical_data.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "10 Analyzing categorical data"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = TRUE)
18 | setwd(getwd()) # Spreadsheet file and R markdown file in same folder
19 | ```
20 |
21 | ```{r Libraries, message=FALSE, warning=FALSE}
22 | library(dplyr)
23 | ```
24 |
25 | 
26 |
27 | ## Preamble
28 |
29 | Many research projects contain categorical, and more specifically, nominal categorical variables. The sample spaces of these variables are not made up of numbers, yet we require some way of analyzing them.
30 |
31 | To the rescue come the tests for categorical variables, where we use the counts of the sample space elements to answer our research questions.
32 |
33 | In this chapter we will meet the common tests that are used to answer these questions. Take careful note of what their results say about the data, so as to use the correct test for your data.
34 |
35 | ## The $\chi^2$ goodness of fit test
36 |
37 | The $\chi^2$ goodness of fit test examines the proportions of the sample space data point values for a nominal categorical variable. It is a test of proportions. Below, we import the `ProjectIIData.csv` file and save it as a data frame object with the computer variable name `df`.
38 |
39 | ```{r Data import}
40 | df <- read.csv("ProjectIIData.csv")
41 | ```
42 |
43 | One of the variables in the dataset is `Improvement`. In this simulated dataset, it refers to whether patients improved after receiving treatment. The sample space is dichotomous, with values `Yes` and `No`. We can use the `table()` function to see a count for each of these elements.
44 |
45 | ```{r Count of the sample space elements for the improvement variable}
46 | table(df$Improvement)
47 | ```
48 |
49 | A total of $156$ patients did not improve, whereas $144$ did. Before starting the data collection, we might have suspected a $0.5:0.5$ split in the data. Instead, we found proportions of `r 156 / (156 + 144)`:`r 144 / (156 + 144)`. We can test our findings against the expected proportions using the `chisq.test()` function. The first argument is a numeric vector containing the actual counts. The `p =` argument holds a numerical vector containing the expected proportions, which must sum to $1.0$.
50 |
51 | ```{r Chi square goodness of fit test}
52 | chisq.test(c(156, 144),
53 | p = c(0.5, 0.5))
54 | ```
55 |
56 | Since we only have a sample space of two elements, we have a single degree of freedom. For the appropriate $\chi^2$ distribution, our findings represent a _p_ value of $0.49$. For an $\alpha$ value of $0.05$, we cannot reject the null hypothesis that there is no difference between the observed and expected proportions.
57 |
58 | ## The $\chi^2$ test for independence
59 |
60 | The $\chi^2$ test for independence measures, as the name implies, whether two categorical variables are independent. In our dataset, we also have a categorical variable named `Group`. We can inspect it with the `table()` function too.
61 |
62 | ```{r Sample space elements count for the Group variable}
63 | table(df$Group)
64 | ```
65 |
66 | Here we see that there are $150$ patients in each group. We can combine this information with that of the `Improvement` variable into what is termed a _contingency_ table of observed values. We use the same `table()` function, but list both variables.
67 |
68 | ```{r Contingency table of Group vs Improvement}
69 | table(df$Group,
70 | df$Improvement)
71 | ```
72 |
73 | Now we can see that $66$ patients in group A did not improve, while $84$ did. For patients in group C, these numbers are $90$ and $60$. We can use this contingency table of observed values to investigate whether the two nominal categorical variables are _independent_ of each other.
74 |
75 | ```{r Chi square test for independence}
76 | chisq.test(table(df$Group,
77 | df$Improvement),
78 | correct = FALSE) # Ignoring Yates' correction for continuity
79 | ```
80 |
81 | The degrees of freedom for this contingency table is $1$. This is calculated as $1$ subtracted from the number of rows times $1$ subtracted from the number of columns, i.e. $\left( 2 - 1 \right) \times \left( 2 - 1 \right) = 1 \times 1 = 1$. The $\chi^2$ value is $7.6923$, representing a _p_ value of < $0.01$. Note that given an $\alpha$ value of $0.05$, we do not specify values smaller than $0.01$ when reporting the _p_ value. It is simply stated as being less than $0.01$. Given this small _p_ value, we can reject our null hypothesis that the variables are independent and accept the alternative hypothesis that they are indeed dependent. Which group the patient ended up in did influence their _improvement_ outcome.
82 |
83 | The $\chi^2$ test works on the principle of subtracting an expected table, $E$, from our table of observed values, $O$. This is shown in equation (1).
84 |
85 | $$ \chi^2 = \sum_{i=1}^{n} \frac{\left( \text{O}_i - \text{E}_i \right)^2}{\text{E}_i} \tag{1}$$
86 |
87 | This simulated study had a total of $300$ subjects. Some $150$ were in group A and $150$ in group C. Conversely, $144$ improved and $156$ did not. Since the observed table is $ 2 \times 2 $ in size, the expected table will be as well. Each value in the latter is determined by its position. For the value in row $1$, column $1$, we multiply the total of row $1$ by the total of column $1$ in the observed table and divide by the sample total. This would be $150 \times 156 \div 300 = 78$. The same goes for the other three values. The _p_ value is calculated from the $\chi^2$ distribution for the given degrees of freedom.
88 |
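As a quick illustrative sketch (not part of the original analysis), we can compute the expected table directly from the margins of the observed table, or extract it from the `chisq.test()` result.

```{r Expected table sketch}
observed <- table(df$Group, df$Improvement)  # the observed contingency table from above
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)  # row total x column total / grand total
expected
chisq.test(observed, correct = FALSE)$expected  # the same values, as returned by chisq.test()
```
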
89 | We need not confine ourselves to $2 \times 2$ contingency tables. If each nominal categorical variable has more than two elements in its sample space, say $n$ and $m$, we would create an $n \times m$ table.
90 |
91 | Note that the $\chi^2$ distribution is a continuous distribution, yet our nominal categorical variable is, in essence, a discrete variable. Yates' correction for continuity is sometimes used to address this _problem_, which is created by assuming that the discrete probabilities of counts in the table can be approximated by a continuous distribution. It subtracts a half ($0.5$) from each absolute difference between the observed and expected values. It is shown in equation (2), but is not commonly used.
92 |
93 | $$ \chi^2 = \sum_{i=1}^{n} \frac{\left( \lvert \text{O}_i - \text{E}_i \rvert - 0.5 \right)^2}{\text{E}_i} \tag{2}$$
94 |
95 | One problem that arises from this approximation of a discrete frequency distribution, and that we must consider, is that of a small sample size. As a rule of thumb, we run into this problem with the $\chi^2$ test for independence if more than $20$% of the expected cell counts are less than five. In this case, we make use of Fisher's exact test.
96 |
97 | ## Fisher's exact test
98 |
99 | Fisher's test is termed an exact test, as the _p_ value is calculated directly. It does not follow from a sampling distribution. This test also requires the creation of a contingency table, but it is limited to a $ 2 \times 2 $ table. If there are more than two elements in the sample space of either of the categorical variables, then they have to be combined in a way that makes sense for the research question at hand.
100 |
101 | The equation for Fisher's exact test for a total sample size of $N$, is shown in equation (3).
102 |
103 | $$ \text{contingency table} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \\ p = \frac{\left( a + b \right)! \left( c + d \right)! \left( a + c \right)! \left( b + d \right)!}{a! \, b! \, c! \, d! \, N!} \tag{3}$$
104 |
105 | The `fisher.test()` function uses equation (3). The values have to be entered as a matrix, which we create below. Note that we list all four values, but then specify that we require two rows. This will neatly split the data after the first two values. Be sure to enter the values in the correct order when copying from your data.
106 |
107 | ```{r Creating a contingency table with small sample counts}
108 | contingency <- matrix(c(3, 4, 6, 2), # The fisher.test function requires a matrix as input
109 | nrow = 2) # Creating two rows from the four elements
110 | contingency
111 | ```
112 |
113 | ```{r Fisher exact test}
114 | fisher.test(contingency)
115 | ```
116 |
117 | We note a _p_ value of $0.31$.
118 |
119 | ## McNemar test
120 |
121 | This test is used for matched pairs of subjects. Two categorical variables are used and data point values are captured for both in each subject. In practice, these are either _before_ and _after_ variables or monozygotic twins, and so on.
122 |
123 | Both variables require a dichotomous outcome, i.e. the sample space only contains two elements. This results in a $ 2 \times 2 $ contingency table. The equation results in a $\chi^2$ value with a single degree of freedom. This is shown in equation (4).
124 |
125 | $$ \chi^2 = \frac{{\left( b - c \right)}^2}{b + c} \tag{4} $$
126 |
127 | Below, we have an example of $315$ patients receiving a treatment, with the presence of a certain symptom noted before and after the treatment. Of the $222$ patients who had the symptom before treatment, $101$ still had it afterwards and $121$ did not. Of the $93$ patients without the symptom before treatment, $59$ had it afterwards and $34$ did not.
128 |
129 | ```{r Creating a contingency stable for the McNemar test}
130 | contingency <- matrix(c(101, 59, 121, 34),
131 | nrow = 2,
132 | byrow = FALSE)
133 | contingency
134 | ```
135 |
136 | Note how, in our contingency table, the before values make up the rows and the after values the columns, with the positive outcome listed first and the negative outcome second.
137 |
138 | The `mcnemar.test()` function calculates the $\chi^2$ and _p_ value for us, based on a single degree of freedom.
139 |
140 | ```{r Calculating a p value for the McNemar test}
141 | mcnemar.test(contingency,
142 | correct = FALSE) # Ignoring continuity correction
143 | ```
144 |
145 | Note that we once again ignore continuity correction.
146 |
147 | ## Conclusion
148 |
149 | The various tests for categorical variables are very commonly used in the literature. Make sure to use the correct test based on the research question and the sample size.
--------------------------------------------------------------------------------
/09_Nonparametric_tests.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Non-parametric tests"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | 
17 |
18 | ## Preamble
19 |
20 | In the previous chapter we learned that we cannot always use the common parametric tests to analyze our data.
21 |
22 | This chapter is all about the proper alternatives. These non-parametric tests are not dependent upon the parameters that we saw in the assumptions for the use of parametric tests.
23 |
24 | ## The Wilcoxon rank-sum test
25 |
26 | The Wilcoxon rank-sum test is used in place of the unpaired sample _t_ test (Student's _t_ test). In this situation, the collected data point values do not meet the assumptions for the use of a parametric test. It is, very confusingly, also known as the _Mann-Whitney U test_, the _Wilcoxon-Mann-Whitney test_ and the _Mann-Whitney-Wilcoxon test_.
27 |
28 | Although the parametric assumptions are not required, this test does require that the data point values are ordinal in some way, i.e. that they can be ranked in order. This makes the test ideal for ordinal categorical data point values expressed as numbers as well.
29 |
30 | All the data point values from the two groups are placed in the same set and ranked in ascending order. The lowest value is assigned a rank of $1$ and so on, until the largest value, given the last (and highest) rank.
31 |
32 | The groups are again separated and the ranks in each group summed. The null hypothesis is that these sums will be similar in both groups. This means that it is equally likely that a randomly selected sample from one group will be less than or greater than a randomly selected sample from the other group.
33 |
34 | Measurement capabilities and rounding do affect the Wilcoxon rank-sum test. If measurements are only accurate to a certain decimal value, two data point values will be given the same rank, even though they might not be equal given a more accurate measurement (more decimal values). Rounding of numerical values has a similar effect.
35 |
36 | To investigate this test, we will create some simulated data and save it as a data frame object.
37 |
38 | ### Creating data
39 |
40 | Below we create a data frame object with two statistical variables, `Group` and `CRP`. The former is nominal categorical with a sample space containing two elements. We will use this variable to construct two groups. The latter variable is numerical and consists of $100$ values following a $\chi^2$ distribution with a degrees of freedom value of $2$.
41 |
42 | ```{r Creating data and a data.frame}
43 | set.seed(123) # Seeding the pseudo-random number generator for reproducible results
44 | group <- sample(c("A", "B"),
45 | 100,
46 | replace = TRUE)
47 |
48 | df <- data.frame(Group = group,
49 | CRP = round(rchisq(100, # Non-normal distribution
50 | 2),
51 | digits = 1))
52 | ```
53 |
54 | We can view the first six rows using the `head()` function.
55 |
56 | ```{r First six rows}
57 | head(df)
58 |
59 | ```
60 |
61 | ### Conducting the test
62 |
63 | The `wilcox.test()` function will conduct our Mann-Whitney U test. We use formula notation and separately define the data frame object using the `data = ` keyword argument. Note that we could also have used the `df$CRP ~ df$Group` syntax, omitting the second argument.
64 |
65 | ```{r Wilcoxon rank sum test}
66 | test1 <- wilcox.test(CRP ~ Group,
67 | data = df)
68 | ```
69 |
70 | Let's look at the results of our test.
71 |
72 | ```{r Results of the Wilcoxon rank-sum tests}
73 | test1
74 | ```
75 |
76 | We note a test statistic and a _p_ value, which we can use in a similar fashion as the Student's _t_ test in our reports.
77 |
78 | As an aside, the `wilcox.test()` function also allows the use of data stored in two separate vectors.
79 |
80 | ```{r Alternative Wilcoxon rank-sum test format}
81 | grpA <- rnorm(50,
82 | 0,
83 | 1)
84 | grpB <- rchisq(50,
85 | 1)
86 |
87 | wilcox.test(grpA,
88 | grpB)
89 | ```
90 |
91 | ### Arguments
92 |
93 | It is worthwhile to take a closer look at the arguments of this function.
94 |
95 | * `paired =`
96 | * This can be set as `FALSE` (default), if the groups are independent, or `TRUE` if they are dependent
97 | * `alternative = c()`
98 | * Any one of `two.sided`, `less`,or `greater` depending on the alternative hypothesis
99 | * `mu = `
100 | * Sets the expected difference in medians
101 | * `exact =`
102 | * Performs an exact test
103 | * By default an exact _p_ value is computed if the samples contain fewer than $50$ values and there are no ties
104 | * Force the use of normal distribution approximation by setting to FALSE
105 | * `correct =`
106 | * Either `TRUE` (the default) or `FALSE`; if `TRUE`, a continuity correction is applied in the normal approximation for the _p_ value
107 | * `conf.int =`
108 | * If set to TRUE, calculates the 95% confidence intervals of the median of all the differences between data point values for the two groups
109 | * Also gives the estimated difference in location in the current case
110 | * Can be approximated using bootstrapping
111 | * `conf.level =`
112 | * Sets the confidence level
113 | * `na.action = `
114 | * Determines what to do with missing values, most commonly `na.exclude`
115 |
116 |
117 | ### Effect size of the Wilcoxon sum-rank test
118 |
119 | Effect sizes are used more commonly today. For the Mann-Whitney U test, this can be calculated approximately by converting the _p_ value to a _z_ score using the `qnorm()` command. The _z_ score is then expressed as an _r_ value using equation (5), where $n$ is the total sample size.
120 |
121 | $$ r = \frac{z}{\sqrt{n}} \tag{5} $$
122 |
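As a minimal sketch of this conversion, assuming the two-sided _p_ value from `test1` above and the total sample size of $100$ used in the simulated data:

```{r Effect size sketch}
p_value <- test1$p.value      # two-sided p value from the Wilcoxon rank-sum test above
z <- abs(qnorm(p_value / 2))  # convert the p value back to a z score
r <- z / sqrt(100)            # n = 100 data point values in the simulated data
r
```
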
123 | ### Example
124 |
125 | Let's look at an example of simulated data to see the process of choosing a non-parametric test in action.
126 |
127 | Given two groups, `Treatment` and `Control`, note the two sets of data point values for the variable `LDL` below.
128 |
129 | ```{r Data for two groups}
130 | treatment <- c(1.8, 1.7, 6.9, 0.8, 2.1, 7.6)
131 | contrl <- c(4.1, 3.2, 6.8, 1.9, 3.0, 4.7)
132 | ```
133 |
134 | A Shapiro-Wilk test shows that the first group is not from a normal distribution.
135 |
136 | ```{r Shapiro-Wilk test}
137 | shapiro.test(treatment)
138 | ```
139 |
140 | To make life easier, we create a data frame object from our two lists.
141 |
142 | Creating a `data.frame`
143 |
144 | ```{r Creating a data.frame}
145 | df <- data.frame(LDL = c(treatment, contrl),
146 | Group = rep(c("Treatment", "Control"),
147 | each = 6))
148 | head(df)
149 | ```
150 |
151 | Displaying a simple box plot gives us a great visual indicator of the data.
152 |
153 | ```{r Box plot}
154 | boxplot(LDL ~ Group,
155 | df,
156 | main = "LDL for Control and Treatment groups",
157 | xlab = "Groups",
158 | ylab = "LDL")
159 | ```
160 |
161 | The skewness in the data (especially for the treatment group) is clearly visible.
162 |
163 | Now for conducting the Wilcoxon rank-sum test.
164 |
165 | ```{r Wilcoxon rank-sum test}
166 | wilcox.test(LDL ~ Group,
167 | data = df)
168 | ```
169 |
170 | The $W$ statistic is the number of times that a data point value from the Treatment group is lower than that of the Control group.
171 |
172 | Sorting the data point values makes this easier to understand.
173 |
174 | ```{r Sorting}
175 | treatment <- sort(treatment)
176 | contrl <- sort(contrl)
177 | treatment
178 | contrl
179 | ```
180 |
181 |
182 | ## Wilcoxon signed-rank test
183 |
184 | This is the non-parametric equivalent to the _paired-sample t test_.
185 |
186 | The same `wilcox.test()` function is used for this test, except that the keyword argument `paired =` is set to `TRUE`.
187 |
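A minimal sketch, using made-up before-and-after values purely for illustration, might look as follows:

```{r Wilcoxon signed-rank sketch}
before <- c(5.2, 6.1, 4.8, 7.1, 5.9, 6.4)  # made-up paired measurements
after <- c(4.9, 5.6, 4.9, 6.2, 5.1, 6.0)

wilcox.test(before,
            after,
            paired = TRUE)  # paired = TRUE gives the signed-rank test
```
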
188 | ## The Kruskal-Wallis test
189 |
190 | The _Kruskal-Wallis test_ is the non-parametric equivalent of analysis of variance. Here we are comparing more than two groups.
191 |
192 | Below we create data for a placebo group and for two treatment groups, let's say a low-dose and a high-dose group.
193 |
194 | ```{r Data for three groups}
195 | treatment_low <- c(1.8, 1.7, 6.9, 0.8, 2.1, 7.6)
196 | treatment_hi <- c(3, 4.2, 7, 4.5, 7.8, 3.3, 7.8)
197 | contrl <- c(4.1, 3.2, 6.8, 1.9, 3.0, 4.7)
198 | ```
199 |
200 | The `kruskal.test()` function takes a `list()` of the three vectors of data point values as its argument.
201 |
202 | ```{r Conducting the Kruskal Wallis test}
203 | kruskal.test(list(treatment_low,
204 | treatment_hi,
205 | contrl))
206 | ```
207 |
208 | We note a Kruskal-Wallis $\chi^2$ statistic, degrees of freedom, and a _p_ value.
209 |
210 | ## Non-parametric alternatives to correlation
211 |
212 | The assumptions for the use of parametric tests apply to correlation as well. Both variables must meet these assumptions.
213 |
214 | The first alternative ranks the values before calculating the correlation. It is termed *Spearman correlation* and the resultant correlation coefficient is *Spearman's* $\rho$. The method can be specified as an argument to the `cor.test()` command.
215 |
216 | First, we create some paired data point values for our correlation.
217 |
218 | ```{r Creating variables for nonparametric correlation}
219 | height <- c(160, 167, 170, 168, 172, 159, 163, 155, 158, 180, 185, 181, 195)
220 | weight <- height + (round((rnorm(length(height), 0, 1)), 3) * 10)
221 | ```
222 |
223 | A scatter plot shows the pairs of values.
224 |
225 | ```{r Scatter plot}
226 | plot(height,
227 | weight,
228 | main = "Correlation between height and weight",
229 | xlab = "Height",
230 | ylab = "Weight",
231 | las = 1)
232 | ```
233 |
234 | Now for the Spearman correlation test, setting the `method = ` argument of the `cor.test()` function to `"spearman"`.
235 |
236 | ```{r Spearman correlation test}
237 | cor.test(height,
238 | weight,
239 | method = "spearman")
240 | ```
241 |
242 | The correlation coefficient in this case is $\rho$. One problem with Spearman's $\rho$ is that it deals poorly with small sample sizes and a large number of tied ranks. *Kendall's* $\tau$ is an alternative non-parametric test for these cases.
243 |
244 | ```{r}
245 | cor.test(height,
246 | weight,
247 | method = "kendall")
248 | ```
249 |
250 | ## Conclusion
251 |
252 | These are the non-parametric test alternatives for the commonly used parametric tests. Always check that the assumptions for the use of parametric tests are met. If not, do use these non-parametric alternatives.
--------------------------------------------------------------------------------
/09_Nonparametric_tests_full.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Nonparametric tests"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | 
17 |
18 | ## Preamble
19 |
20 | In the previous chapter we learned that we cannot always use the common parametric tests to analyze our data.
21 |
22 | This chapter is all about the proper alternatives. These non-parametric tests are not dependent upon the parameters that we saw in the assumptions for the use of parametric tests.
23 |
24 | ## The Wilcoxon rank-sum test
25 |
26 | The Wilcoxon rank-sum test is used in place of the unpaired sample _t_ test (Student's _t_ test). In this situation, the collected data point values do not meet the assumptions for the use of a parametric test. It is, very confusingly, also known as the _Mann-Whitney U test_, the _Wilcoxon-Mann-Whitney test_ and the _Mann-Whitney-Wilcoxon test_.
27 |
28 | Although the parametric assumptions are not required, this test does require that the data point values are ordinal in some way, i.e. that they can be ranked in order.
29 |
30 | All the data point values from the two groups are placed in the same set and ranked in ascending order. The lowest value is assigned a rank of $1$ and so on, until the largest value, given the last (and highest) rank.
31 |
32 | The groups are again separated and the ranks in each group summed. The null hypothesis is that these sums will be similar in both groups. This means that it is equally likely that a randomly selected sample from one group will be less than or greater than a randomly selected sample from the other group.
33 |
34 | When there are _ties_, the ranks associated with these are averaged. For example, if there are three tied values and they were to start at rank $14$, they are initially given the ranks $14$, $15$, and $16$. The ranks are summed, $14+15+16$, which is `r 14+15+16`. This is divided by $3$ (the number of ties) and each tied value is then given this average rank of $15$. The next value up the scale is given a rank of $17$, since the three ties took up $3$ places.
35 |
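A quick sketch (not part of the original worked example) of how R's `rank()` function averages tied ranks in exactly this way:

```{r Tied ranks with the rank function}
rank(c(10, 12, 12, 12, 15))  # the three tied values share the average rank of 3
```
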
36 | The ranks for each group are totalled as $R_1$ and $R_2$.
37 |
38 | So as not to place any group in an advantaged position, due to the fact that one of the groups may have a much larger sample size, the respective minimum possible rank sum, ${\overline{r}}_{i}$, is subtracted from each total. The equation for this value, used in cases where either one or both of the groups have a sample size of less than $8$, is given in equation (1). This results in a _U_ statistic. It is the smaller of the two statistics which is then used to calculate a _p_ value.
39 |
40 | $$ {\overline{r}}_{i}=\frac{n_i \left( n_i + 1 \right)}{2} \\
41 | U_i = R_i - {\overline{r}}_{i} \tag{1}$$
42 |
43 | In cases where __both__ groups have at least $8$ samples, equation (2) is used. Here $n_1$ refers to the smaller of the two samples.
44 |
45 | $$ {\mu}_{R} = \frac{n_1 \left( n_1 + n_2 + 1 \right)}{2} \\
46 | {\sigma}_{R} = \sqrt{\frac{n_1 n_2 \left( n_1 + n_2 + 1 \right)}{12}} \\
47 | z = \frac{R - \mu_R}{\sigma_R} \tag{2} $$
48 |
49 | Note that in some textbooks, $\mu_U$ is given. The equations are algebraically equivalent and are expressed in equation (3).
50 |
51 | $$ \mu_U = \frac{n_1 n_2}{2} \\
52 | \sigma_U = \sqrt{\frac{n_1 n_2 \left( n_1 + n_2 + 1 \right)}{12}} \\
53 | z = \frac{U - \mu_U}{\sigma_U} \tag{3} $$
54 |
55 | In the case of equation (1) the _p_ value can be calculated using the _Monte Carlo method_, where a similar study group sample is created over and over again. This creates a sampling distribution against which the original $U$ is tested. Alternatively, _U_ statistic tables are available.
56 |
57 | It is also possible to assume a normal distribution for the test statistic (as in equation (2) and equation (3)) and calculate from there.
58 |
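A minimal sketch of equation (2), assuming that `n1` (the size of the smaller group), `n2`, and `R_sum` (the rank sum of the smaller group) have already been calculated:

```{r Normal approximation sketch, eval=FALSE}
mu_R    <- n1 * (n1 + n2 + 1) / 2              # expected rank sum
sigma_R <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # standard deviation of the rank sum
z       <- (R_sum - mu_R) / sigma_R            # z statistic of equation (2)
z
```
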
59 | Also note that in the case of tied ranks, $\sigma$ must be calculated as in equation (4).
60 |
61 | $$ \sigma_U = \sqrt{ \frac{n_1 n_2}{12} \left[ \left( n + 1\right) - \sum_{i = 1}^{k} \frac{t_i^3 - t_i}{n \left( n - 1 \right)} \right]} \tag{4} $$
62 |
63 | In this equation $n = n_1 + n_2$, $t_i$ is the number of subjects sharing rank $i$, and $k$ is the number of distinct ranks.
64 |
65 | Measurement capabilities and rounding do affect the Wilcoxon rank-sum test. If measurements are only accurate to a certain decimal value, two data point values will be given the same rank, even though they might not be equal given a more accurate measurement (more decimal values). Rounding of numerical values has a similar effect.
66 |
67 | To investigate this test, we will create some simulated data and save it as a data frame object.
68 |
69 | ### Creating data
70 |
71 | Below we create a data frame object with two statistical variables, `Group` and `CRP`. The former is nominal categorical with a sample space containing two elements. We will use this variable to construct two groups. The latter variable is numerical and consists of $100$ values following a $\chi^2$ distribution with a degrees of freedom value of $2$.
72 |
73 | ```{r Creating data and a data.frame}
74 | set.seed(123) # Seeding the pseudo-random number generator for reproducible results
75 | group <- sample(c("A", "B"),
76 | 100,
77 | replace = TRUE)
78 |
79 | df <- data.frame(Group = group,
80 | CRP = round(rchisq(100,
81 | 2),
82 | digits = 1))
83 | ```
84 |
85 | We can view the first six rows using the `head()` function.
86 |
87 | ```{r First six rows}
88 | head(df)
89 |
90 | ```
91 |
92 | ### Conducting the test
93 |
94 | The `wilcox.test()` function will conduct our Mann-Whitney U test. We use formula notation and separately define the data frame object using the `data = ` keyword argument. Note that we could also have used the `df$CRP ~ df$Group` syntax, omitting the second argument.
95 |
96 | ```{r Wilcoxon rank sum test}
97 | test1 <- wilcox.test(CRP ~ Group,
98 | data = df)
99 | ```
100 |
101 | Let's look at the results of our test.
102 |
103 | ```{r Results of the Wilcoxon rank-sum tests}
104 | test1
105 | ```
106 |
107 | We note a test statistic and a _p_ value, which we can use in our reports in much the same way as the results of Student's _t_ test.
108 |
109 | As an aside, the `wilcox.test()` function also allows the use of data stored in two separate vectors.
110 |
111 | ```{r Alternative Wilcoxon rank-sum test format}
112 | grpA <- rnorm(50,
113 | 0,
114 | 1)
115 | grpB <- rchisq(50,
116 | 1)
117 |
118 | wilcox.test(grpA,
119 | grpB)
120 | ```
121 |
122 | ### Arguments
123 |
124 | It is worthwhile to take a closer look at the arguments of this function; a short sketch using several of them follows the list below.
125 |
126 | * `paired = `
127 | * Set to `FALSE` (the default) if the groups are independent or to `TRUE` if they are dependent (paired)
128 | * `alternative = `
129 | * Any one of `two.sided`, `less`, or `greater`, depending on the alternative hypothesis
130 | * `mu = `
131 | * Sets the hypothesized difference in location under the null hypothesis
132 | * `exact = `
133 | * Performs an exact test
134 | * The default is `TRUE` if the total sample size is less than $50$ and there are no ties
135 | * Force the use of the normal distribution approximation by setting this to `FALSE`
136 | * `correct = `
137 | * Either `TRUE` (the default) or `FALSE`; in the case of the former, a continuity correction is applied in the calculation of the _p_ value
138 | * `conf.int = `
139 | * If set to `TRUE`, calculates the $95$% confidence interval of the median of all the differences between data point values for the two groups
140 | * Also gives the estimated difference in location
141 | * Can be approximated using bootstrapping
142 | * `conf.level = `
143 | * Sets the confidence level
144 | * `na.action = `
145 | * Determines what to do with missing values, most commonly `na.exclude`
146 |
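As a small sketch (reusing the `df` data frame object created above), several of these arguments in action:

```{r Wilcoxon rank-sum test with several arguments}
wilcox.test(CRP ~ Group,
            data = df,
            alternative = "two.sided",
            exact = FALSE,     # force the normal approximation
            correct = TRUE,    # apply the continuity correction
            conf.int = TRUE)   # also report the difference in location
```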
147 |
148 | ### Effect size of the Wilcoxon rank-sum test
149 |
150 | Effect sizes are reported more and more commonly. For the Mann-Whitney U test, an approximate effect size can be calculated by converting the _p_ value back to a _z_ score using the `qnorm()` command. The _z_ score is then expressed as an _r_ value using equation (5).
151 |
152 | $$ r = \frac{z}{\sqrt{n}} \tag{5} $$
153 |
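A minimal sketch of this calculation, reusing the `test1` result and the `df` data frame object created above:

```{r Approximate effect size for the Wilcoxon rank-sum test}
z <- qnorm(test1$p.value / 2)  # z score corresponding to the two-sided p value
r <- abs(z) / sqrt(nrow(df))   # equation (5); the denominator is the total sample size
r
```
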
154 | ### Example
155 |
156 | Let's look at an example of simulated data to see the process of choosing a non-parametric test in action.
157 |
158 | Given two groups, `Treatment` and `Control`, note the two sets of data point values for the variable `LDL`.
159 |
160 | ```{r Data for two groups}
161 | treatment <- c(1.8, 1.7, 6.9, 0.8, 2.1, 7.6)
162 | contrl <- c(4.1, 3.2, 6.8, 1.9, 3.0, 4.7)
163 | ```
164 |
165 | A Shapiro-Wilk test shows that the first group is not from a normal distribution.
166 |
167 | ```{r Shapiro-Wilk test}
168 | shapiro.test(treatment)
169 | ```
170 |
171 | To make life easier, we create a data frame object from our two vectors using the `data.frame()` function.
172 | 
174 |
175 | ```{r Creating a data.frame}
176 | df <- data.frame(LDL = c(treatment, contrl),
177 | Group = rep(c("Treatment", "Control"),
178 | each = 6))
179 | head(df)
180 | ```
181 |
182 | Displaying a simple box plot gives us a great visual indicator of the data.
183 |
184 | ```{r Box plot}
185 | boxplot(LDL ~ Group,
186 | df,
187 | main = "LDL for Control and Treatment groups",
188 | xlab = "Groups",
189 | ylab = "LDL")
190 | ```
191 |
192 | The skewness in the data is clearly visible.
193 |
194 | Now for conducting the Wilcoxon rank-sum test.
195 |
196 | ```{r Wilcoxon rank-sum test}
197 | wilcox.test(LDL ~ Group,
198 | data = df)
199 | ```
200 |
201 | The $W$ statistic is the number of times that a data point value from the Treatment group is lower than that of the Control group.
202 |
203 | Sorting the data point values makes this easier to understand.
204 |
205 | ```{r Sorting}
206 | treatment <- sort(treatment)
207 | contrl <- sort(contrl)
208 | treatment
209 | contrl
210 | ```
211 |
212 | The `outer()` command checks each value in the first vector (in turn) against every value in the second vector and returns a Boolean value for each pair, based on the expression. Below it checks for _less than_.
213 |
214 | ```{r The outer() command to show the cases where treatment is less than contrl}
215 | outer(treatment, contrl, "<")
216 | ```
217 |
218 | The `sum()` command will sum all the `TRUE` values.
219 |
220 |
221 | ```{r Sum the TRUE values}
222 | sum(outer(treatment, contrl, "<"))
223 | ```
224 |
225 | This is indeed the $W$ statistic given above. Summing the ranks for the `Treatment` group, though, gives a value of $34$.
226 |
227 | ```{r Rank of treatment group}
228 | sum(rank(df$LDL)[df$Group == "Treatment"])
229 | ```
230 |
231 | To get the $W$ statistic reported by the `wilcox.test()` command, the rank sum of the other (`Control`) group is used, from which the value given by equation (1) above is subtracted.
232 |
233 | ```{r}
234 | sum(rank(df$LDL)[df$Group == "Control"]) - ((6 * (6 + 1)) / 2)
235 | ```
236 |
237 | Using the `conf.int =` argument gives the difference in location.
238 |
239 | ```{r Difference in location}
240 | wilcox.test(LDL ~ Group,
241 | df,
242 | conf.int = TRUE)
243 | ```
244 |
245 | The difference in location is the median of the differences between all $36$ pairs.
246 |
247 | ```{r Difference between pairs}
248 | median(outer(treatment, contrl, "-"))
249 | ```
250 |
251 |
252 | The difference in location can be calculated using bootstrapping.
253 |
254 | ```{r Creating a bootstrapping function}
255 | library(boot)
256 | med.dif <- function(d, i) {
257 |   tmp <- d[i, ]  # the resampled rows of the data frame
258 |   median(tmp$LDL[tmp$Group == "Treatment"]) - median(tmp$LDL[tmp$Group == "Control"])  # difference in median LDL
259 | }
260 | ```
261 |
262 | ```{r}
263 | boot.out <- boot(data = df,
264 | statistic = med.dif,
265 | R = 2000)
266 | median(boot.out$t)
267 | ```
268 |
269 | ## Wilcoxon signed-rank test
270 |
271 | This is the nonparametric equivalent to the _paired-sample t test_.
272 |
273 | To be added...
274 | ## Nonparametric alternatives to correlation
275 |
276 | The assumptions for the use of parametric tests, covered in the chapter on assumptions, apply to correlation. Both variables must meet these assumptions. The exception being when one of the variables is categorical in nature, and then only if the sample space for that variable is binary.
277 |
278 | The first alternative ranks the values before calculating the correlation and is termed *Spearman correlation* and the resultant correlation coefficient is termed *Spearman's* $\rho$. This can be passed as an argument to the `cor.test()` command.
279 |
280 | ```{r Spearman correlation, eval=FALSE}
281 | # x and y are placeholders for two paired numeric vectors
282 | cor.test(x, y, method = "spearman")
283 | ```
283 |
284 | One problem with Spearman's $\rho$ is that it deals poorly with small sample sizes and a large number of tied ranks. *Kendall's* $\tau$ is an alternative non-parametric test for these cases.
285 |
286 | ```{r Kendall correlation, eval=FALSE}
287 | # x and y are placeholders for two paired numeric vectors
288 | cor.test(x, y, method = "kendall")
289 | ```
--------------------------------------------------------------------------------
/07_Correlation_and_linear_models.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Correlation and linear regression models"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = TRUE)
18 | setwd(getwd())
19 | ```
20 |
21 | 
22 |
23 | ## Preamble
24 |
25 | In the previous chapter, we compared the same numerical variable between two or more groups, the groups being created from the sample space elements of a categorical variable.
26 |
27 | Consider now two different numerical variables, measured for each subject in an experiment. This chapter looks at the _correlation_ between these pairs of data point values and also at _linear models_ that predict one of the variables given the other (or more variables).
28 |
29 | Correlation measures the relationship between the change in the values of one of the variables with respect to the other: "How one *varies* as the other *varies*". This variation defines the term *covariance*.
30 |
31 | Univariable linear regression models take one of the numerical variables as a _dependent variable_. Each data point value in this variable is then _predicted_ in a model, based on the data point values for another numerical variable, the _independent variable_. Models with only a single independent (predictor) variable are termed _univariate models_ and models with more than one independent variable are termed _multivariate models_.
32 |
33 | ## Covariance
34 |
35 | Whereas variance looked at the difference between each data point value for a variable and the mean for that variable, covariance considers how two numerical variables vary together. For two variables $x$ and $y$, the covariance can be calculated as shown in equation (1).
36 |
37 | $$cov \left(x,y \right) =\frac{\sum_{i=1}^{n}{\left( x_i - \bar{x} \right) \left( y_i - \bar{y} \right)}}{n-1} \tag{1}$$
38 |
39 | This equation simply goes through every pair (meaning that we need data point values for each variable in every pair, i.e. no missing data). It subtracts the mean of the relevant variable from each data point value in turn and multiplies the two differences. It then adds all of these multiplications and divides by one less than the total sample size (as is done for the sample variance, and as R's `cov()` function does).
40 |
41 | In the code chunk below, we import a spreadsheet file using the built-in `read.csv()` function.
42 |
43 | ```{r Importing the dataset}
44 | df <- read.csv("ProjectIIData.csv",
45 | header = TRUE)
46 | ```
47 |
48 | Now we plot the `CRP` variable against the `Cholesterol` variable. This means that each dot on the plot has an $x$ value, which is the `CRP` value, and a $y$ value, which is the `Cholesterol` value.
49 |
50 | ```{r Scatter plot of the data}
51 | plot(x = df$CRP,
52 | y = df$Cholesterol,
53 | main = "Correlation between CRP and cholesterol",
54 | xlab = "CRP", # Variable listed first
55 | ylab = "Cholesterol", # Variable listed second
56 | las = 1)
57 | ```
58 |
59 | The data point values for the two variables were created such that there is some correlation between them. As the values of `CRP` increase, so do the values of `Cholesterol`. Using the equation above, the covariance can be calculated with the `cov()` command.
60 |
61 | ```{r The covariance between CRP and cholesterol}
62 | cov(df$CRP,
63 | df$Cholesterol)
64 | ```
65 |
66 | The covariance depends on the units in which the measurements took place, though. It remains for us to standardize the covariance and to express a magnitude of this correlation.
67 |
68 | ## Correlation coefficient
69 |
70 | The task remaining is accomplished by calculating the *correlation coefficient*. It standardizes the covariance by dividing it by the product of the standard deviations of each of the variables and is shown in equation (2). Note that the equation cancels the units involved and the result has a range of -1 to +1.
71 |
72 | $$r = \frac{cov \left( x, y \right)}{s_x s_y} \tag{2}$$
73 |
74 | This value *r* is also termed the *Pearson product-moment correlation coefficient* and the *Pearson correlation coefficient*. The code snippet below calculates the correlation coefficient.
75 |
76 | ```{r Calculating the correlation coefficient}
77 | # Using parenthesis to control the order of arithmetical operations
78 | r = cov(df$CRP, df$Cholesterol) / (sd(df$CRP) * sd(df$Cholesterol))
79 | r
80 | ```
81 |
82 | A value of $+1$ expresses absolute positive correlation and a value of $-1$ expresses absolute negative correlation. In more human terms, it expresses the *effect size*. Values beyond $-0.1$ and $+0.1$ represent a small effect size, values beyond $-0.3$ and $+0.3$ a medium effect size, and values beyond $-0.5$ and $+0.5$ a large effect size. This must always be seen in the context of the research question, though.
83 |
84 | ### Significance tests
85 |
86 | While the effect size is arguably of more importance, it remains popular to test for significance. The null hypothesis being that there is no correlation ($r=0$).
87 |
88 | The correlation coefficient can be used to calculate a *z* score. Alas, the sampling distribution of *r* is not normal and *r* has to be transformed. In the R programming language, the *t* statistic transformation is used and is shown in equation (3).
89 |
90 | $$t_r = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}} \tag{3}$$
91 |
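As a quick sketch, this transformation can be verified by hand, reusing the correlation coefficient `r` calculated above and the number of rows in `df`:

```{r Manual t statistic from the correlation coefficient}
n <- nrow(df)
r * sqrt(n - 2) / sqrt(1 - r^2)  # the t statistic of equation (3)
```
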
92 | The `cor.test()` function calculates the *t* statistic, degrees of freedom, *p* value, 95% confidence intervals around the correlation coefficient and the actual correlation coefficient.
93 |
94 | ```{r Calculating the Pearson product moment correlation and p value}
95 | cor.test(df$CRP,
96 | df$Cholesterol)
97 | ```
98 |
99 | We note a _t_ statistic, the degrees of freedom, and a _p_ value. There is also a short summary explaining that we accept the alternative hypothesis, i.e. the correlation is not equal to zero. Then we have the $95$% confidence intervals around the correlation coefficient and the actual correlation coefficient. Everything we need to put in a report or manuscript.
100 |
101 | ## Univariate linear regression
102 |
103 | In univariate linear regression, we aim to build a model that will predict the value of a dependent numerical variable, given a single independent numerical variable. In our contrived example, we might want to predict the `Cholesterol` data point value of a patient, given the `CRP` data point value.
104 |
105 | Univariate linear models are straight lines. Consider the pairs of values created below. The dependent, `dep` values are simply twice the independent, `indep`, values.
106 |
107 | ```{r Creating data}
108 | indep <- c(1:10)
109 | dep <- 2 * indep # Using broadcasting to multiply each value in indep by 2
110 | ```
111 |
112 | Let's plot this as a scatter plot.
113 |
114 | ```{r Scatter plot of created data}
115 | plot(indep,
116 | dep,
117 | main = "Perfectly correlated data",
118 | xlab = "Independent variable",
119 | ylab = "Dependent variable",
120 | las = 1)
121 | ```
122 |
123 | When a patient has an independent variable data point value of $4$, we see that the data point value for their dependent variable is $8$. A straight line can be drawn through all the data point values. You might remember from school that the equation for a straight line is $y = mx + c$, where $y$ is the dependent variable value, $m$ is the slope (rise-over-run), $x$ is the independent variable value, and $c$ is the $y$ intercept (that is, where $x = 0$).
124 |
125 | The equation for our perfect model would be $y = 2x$. Plugging in any value for $x$, will give us a predicted value for $y$. This is what a linear model does. It allows us to take _unseen_ data, i.e. for new patients, plug their (independent) data point value into the model and get a predicted (dependent) variable value. A common reason for building models like this is when the dependent variable is expensive or difficult to measure.
126 |
127 | Below, we add the model's _regression line_, using the `abline()` function. The `lm()` function used as argument here, will be the focus of this section.
128 |
129 | ```{r A regression line for our perfect model}
130 | plot(indep,
131 | dep,
132 | main = "Perfectly correlated data",
133 | xlab = "Independent variable",
134 | ylab = "Dependent variable",
135 | las = 1)
136 | abline(lm(dep ~ indep))
137 | ```
138 |
139 | As mentioned, we can now plug in any independent variable value and the model will calculate the value for the dependent variable. For instance, a patient with an independent variable value of $3.5$ will be predicted to have a dependent variable value of $2 \times 3.5 = 7$.
140 |
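As a minimal sketch (reusing the `indep` and `dep` vectors created above), the `predict()` function returns the model's prediction for such an unseen value:

```{r Predicting for an unseen value}
predict(lm(dep ~ indep),
        newdata = data.frame(indep = 3.5))  # the predicted dependent value of 7
```
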
141 | So, let's look at the `lm()` function. It stands for _linear model_. A linear model is a fantastic beast. It calculates a line that minimizes the error between predicted values and true values. Just to be clear, when given new data such as the $3.5$ above, our model predicts a dependent variable value of $7$. It might be that we have just such a patient in the dataset that we kept a secret and that their dependent variable value was actually $8$. The difference between the predicted and actual values is the _prediction error_, or more commonly, the _residual_.
142 |
143 | For fun, we use the `lm()` function on our perfect data. The tilde, `~`, symbol builds the formula for us, such that it reads: _"Predict the dependent variable value given the independent variable."_.
144 |
145 | ```{r Linear model of our perfect data}
146 | perfect_model <- lm(dep ~ indep)
147 | summary(perfect_model) # Printing a summary of the model
148 | ```
149 |
150 | The `summary()` function of our model shows the residuals. Since there were no differences between the predicted and actual dependent variable values for all $10$ samples, we see the summary statistics of $10$ zero values. Note the rounding constraints of the calculation giving extremely small values, which are in fact equivalent to zero.
151 |
152 | Under the `Coefficients` section we see an `Estimate`. The `(Intercept)` value is also zero (bar the rounding) and is the $y$ intercept for the straight line (when $x = 0$), i.e. given an independent variable value of $0$, we expect a dependent variable value of $0$. The `Estimate` for the `indep` variable is $2$, the slope of our line. For every single unit rise in the value of our independent variable, the value of the dependent variable will rise by $2$.
153 |
154 | In real life, we never see perfect models. We even get a warning at the start of the summary pointing to the fact that our data might be a bit of wishful thinking. Let's return to our CRP and cholesterol values model. We use the `lm()` function to calculate a model.
155 |
156 | ```{r A linear model for our dataset}
157 | uni_lin_model <- lm(Cholesterol ~ CRP, # Using formula notation
158 | data = df) # Specifying the data frame
159 | summary(uni_lin_model) # Printing a summary of the model
160 | ```
161 |
162 | Above, we did not use `$` notation, but instead specified the data frame object as second argument after the formula.
163 |
164 | Inspecting the model results, we see the summary statistics of all the residuals (the differences between the model's predicted dependent variable values and the actual values). We also see the intercept and slope for the regression line. Below, we plot our data point values and the regression line.
165 |
166 | ```{r Plotting the data and regression line}
167 | plot(x = df$CRP,
168 | y = df$Cholesterol,
169 | main = "Correlation between CRP and cholesterol",
170 | xlab = "CRP", # Variable listed first
171 | ylab = "Cholesterol", # Variable listed second
172 | las = 1)
173 | abline(uni_lin_model)
174 | ```
175 |
176 | For any `CRP` value our model predicts a `Cholesterol` value. For most data point values you can see that the prediction is well off the actual value. This error is minimized by this line, though. Any other line would have bigger errors.
177 |
178 | We can also see an adjusted $R^2$ value. This is an adjustment to the square of the correlation coefficient we learned about above and is termed the _coefficient of determination_. In essence, it states that our model explains $22$% of the variance in `Cholesterol`. As an effect size, this is not very good, but we note a _p_ value of less than $0.05$, declaring our model statistically significant. These values have to be interpreted in context!
179 |
180 | ## Multivariate linear regression
181 |
182 | As mentioned, we could have more than one independent variable. For two, the result would be a plane rather than a line, and for more than two variables, a hyperplane. This is unimportant for our discussion. The `lm()` function makes the calculation very easy.
183 |
184 | Below, we use both `CRP` and `Triglyceride` to predict `Cholesterol` (again, not very realistic, but it suffices for explanation).
185 |
186 | ```{r Multivariate model}
187 | mult_lin_model <- lm(Cholesterol ~ CRP + Triglyceride,
188 | data = df)
189 | summary(mult_lin_model)
190 | ```
191 |
192 | We see similar summary results, except that we now have two independent variables, each with their own estimate and _p_ value. Overall the adjusted $R^2$ value is slightly higher and this model is better than our original.
193 |
194 | ## Conclusion
195 |
196 | Correlation and linear regression models are fantastic tools for numerical data analysis and very easy to use in R.
--------------------------------------------------------------------------------
/03_Summary_statistics.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "03 Summary statistics"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | 
17 |
18 | ```{r setup, include=FALSE}
19 | knitr::opts_chunk$set(echo = TRUE)
20 | #setwd(getwd())
21 | ```
22 |
23 | ```{r Libraries, message=FALSE, warning=FALSE}
24 | library(readr)
25 | library(dplyr)
26 | ```
27 |
28 | ```{r Setting the working directory}
29 | setwd(getwd())
30 | ```
31 |
32 |
33 | ## Preamble
34 |
35 | By now you have a firm grasp of the use of R and can import and manipulate data. Our job now is to start to understand our data. Data carries hidden knowledge within its many rows and columns. As human beings, we are incapable of seeing that information by just staring at the values in a spreadsheet. Instead, we summarize the data, calculating single values that are representative of the data. In the case of sample data, these summary values are called _statistics_.
36 |
37 | _Summary statistics_, also known as _descriptive statistics_, is the first step in the analysis of data. It starts the process of unraveling the information in the data. It also provides us with a good insight into the data and is a guide as to what we can expect from the statistical tests that we will ultimately conduct.
38 |
39 | In this chapter, we will look at _measures of central tendency_ and _measures of dispersion_. The former calculates a single value that stands as representative of all the data point values for a variable. These include the well known _mean_, _median_, and _mode_, with which most of us are familiar. The latter includes _variance_, _standard deviation_, _range_, and _quantiles_. They give us an idea of the spread of the data point values.
40 |
41 | All of these statistics are very simple to calculate in R. To do so, we will continue on our journey of discovering the verbs (functions) of the `dplyr` library. They make analysis of an imported spreadsheet file a breeze.
42 |
43 | ## Measures of central tendency
44 |
45 | ### Mean
46 |
47 | The _mean_ or the _average_ is the sum total of all the data point values of a numerical variable divided by the total number of data point values. For $n$ data point values, each designated as $x_i$, we note the equation for the mean in (1) below.
48 |
49 | $$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\tag{1}$$
50 |
51 | These equations are not important, but a small number are included throughout this textbook, only for those interested in them.
52 |
53 | The values $10, 11$, and $12$ sum to $33$ and, since there are three values, we have a mean of $11$. The `mean()` function is used to calculate the mean.
54 |
55 | ```{r Calculating the mean}
56 | mean(c(10, 11, 12)) # Place the values in the c() function
57 | ```
58 |
59 | We are interested in using spreadsheet data, though, so let's import the dataset that we used before. Make sure that you have imported the `readr` and `dplyr` libraries and that you have set the working directory after saving this R markdown file and the spreadsheet file in the same directory (folder).
60 |
61 | ```{r Importing the spreadsheet file, message=FALSE, warning=FALSE}
62 | df <- read_csv("ProjectData.csv")
63 | ```
64 |
65 | Let's have a look at the statistical variables (column names in row $1$).
66 |
67 | ```{r Statistical variables}
68 | names(df)
69 | ```
70 |
71 | We can calculate the mean of the ages of all the patients.
72 |
73 | ```{r Calculating the mean of all the patients}
74 | mean(df$Age)
75 | ```
76 |
77 | If we want to round this value to the nearest integer (whole number), we can use the `round()` function. The first argument will be our `mean()` function and the second will be the number of decimal places that we want. In this case, zero.
78 |
79 | ```{r Rounding the mean of the ages}
80 | round(mean(df$Age),
81 | digits = 0)
82 | ```
83 |
84 | While this is interesting information, we probably want to divide the patients up into groups based on the sample space of one of the other variables. This is easily achieved when the variable is categorical. Even a numerical variable can be used to divide the patients into groups, though, as we saw in the previous chapter. This is achieved by selecting cut-off values. As an example, we could arbitrarily decide that all patients younger than $60$ are _young_ patients and all those that are $60+$ are _old_ patients. For this example, we will choose the `Group` variable. From the previous tutorial, we remember that the sample space included two values: `I` and `II`.
85 |
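As a quick aside, a sketch of such an age-based cut-off using the `dplyr` `mutate()` verb (purely illustrative; we continue with the `Group` variable below):

```{r Age groups from a cut-off}
df %>% 
  mutate(AgeGroup = if_else(Age < 60, "young", "old")) %>% 
  group_by(AgeGroup) %>% 
  summarise(average_age = mean(Age))
```
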
86 | While we could create two computer variables to hold vector objects, we can be more creative and use the `group_by` `dplyr` verb. We can combine it with the `summarise()` verb. An example might be the easiest way of explaining their use.
87 |
88 | ```{r Calculting the mean age of each of the groups}
89 | df %>%
90 | group_by(Group) %>%
91 | summarise(average_age = mean(Age))
92 | ```
93 |
94 | The code chunk above reads almost like an English sentence. We start with the tibble, `df`. We then pipe it to the `group_by()` verb, which does what it says! We group the data by the chosen variable (in this case the `Group` variable). We then pipe this grouping to the `summarise()` verb. The argument in the `summarise()` verb is new. First, we create a name that carries some meaning. Here, we used the name `average_age`. If we gave this code to someone else, or if we read it months later, we will know what it was meant to do. On the right-hand side of the equal sign comes the summary statistic that we want, the mean of the `Age` variable. From the result you will see that the mean age of each of the two groups was calculated.
95 |
96 | In the code chunk below, we do the same for the `Survey` column.
97 |
98 | ```{r Mean of each of the survey answers}
99 | df %>%
100 | group_by(Survey) %>%
101 | summarise(average_age = mean(Age))
102 | ```
103 |
104 | We note that the mean age of the patients choosing any one of the five ordinal categorical values for the simulated survey question was nearly the same. Perhaps those that chose $5$ were slightly younger.
105 |
106 | ### Median
107 |
108 | To calculate the median, we place all the numerical data point values for a variable in ascending or descending order. We then simply divide them into two equal sets. If there is an odd number of values, we take the middle value to be the median. If there is an even number of values, we take the average of the two middle values.
109 |
110 | As an example, the median of $11,12,16,19$ and $125$ is $16$ since there are two values less than and two values greater than $16$. If the values were $10,12,12,14,15,17,19$ and $125$, we note that $14$ and $15$ are the middle two values, with three values on either side of them. This makes the median $14.5$. Calculate the mean and the median of these two sets of values and see the big difference between the two statistics for each case.
111 |
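A quick check of both examples using the `median()` function:

```{r Median of the two example sets}
median(c(11, 12, 16, 19, 125))
median(c(10, 12, 12, 14, 15, 17, 19, 125))
```
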
112 | Let's calculate both the mean and the median ages of the two groups created from the `Group` column. We will once again create an appropriate name for our median calculation.
113 |
114 | ```{r Mean and median of the ages of the two group}
115 | df %>%
116 | group_by(Group) %>%
117 | summarise(mean_age = mean(Age),
118 | median_age = median(Age))
119 | ```
120 |
121 | The median is great for values that are not symmetric around the mean. For instance, the mean of $12,14,16,15,12,13,14,13$ and $100$, would be a misrepresentation of the values. The $100$ will bring the mean up to a value that is no longer a fair representative of all of the data point values.
122 |
123 | ### Mode
124 |
125 | The mode is very useful for categorical variables. It simply returns the data point value that occurs most frequently. If there are two values that are tied as the most frequent, both are listed and the variable is said to be _bi-modal_; with three we get _tri-modal_ and, with even more, _multi-modal_.
126 |
127 | To count the number of unique data point values we use the `table()` function.
128 |
129 | ```{r Counting the unique values in a categorical variable}
130 | table(df$Survey)
131 | ```
132 |
133 | From the result we see that the mode was `1`, i.e. the value that was selected most in the survey question.
134 |
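If we want R to return the mode directly, a small sketch using `which.max()` on the frequency table works:

```{r Extracting the mode from the frequency table}
names(which.max(table(df$Survey)))  # the sample space element with the highest count
```
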
135 | Another way we can go about it is with the now familiar `group_by()` and `summarise()` functions. We use the `n()` function, without an argument, to do the counting.
136 |
137 | ```{r}
138 | df %>% group_by(Survey) %>% summarise(survey_count = n())
139 | ```
140 |
141 | ## Measures of dispersion
142 |
143 | Measures of dispersion give us an idea of how spread the data is. Let's have a look at the common types.
144 |
145 | ### Minimum and maximum
146 |
147 | The minimum and maximum values are the most extreme form of looking at dispersion. They list the smallest and the largest values. The difference between the two is known as the _range_.
148 |
149 | In the code chunk below, we split the patients according to the `Group` variable and calculate the minimum and maximum ages. Once again, we give these values an appropriate name of our choice. The functions for minimum and maximum are `min()` and `max()`.
150 |
151 | ```{r Calculating the minimum and maximum ages of the two groups}
152 | df %>%
153 | group_by(Group) %>%
154 | summarise(min_age = min(Age),
155 | max_age = max(Age))
156 | ```
157 |
158 | The `range()` function takes a single vector of values as an argument. It cannot be used with the `dplyr` `group_by()` function. Let's calculate the minimum and maximum value of the ages of all the patients.
159 |
160 | ```{r Range of the ages of all the patients}
161 | range(df$Age)
162 | ```
163 |
164 | ### Standard deviation
165 |
166 | The standard deviation expresses, roughly, the average difference between each value and the mean. The `sd()` function accomplishes the task. It is shown in equation (2) for a sample.
167 |
168 | $$s = \sqrt{\frac{\sum_{i=1}^{n} {\left( x_i - \bar{x} \right)}^{2}}{n-1}} \tag{2}$$
169 |
170 | You should be familiar with the syntax by now. Below is the mean and standard deviation of the ages of each of the two groups.
171 |
172 | ```{r Standard deviation of the ages of each of the two groups}
173 | df %>%
174 | group_by(Group) %>%
175 | summarise(average_age = mean(Age),
176 | std_age = sd(Age))
177 | ```
178 |
179 | Calculating the variance gives us a better appreciation of the standard deviation.
180 |
181 | ### Variance
182 |
183 | To calculate the standard deviation, i.e. roughly the average difference between each data point value and the mean of a variable, we do subtraction. Since some of the values will be less than and some larger than the mean, we will get negative and positive numbers. If we summed them all up and divided by how many there are, we would get an answer of zero. Therefore, we square every difference. Squaring results in all values being positive. Now we can divide by one less than the number of values (for a sample), as in equation (2). This is the variance. It is the square of the standard deviation, or, seen from a different point of view, the standard deviation is the square root of the variance.
184 |
185 | Below we calculate the variance of the ages of the two groups.
186 |
187 | ```{r Variance of the ages of the two groups}
188 | df %>%
189 | group_by(Group) %>%
190 | summarise(var_age = var(Age))
191 | ```
192 |
193 | We can square the standard deviation of the group I age to check out our definition.
194 |
195 | ```{r Squaring the standard deviation}
196 | 11.53006^2
197 | ```
198 |
199 | Bar the small rounding error, we indeed have the variance.
200 |
201 | ### Quantiles
202 |
203 | Where the median divides the set of numerical data point values into two halves, we can make this division at any arbitrary fraction of the whole. If we split it into the smallest quarter and the larger three quarters, we have what is referred to as the _first quartile_, also known as the _twenty-fifth percentile_. The median is then actually the _second quartile_ or fiftieth percentile. The three-quarter to one-quarter split is the _third quartile_ or the seventy-fifth percentile. For obvious reasons the minimum is the _zeroth quartile_ and the maximum is the _fourth quartile_. Any percentile can be used, but the ones listed are the most common to use.
204 |
205 | The `quantile()` function performs the duty in this instance. It is best to create separate vectors when using the `quantile()` function.
206 |
207 | ```{r Creating an age vector for group I patients}
208 | group_I_age <- df %>% filter(Group == "I") %>% select(Age)
209 | group_I_age <- as.vector(group_I_age$Age) # Overwrite as a vector of values
210 | ```
211 |
212 |
213 | ```{r Calculating the quantiles for the vector of values}
214 | quantile(group_I_age,
215 | probs = c(0, 0.25, 0.5, 0.75, 1))
216 | ```
217 |
218 | ### Interquartile range
219 |
220 | The interquartile range is the difference between the third and the first quartile values. It is used to detect statistical outliers. If we multiply the interquartile range by $1.5$ and add this to the third quartile and subtract it from the first quartile, we get bounds, beyond which any value is a suspected statistical outlier.
221 |
222 | The `IQR()` function is used.
223 |
224 | ```{r Calculating the IQR for the age of each group}
225 | df %>%
226 | group_by(Group) %>%
227 | summarise(iqr_Age = IQR(Age))
228 | ```
229 |
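As a small sketch (reusing the `group_I_age` vector created above), the outlier bounds described here follow directly from the first and third quartiles:

```{r Outlier bounds from the interquartile range}
q1 <- unname(quantile(group_I_age, 0.25))
q3 <- unname(quantile(group_I_age, 0.75))
iqr <- q3 - q1
c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)  # suspected outlier bounds
```
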
230 | ## Conclusion
231 |
232 | Summary statistics provide our first insight into the knowledge hidden in our data. The use of the `dplyr` library makes it easy for us to extract the data point values that we require. In combination with the standard R summary statistics functions, we can easily start to understand our data.
--------------------------------------------------------------------------------
/11_Logistic_regression.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Analyzing categorical outcomes"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = TRUE)
18 | setwd(getwd()) # Spreadsheet file and R markdown file in same folder
19 | ```
20 |
21 | ```{r Libraries, message=FALSE, warning=FALSE}
22 | library(dplyr)
23 | ```
24 |
25 | 
26 |
27 | ## Preamble
28 |
29 | In the chapter on linear regression, we considered a dependent outcome variable, which was numerical in nature. We then used one or more independent variables, also numerical, to create a model to predict the dependent variable.
30 |
31 | In logistic regression, the dependent variable is categorical in nature. In most cases, we deal with a dichotomous variable, with a sample space of, for example, _Yes_ and _No_.
32 |
33 | When using a single independent variable, we refer to _simple logistic regression_ and when using more than one independent variable, the term used is _multivariable logistic regression_. To be clear, these independent variables may be numerical or categorical. The latter requires some consideration. When there is more than one dependent variable, as seen in repeated-measures models, we use the term _multivariate logistic regression_. [Hidalgo B, Goodman M. Multivariate or multivariable regression? Am J Public Health. 2013;103(1):39-40.]
34 |
35 | ## Simple logistic regression
36 |
37 | As mentioned, here we consider a single independent variable. We will start with the simplest case of a _numerical_ independent variable. Below, we import the `ProjectIIData.csv` spreadsheet file as a data frame object named `df`.
38 |
39 | ```{r Import data}
40 | df <- read.csv("ProjectIIData.csv")
41 | ```
42 |
43 | The `names()` function shows a list of the statistical variables (column headers in row one) in the dataset.
44 |
45 | ```{r List the statistical variables}
46 | names(df)
47 | ```
48 |
49 | As a simulated dataset, we will consider the `Improvement` variable as our dependent variable. It is dichotomous as we can see from the `table()` function below.
50 |
51 | ```{r Sample space element count of Improvement variable}
52 | table(df$Improvement)
53 | ```
54 |
55 | One of the sample space elements has to be selected as the positive outcome, i.e. the one which we are trying to predict. The `levels()` function can tell us this order.
56 |
57 | ```{r Checking the levels}
58 | levels(df$Improvement)
59 | ```
60 |
61 | We are interested in the last element, which is _Yes_ in this case. Fortunately, this was correctly done by R. If not, we can specify the levels using the `factor()` function.
62 |
63 | Now we consider the `CRP` variable as the independent variable. The `glm()` function, short for _generalized linear model_, allows us to investigate how well `CRP` predicts `Improvement`.
64 |
65 | ```{r Univariate model of CRP and Outcome}
66 | crp_outcome <- glm(Improvement ~ CRP, # Formula
67 | data = df, # Data object
68 | family = "binomial") # Binomial for dichotomous outcome
69 | ```
70 |
71 | The `glm()` function takes a formula as input. Above we state that we want the `Improvement` outcome predicted by the `CRP` value. The `family =` argument allows us to choose a setting based on the dependent variable. Since this case represents a dichotomous variable, we set this to `"binomial"`.
72 |
73 | The `summary()` function provides a summary of our model.
74 |
75 | ```{r Summary of the findings of our model}
76 | summary(crp_outcome)
77 | ```
78 |
79 | As with linear regression, logistic regression attempts to fit a straight line. This is not as _straightforward_ as it may seem, as the dependent variable only has two outcomes. To be sure, these were changed to values of $0$ and $1$ to do the calculations, based on the levels that we set before. The straight line is actually fitted to the logarithm of the odds of the outcome (the log-odds).
80 |
81 | With this transformation we still have a $y$ intercept (with its estimate) and a `CRP` estimate. The latter has to be transformed back by taking its exponent.
82 |
83 | ```{r Calculating the odds ratio}
84 | exp(-0.03983)  # The CRP coefficient estimate copied from the summary above
85 | ```
86 |
87 | What we have here is the odds ratio. Anything less than $1.0$ lessens the odds of our outcome of interest, which was _Yes_ for improvement. For every one unit increase in CRP, we have a lessening of the odds of an improved outcome.
88 |
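Rather than copying the coefficient by hand, a small sketch that extracts and exponentiates the coefficients directly from the model object:

```{r Odds ratios from the model coefficients}
exp(coef(crp_outcome))  # the intercept and CRP coefficient expressed as odds ratios
```
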
89 | We note a _p_ value of $0.48$, which is not significant at an $\alpha$ value of $0.05$. We are also interested in the $95$% confidence interval around the odds ratio. This is achieved using the `confint()` function. We add the `exp()` function to transform the results to odds ratios.
90 |
91 | ```{r Confidence interval of the odds ratio}
92 | exp(confint(crp_outcome))
93 | ```
94 |
95 | We can see from this why the _p_ value was not significant. The interval shows both a value less than $1.0$ and a value of more than $1.0$.
96 |
97 | What we don't see is a coefficient of determination, $R^2$. Instead, there are criteria to examine a model. One of the best known is the Akaike information criterion (AIC). We want to see this value as low as possible, as this indicates more information provided by the model. This is information gain or loss of uncertainty.
98 |
99 | So, how does this all compare to a linear model for a single independent variable?
100 |
101 | Given one of the independent variable data point values, we can calculate the predicted value of the dependent variable, for the same subject. The predicted value is numerical in nature.
102 |
103 | In logistic regression, we shift to a probability of an outcome. Since the dependent variable is categorical, we can only predict the probability of the _positive_ outcome, which is _Yes_ in this case. So, for a given sample, we predict the probability of the dependent variable for that subject being _Yes_. The cut-off is usually $0.5$. At or above this probability, the model predicts a dependent variable data point value of _Yes_, otherwise, it will predict a _No_.
104 |
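A minimal sketch (reusing the `crp_outcome` model created above) of turning predicted probabilities into predicted classes with the $0.5$ cut-off:

```{r Predicted probabilities and classes}
probs <- predict(crp_outcome, type = "response")  # probability of Yes for each subject
preds <- ifelse(probs >= 0.5, "Yes", "No")        # apply the 0.5 cut-off
head(preds)
```
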
105 | ## Multivariable logistic regression
106 |
107 | We can attempt to improve our model by introducing more independent variables. This is easily achieved with the formula used as the first argument of the `glm()` function. We _add_ additional independent variables by using `+` notation.
108 |
109 | ```{r Multivariate logistic regression}
110 | crp_chol_outcome <- glm(Improvement ~ CRP + Cholesterol,
111 | data = df,
112 | family = "binomial")
113 | summary(crp_chol_outcome)
114 | ```
115 |
116 | Now we see an individual estimate (which we will convert into an odds ratio) for each of the independent variables. We also note that the AIC has increased. This model therefore has negative information gain and is a worse model.
117 |
118 | ## Adding a dichotomous nominal variable to the model
119 |
120 | From our data we also have a `Group` variable.
121 |
122 | ```{r Counts of the Group variable data point values}
123 | table(df$Group)
124 | ```
125 |
126 | From the `table()` function we know that it is dichotomous. We need to assign one of the elements in this sample space as the element of investigation. Imagine this simulated study contains two groups, one being the control group and the other receiving an active intervention. These are coded as `A` and `C` in the dataset. We are obviously interested in the active group. To encode this as our element of choice, we must set the levels. Currently, `C` is listed second and is the element that will be coded as $1$. We can change this using the `factor()` function.
127 |
128 | ```{r Changing the sample space element of choice}
129 | df$Group <- factor(df$Group,
130 | levels = c("C", "A")) # Listed in reverse order
131 | table(df$Group)
132 | ```
133 |
134 | We now see that `A` is listed second and is the value under consideration. Let's add this to the model.
135 |
136 | ```{r Adding Group to the model}
137 | crp_chol_group_outcome <- glm(Improvement ~ CRP + Cholesterol + Group,
138 | data = df,
139 | family = "binomial")
140 | summary(crp_chol_group_outcome)
141 | ```
142 |
143 | Our model now has a lower AIC. We also note that `Group` has a significant _p_ value to boot. We need to look at the odds ratio and the $95$% confidence interval around the odds ratio for `Group`.
144 |
145 | ```{r OR and 95 percent CI for Group}
146 | exp(0.669787) # Exponent of the coefficient of the Group variable
147 | exp(confint(crp_chol_group_outcome))
148 | ```
149 |
150 | We see an odds ratio of $1.95$ and a $95$% confidence interval around this of $1.23 - 3.11$ for the `Group` variable. Both the lower and upper confidence interval values are higher than one, indicating higher odds of the outcome (_Yes_ for improvement) for those in group A.
151 |
152 | ## Adding a multi-level nominal categorical variable to the model
153 |
154 | Now that we know how to interpret the summary of a model that includes a binary categorical variable, we have to consider a model with an independent variable with more than two elements in its sample space.
155 |
156 | In the code below, we use the `mutate()` function from the `dplyr` package, just as a review. We can add new columns with it. Below, then, we add a variable called `Stage` and draw $300$ random data point values for it. It has a sample space of four elements.
157 |
158 | ```{r Creating a new variable called Stage}
159 | set.seed(1) # For reproducible results
160 | df <- df %>% mutate(Stage = sample(c("I", "II", "III", "IV"),
161 | size = 300,
162 | replace = TRUE))
163 | names(df) # List the variables
164 | ```
165 |
166 | We can take a look at the frequency count of our new variable using the `table()` function.
167 |
168 | ```{r Frequency count of the new Stage variable}
169 | table(df$Stage)
170 | ```
171 |
172 | When we think of our research question, we have to select one of the elements in the sample space of this new variable as the _baseline_ element. We want to examine the odds ratios for the other sample space elements against this element. To make absolutely sure that R sees this new variable as a factor, we use the `levels =` argument in the `factor()` function. We state the sample space elements in the order required.
173 |
174 | ```{r Specifying the levels for the new variable Stage}
175 | df$Stage <- factor(df$Stage,
176 | levels = c("I", "II", "III", "IV"))
177 | ```
178 |
179 | Using the `levels()` function, we can confirm our chosen order.
180 |
181 | ```{r Confirming the new levels}
182 | levels(df$Stage)
183 | ```
184 |
185 | Now for the model. We simply add the `Stage` variable to the formula.
186 |
187 | ```{r Creating a new model including Stage}
188 | crp_chol_group_stage_outcome <- glm(Improvement ~ CRP + Cholesterol + Group + Stage,
189 | data = df,
190 | family = "binomial")
191 | summary(crp_chol_group_stage_outcome)
192 | ```
193 |
194 | First off, note that our AIC has increased, making this a worse model than our prior attempt. Note also the `StageII`, `StageIII`, and `StageIV` values. Since we set the first `level` value as `I`, we are indeed asking what the odds ratios are for the other stages against stage `I`.
195 |
196 | ## Forward and backward stepwise methods
197 |
198 | The question that arises is which variables to include in a model. There are two popular methods for going about this, called the _backward_ and the _forward_ stepwise methods. Note that there are mixed methods as well.
199 |
200 | Below, we consider the forward stepwise method. Here, we start with an _empty_ model, consisting only of the intercept.
201 |
202 | ```{r}
203 | simple_model <- glm(Improvement ~ 1,
204 | data = df,
205 | family = "binomial")
206 | summary(simple_model)
207 | ```
208 |
209 | We have none of our independent variables in this simple model. We note an AIC of $417.41$. The aim is to improve on the model. The `step()` function in R will attempt to achieve this by adding our variables (as listed in the `scope =` argument) one by one. Note that we use here the model that contained all of the variables of interest and extract the formula from that model.
210 |
211 | The `step()` function will keep those variables that decrease the AIC and omit those that do not. The first argument is our simplest model. We then state the `direction =` as `forward`. Finally, the formula of the _full_ model.
212 |
213 | ```{r Forward step logistic regression}
214 | step(simple_model,
215 | direction = "forward",
216 | scope = formula(crp_chol_group_stage_outcome))
217 | ```
218 |
219 | The result of the _best model_ is given. We see that it contains the variables `Group` and `Cholesterol`, with an AIC of $411.5$. We can use these variables in our final model.
220 |
221 | ```{r Our best model}
222 | best_model_forward <- glm(Improvement ~ Group + Cholesterol,
223 | data = df,
224 | family = "binomial")
225 | summary(best_model_forward)
226 | ```
227 |
228 | From this we can extract the odds ratios and the $95$% confidence intervals as before.
229 |
230 | With the backward stepwise method, we start with our full model. R will remove variables one by one and only keep those that decrease the AIC.
231 |
232 | ```{r}
233 | step(crp_chol_group_stage_outcome,
234 | direction = "backward")
235 | ```
236 |
237 | We end up in this case with the same variables for a best model.
238 |
239 | ## Conclusion
240 |
241 | Logistic regression is a powerful modeling tool for predicting a categorical outcome. It forms the basis of many scoring systems.
242 |
243 | Care must be taken in its interpretation, though. While we can create good models, they do not always generalize successfully to other populations.
244 |
--------------------------------------------------------------------------------
/ProjectData.csv:
--------------------------------------------------------------------------------
1 | Age,Difference,CRP,Group,sBP,Weight,SideEffects,Survey
2 | 60,-1.7,5.8,I,104,221,No,5
3 | 65,0.8,8.5,I,128,144,No,5
4 | 53,0.5,8,I,128,146,No,3
5 | 57,0.2,7.7,II,123,126,No,4
6 | 52,-0.8,6.9,I,121,208,Yes,5
7 | 56,0.7,7.8,I,131,168,No,4
8 | 54,-0.3,7.1,II,127,210,Yes,4
9 | 51,1.1,8.4,II,138,150,No,4
10 | 65,-0.5,7.4,I,119,115,No,1
11 | 60,-1.7,5.3,II,110,155,Yes,3
12 | 60,-2,5.8,II,101,139,No,2
13 | 58,0.3,8.1,II,131,192,No,2
14 | 54,-1.1,6.3,I,111,217,No,1
15 | 50,-0.5,7.2,I,121,172,No,4
16 | 52,0,7.7,I,129,148,Yes,1
17 | 52,-0.6,7,II,118,168,No,4
18 | 60,0.6,7.8,I,135,119,Yes,3
19 | 61,1.4,8.7,I,137,230,No,3
20 | 53,0.6,7.6,II,131,111,No,2
21 | 48,-0.2,7.5,II,127,122,Yes,5
22 | 63,0.1,7.6,I,125,199,No,1
23 | 46,0.6,8.5,II,131,183,No,2
24 | 45,1.2,9.2,II,137,146,No,1
25 | 45,-0.5,7.2,II,116,168,No,1
26 | 46,0.1,7.2,I,130,199,Yes,5
27 | 61,-2,5.6,I,109,168,No,3
28 | 57,0.3,8.1,I,128,141,Yes,2
29 | 47,-0.3,7.2,II,118,139,No,2
30 | 54,-0.6,6.9,I,120,225,No,2
31 | 62,-0.7,6.8,I,120,166,Yes,2
32 | 46,0.1,7.7,II,126,221,No,5
33 | 49,-1.3,6.4,II,114,172,No,5
34 | 58,-0.6,7.2,I,116,141,No,3
35 | 50,-1.3,5.7,II,111,210,No,2
36 | 58,0.7,8.2,II,132,133,Yes,5
37 | 53,-0.2,7.5,II,124,161,No,4
38 | 65,0.4,7.5,I,132,223,Yes,5
39 | 59,0.8,8.5,I,129,223,No,4
40 | 57,-0.1,7.6,I,127,164,No,4
41 | 65,-1.4,6.1,II,112,206,Yes,3
42 | 64,0,7.2,I,122,172,No,4
43 | 45,0.7,8.3,I,136,117,No,5
44 | 49,-0.9,6.1,II,117,155,No,5
45 | 52,-0.8,6.3,II,117,183,No,5
46 | 59,-1.3,6,I,117,124,Yes,4
47 | 47,-0.5,6.9,II,116,141,No,5
48 | 53,-1,6.3,II,112,117,Yes,5
49 | 54,1.4,9.2,II,144,157,No,2
50 | 55,0.5,7.6,I,133,201,No,1
51 | 54,0.4,7.9,I,134,155,Yes,5
52 | 50,0,7.8,I,121,179,No,2
53 | 45,-1,7,II,119,128,No,2
54 | 48,-0.3,6.7,I,122,228,No,4
55 | 53,-0.7,6.6,I,116,117,No,5
56 | 50,1.4,9.2,II,134,208,Yes,3
57 | 63,-1,6.4,II,119,122,No,3
58 | 64,-1.9,5.3,I,111,232,Yes,2
59 | 54,-2.6,4.5,II,95,170,No,3
60 | 49,0.4,8,II,125,192,No,1
61 | 45,0.7,8.6,II,137,111,Yes,1
62 | 51,-1.3,6.2,I,113,214,No,4
63 | 64,-0.6,6.9,I,118,152,No,1
64 | 56,1,8.8,I,139,217,No,1
65 | 63,0.6,7.8,II,135,152,No,1
66 | 57,-1.4,5.7,I,109,223,Yes,3
67 | 58,-0.8,6.4,I,122,111,No,1
68 | 50,0.6,7.9,II,131,212,Yes,5
69 | 60,-0.4,6.7,II,120,208,No,3
70 | 51,0.7,8.4,I,132,128,No,2
71 | 65,0.9,8,II,138,130,Yes,5
72 | 59,-0.5,6.6,II,125,175,No,3
73 | 49,-0.5,7.2,II,118,206,No,2
74 | 56,0.5,7.6,I,126,219,No,1
75 | 53,1.1,9,I,135,217,No,2
76 | 50,0.8,8.1,I,138,137,Yes,3
77 | 48,0.8,7.9,II,136,188,No,1
78 | 59,2,9.4,I,147,223,Yes,4
79 | 46,0.9,8.8,I,134,144,No,4
80 | 51,-0.5,6.9,II,123,177,No,1
81 | 55,0.3,7.8,II,128,119,Yes,1
82 | 53,0.1,7.7,I,130,192,No,4
83 | 56,0.8,8,II,136,186,No,2
84 | 50,0.6,8.1,II,136,146,No,5
85 | 63,0.3,7.5,II,131,219,No,3
86 | 51,-0.2,7.4,I,128,172,Yes,5
87 | 56,-0.6,7.2,I,116,232,No,4
88 | 56,-0.6,6.8,I,117,150,Yes,1
89 | 63,1.1,8.8,II,133,119,No,4
90 | 61,0.1,7.9,I,128,228,No,5
91 | 56,-0.7,7.3,I,118,210,Yes,1
92 | 62,1.5,9.1,II,136,175,No,4
93 | 50,1.1,8.4,II,132,225,No,1
94 | 60,-0.6,7,I,117,186,No,1
95 | 45,1.7,9.2,II,146,214,No,3
96 | 45,0.6,8.4,II,133,119,Yes,2
97 | 49,-2.2,5.3,II,99,188,No,1
98 | 52,0.6,8.5,I,129,188,Yes,5
99 | 50,0.1,7.7,I,130,135,No,5
100 | 52,0,7.9,I,121,201,No,5
101 | 61,-0.3,7.4,II,119,206,Yes,1
102 | 65,1.8,9.5,I,139,139,No,4
103 | 48,-0.6,7.2,I,119,161,No,1
104 | 64,0,7.9,II,123,144,No,5
105 | 63,1,8.7,II,135,192,No,5
106 | 64,-0.5,7.4,I,117,144,Yes,1
107 | 57,0.2,7.8,II,131,159,No,4
108 | 45,-0.9,6.5,II,121,203,Yes,3
109 | 45,2.7,10.5,II,148,155,No,1
110 | 60,-1.8,5.3,I,102,203,No,1
111 | 49,-0.4,7.3,I,118,197,Yes,1
112 | 58,0.9,8.8,I,130,141,No,2
113 | 60,-0.3,6.8,II,123,232,No,5
114 | 63,-2,5.7,I,100,212,No,4
115 | 62,1.3,8.9,I,139,137,No,1
116 | 59,-1.9,6.1,II,102,212,Yes,1
117 | 63,-0.3,7.3,II,121,175,No,5
118 | 48,-0.2,7.5,I,124,214,Yes,4
119 | 55,-0.1,7.7,II,120,128,No,3
120 | 50,1.4,8.5,II,141,230,No,3
121 | 64,1.5,9.3,II,139,133,Yes,1
122 | 49,2.6,10.3,I,148,159,No,5
123 | 57,0,7.9,I,124,161,No,3
124 | 49,-0.4,7.2,I,123,170,No,3
125 | 63,-1,6.6,II,119,159,No,2
126 | 49,1,8.9,I,138,111,Yes,2
127 | 61,-0.3,7.5,I,125,197,No,4
128 | 53,1,8,II,136,206,Yes,4
129 | 45,-1.7,6.2,II,111,199,No,1
130 | 65,0.2,8.1,I,128,221,No,1
131 | 47,-0.5,6.6,II,117,139,Yes,2
132 | 45,-2.4,4.8,II,100,133,No,2
133 | 46,0.4,8,II,128,230,No,3
134 | 50,1,8.8,I,136,194,No,3
135 | 56,-0.3,7.1,I,125,168,No,3
136 | 46,0.9,8.6,I,136,139,Yes,1
137 | 48,-0.2,6.9,II,123,225,No,2
138 | 55,-0.5,6.8,I,116,166,Yes,3
139 | 59,-0.6,6.8,I,118,157,No,5
140 | 64,0.2,7.2,II,125,168,No,2
141 | 55,0.4,8.3,II,129,152,Yes,3
142 | 64,0.2,7.6,I,130,177,No,3
143 | 57,-0.2,6.8,II,121,144,No,2
144 | 54,1,8.1,II,137,217,No,2
145 | 53,1.1,8.8,II,140,206,No,1
146 | 52,2,9.1,I,147,177,Yes,1
147 | 46,0,7.5,I,129,172,No,1
148 | 61,0.6,8.2,I,132,113,Yes,4
149 | 45,0.6,8.1,II,129,122,No,2
150 | 51,-0.9,6.5,I,117,206,No,3
151 | 47,-1.2,6.5,I,112,166,Yes,4
152 | 49,0.6,8.3,II,136,225,No,2
153 | 53,0.3,7.8,II,130,126,No,4
154 | 60,0.5,7.9,I,135,230,No,5
155 | 53,0,7.1,II,129,197,No,5
156 | 59,2,9.7,II,141,183,Yes,3
157 | 63,-1,6.3,II,115,188,No,1
158 | 61,0.7,8.1,I,137,179,Yes,5
159 | 46,-1.6,5.9,I,107,122,No,5
160 | 61,1.5,8.9,I,138,133,No,4
161 | 60,0.3,7.3,II,128,197,Yes,1
162 | 49,-0.3,7.6,I,120,210,No,5
163 | 49,-0.7,7,I,122,217,No,4
164 | 51,0.4,8.4,II,125,232,No,5
165 | 49,0.1,7.8,II,130,126,No,5
166 | 57,-0.7,7.2,I,115,141,Yes,3
167 | 55,0,7.1,II,124,212,No,3
168 | 56,0,7.9,II,123,230,Yes,2
169 | 63,-1.2,6.1,II,118,221,No,1
170 | 62,-0.7,6.5,I,119,219,No,1
171 | 53,-2.7,4.7,I,97,111,Yes,4
172 | 52,-0.2,7.7,I,127,188,No,3
173 | 62,-0.4,7.3,II,119,183,No,4
174 | 55,-1.6,6.1,I,108,170,No,3
175 | 52,0.8,8,I,137,166,No,4
176 | 59,-0.5,6.8,II,116,179,Yes,4
177 | 38,-1.3,6.5,II,112,155,No,1
178 | 38,0.8,8.7,I,129,197,Yes,5
179 | 38,-0.3,7.1,II,126,124,No,5
180 | 38,0.3,8.1,II,124,186,No,4
181 | 38,-1.2,6.5,II,112,133,Yes,4
182 | 38,0.8,7.8,I,135,115,No,5
183 | 38,0.3,7.9,I,128,175,No,4
184 | 39,-2.3,5.2,I,99,230,No,5
185 | 39,-1.6,5.7,II,104,232,No,5
186 | 39,1.2,8.6,I,135,135,Yes,4
187 | 39,1.8,9.3,I,139,144,No,3
188 | 39,0.3,7.6,II,125,177,Yes,1
189 | 40,-0.3,6.8,II,119,192,No,2
190 | 40,1.2,9.2,I,138,124,No,3
191 | 40,1.3,8.8,II,134,203,Yes,4
192 | 40,0.1,8.1,II,129,181,No,5
193 | 40,0.8,8.1,II,133,150,No,1
194 | 40,-1.3,6.4,I,116,221,No,5
195 | 41,0,7.5,I,127,223,No,4
196 | 41,-0.4,7.2,I,124,122,Yes,5
197 | 41,-0.8,6.4,II,115,221,No,2
198 | 41,1.7,8.9,I,146,188,Yes,3
199 | 41,1.2,9.1,I,140,150,No,1
200 | 41,-0.3,7.2,II,117,199,No,3
201 | 42,-0.7,7.3,II,119,168,Yes,1
202 | 42,0.4,7.4,I,132,194,No,5
203 | 42,-0.6,6.9,II,114,115,No,4
204 | 42,-0.5,6.7,II,123,130,No,4
205 | 42,-1.1,6.9,II,113,181,No,4
206 | 42,0.8,7.9,I,130,126,Yes,2
207 | 42,-0.1,7,I,128,223,No,1
208 | 42,1.2,8.4,I,134,119,Yes,4
209 | 42,0,7.6,II,129,146,No,5
210 | 43,1.1,8.5,I,132,232,No,1
211 | 43,1.2,8.5,I,139,175,Yes,3
212 | 43,1.2,8.7,II,138,119,No,2
213 | 43,1,8.6,II,133,217,No,3
214 | 43,-0.3,7.2,I,125,223,No,3
215 | 43,-0.9,7.1,II,115,115,No,4
216 | 43,-0.9,6.7,II,115,161,Yes,4
217 | 43,-0.8,6.8,II,113,141,No,1
218 | 43,0.6,8.5,I,133,155,Yes,4
219 | 43,0,7.9,I,126,141,No,4
220 | 43,-0.2,7,I,120,139,No,2
221 | 43,-0.1,7.6,II,119,166,Yes,5
222 | 43,-0.6,6.5,I,115,164,No,1
223 | 43,-1.2,6.3,I,110,141,No,1
224 | 44,-1.3,6.1,II,111,111,No,1
225 | 44,-0.1,7,II,124,115,No,4
226 | 44,0.2,8,I,132,212,Yes,3
227 | 44,-0.3,7.6,II,117,201,No,5
228 | 44,-0.1,6.9,II,124,124,Yes,4
229 | 44,-1.1,6.5,II,118,152,No,2
230 | 45,0.5,8.1,I,130,126,No,5
231 | 45,0,7.4,I,126,219,Yes,1
232 | 45,-1.8,6,I,108,159,No,5
233 | 45,0.9,8.2,II,133,150,No,4
234 | 45,-0.2,7.1,I,125,199,No,1
235 | 46,1,8.6,I,134,150,No,2
236 | 46,0.7,8,II,137,137,Yes,3
237 | 46,0,7.4,II,126,122,No,2
238 | 47,-0.7,6.9,I,120,166,Yes,2
239 | 47,0.8,7.9,II,132,203,No,3
240 | 47,-0.5,6.8,II,120,115,No,4
241 | 47,-0.1,7.1,II,129,137,Yes,3
242 | 47,-1.6,5.9,I,106,177,No,4
243 | 47,-0.4,7.1,I,120,183,No,5
244 | 48,0.2,7.8,I,124,139,No,2
245 | 48,-0.8,6.5,II,121,113,No,2
246 | 48,0.5,7.7,I,130,225,Yes,4
247 | 48,-0.2,6.9,I,118,175,No,4
248 | 48,0.7,8.3,II,128,206,Yes,4
249 | 48,-1.9,5.9,II,110,170,No,1
250 | 48,0.5,7.9,I,134,172,No,3
251 | 48,-1.4,6.4,II,111,217,Yes,4
252 | 48,0.6,8.4,II,134,128,No,3
253 | 48,-1,6.6,II,112,161,No,5
254 | 49,-0.3,7.1,I,119,152,No,5
255 | 49,-0.5,7.4,I,123,183,No,4
256 | 49,-0.1,7.4,I,123,150,Yes,2
257 | 49,-0.4,7,II,118,221,No,3
258 | 49,0.6,8,I,127,133,Yes,2
259 | 49,2,10,I,143,148,No,2
260 | 49,0.6,8.1,II,133,225,No,4
261 | 49,-1.4,6.6,II,114,115,Yes,1
262 | 49,-0.2,7.7,I,124,150,No,2
263 | 49,0.9,8.7,II,135,214,No,3
264 | 49,-0.4,6.6,II,120,122,No,5
265 | 50,2.5,10.4,II,152,223,No,1
266 | 50,0.7,8.5,I,127,206,Yes,1
267 | 50,-0.4,7.1,I,121,214,No,3
268 | 50,-1,6.6,I,118,210,Yes,5
269 | 50,0.4,7.6,II,125,111,No,1
270 | 50,-0.3,7,I,120,122,No,2
271 | 50,1.4,8.5,I,138,117,Yes,5
272 | 50,1.2,9.1,II,139,221,No,1
273 | 50,1.6,9.6,II,144,192,No,3
274 | 51,1.1,8.3,I,132,230,No,1
275 | 51,-0.5,7.3,II,122,206,No,3
276 | 51,0.6,8.1,II,132,150,Yes,1
277 | 51,-0.6,6.6,II,123,126,No,1
278 | 51,-0.5,7.1,I,116,150,Yes,3
279 | 52,0.8,8.4,I,132,217,No,2
280 | 52,0.8,8.2,I,133,157,No,4
281 | 52,-0.5,7,II,118,232,Yes,4
282 | 52,0.1,7.5,I,122,221,No,2
283 | 52,2.8,10.6,I,155,170,No,2
284 | 52,0.5,8.3,II,134,212,No,5
285 | 52,-1.7,5.4,II,107,124,No,5
286 | 53,0.4,7.9,I,127,179,Yes,3
287 | 53,-0.1,7.7,II,120,225,No,1
288 | 53,0.8,8.7,II,130,144,Yes,4
289 | 53,0.6,7.9,II,135,164,No,1
290 | 53,-2.1,5.6,I,100,232,No,2
291 | 53,0.4,7.7,I,127,201,Yes,3
292 | 53,-0.8,6.9,I,115,175,No,2
293 | 53,-0.2,7.6,II,120,161,No,1
294 | 53,0,7.4,I,123,161,No,3
295 | 53,-0.1,7.9,I,125,210,No,1
296 | 53,0,7.9,II,128,208,Yes,3
297 | 54,-1,6.7,II,119,221,No,5
298 | 54,1,8.3,I,131,183,Yes,4
299 | 54,-0.9,6.6,II,120,128,No,5
300 | 54,0.2,8.1,II,125,177,No,4
301 | 54,0.3,8.1,II,127,170,Yes,1
302 | 54,-0.8,6.5,I,118,115,No,5
303 | 54,-1.8,5.3,I,110,150,No,3
304 | 54,0.5,7.8,I,134,141,No,3
305 | 55,0,7.3,II,129,186,No,1
306 | 55,-1.6,5.8,I,111,152,Yes,2
307 | 55,-0.1,7.3,I,123,186,No,1
308 | 55,-0.2,7.4,II,125,212,Yes,3
309 | 55,0.4,8.3,II,130,225,No,1
310 | 55,-0.8,6.7,I,117,214,No,3
311 | 55,0.2,7.8,II,128,115,Yes,1
312 | 55,-0.5,7.1,II,117,159,Yes,5
313 | 55,-1.1,6.7,II,113,111,Yes,3
314 | 56,0.2,7.6,I,127,217,No,1
315 | 56,-1.2,6.6,I,111,190,Yes,3
316 | 56,-1,6.1,I,113,137,No,1
317 | 56,-0.1,7.5,II,120,214,No,3
318 | 56,-0.3,7.6,I,123,135,Yes,3
319 | 57,0.5,8.4,I,127,141,Yes,4
320 | 57,0.6,8.3,II,130,113,Yes,1
321 | 57,-1.2,6.8,II,116,166,Yes,1
322 | 57,-1.1,6.8,I,114,190,No,3
323 | 57,-0.5,7.4,II,121,232,Yes,3
324 | 58,-1.2,6.3,II,113,113,No,2
325 | 58,1.3,9,II,138,115,No,3
326 | 58,0.2,8,I,131,135,Yes,1
327 | 58,-0.8,6.6,I,121,144,Yes,1
328 | 58,0.9,8.1,I,134,159,Yes,3
329 | 58,1.3,9,II,138,152,Yes,2
330 | 59,0.5,7.6,I,128,168,No,1
331 | 59,0.4,7.4,I,134,201,Yes,1
332 | 59,-0.7,7.2,II,118,190,No,4
333 | 59,1.5,8.7,II,137,232,No,1
334 | 59,-0.3,7.5,I,119,144,Yes,5
335 | 59,0.9,8.8,II,132,197,Yes,1
336 | 59,-0.9,6.7,II,116,206,Yes,3
337 | 60,-1.3,6.6,II,115,177,Yes,3
338 | 60,-0.4,7.5,I,125,217,No,1
339 | 60,-0.3,7.3,I,127,175,Yes,2
340 | 60,0,7.1,I,125,203,No,2
341 | 60,-0.9,7,II,111,139,No,4
342 | 60,-0.2,7.5,I,119,225,Yes,2
343 | 60,0.2,8,I,124,208,Yes,3
344 | 60,0.8,8.8,II,130,133,Yes,4
345 | 60,0,7.1,II,125,194,Yes,2
346 | 61,-1.9,5.7,I,103,119,No,3
347 | 61,-0.1,7,II,123,232,Yes,3
348 | 61,0.2,7.4,II,125,172,No,2
349 | 61,-0.4,6.6,II,118,192,No,5
350 | 61,0,7.7,I,129,128,Yes,1
351 | 61,0.4,8.3,I,134,217,Yes,1
352 | 61,0.5,8.2,I,126,203,Yes,3
353 | 61,0.3,7.7,II,127,117,Yes,3
354 | 61,0.4,7.5,I,127,111,No,2
355 | 62,0.2,7.5,I,124,225,Yes,3
356 | 62,2.2,9.7,II,151,139,No,3
357 | 62,-0.5,7.3,II,124,221,No,1
358 | 62,0.9,8.8,I,135,228,Yes,1
359 | 63,0.5,7.9,II,130,119,Yes,5
360 | 63,0.4,8.3,II,129,139,Yes,2
361 | 63,-2.3,5.2,II,105,159,Yes,3
362 | 63,-0.7,7.2,I,115,115,No,4
363 | 63,-1.2,5.9,I,114,225,Yes,4
364 | 64,-1,6.8,I,112,124,No,3
365 | 64,0.3,7.5,II,131,217,No,3
366 | 64,-0.5,6.8,I,122,223,Yes,4
367 | 64,-0.2,6.8,I,118,217,Yes,5
368 | 64,0.4,7.7,II,127,128,Yes,5
369 | 64,0.3,8,II,128,119,Yes,5
370 | 64,-1,6.1,I,117,177,No,2
371 | 65,1.9,9,II,142,179,Yes,3
372 | 65,-0.5,7.4,II,116,230,No,3
373 | 65,1.4,8.7,II,142,157,No,3
374 | 65,0.7,8.1,I,132,192,Yes,1
375 | 65,-0.1,7.4,I,122,111,Yes,2
376 | 65,0.1,7.4,I,130,128,Yes,1
377 | 66,0.5,7.9,II,130,152,Yes,5
378 | 66,0,7.6,I,129,161,No,1
379 | 66,1.5,8.6,I,145,175,Yes,4
380 | 66,-0.1,6.9,II,123,179,No,3
381 | 67,-0.6,7.2,II,123,144,No,2
382 | 67,0.7,8,I,128,186,Yes,4
383 | 67,2.4,10,II,147,232,Yes,2
384 | 67,0.1,7.3,II,122,155,Yes,5
385 | 67,-0.5,6.7,II,118,135,Yes,2
386 | 67,1.6,9.1,I,136,126,No,3
387 | 68,-0.9,6.6,I,111,157,Yes,1
388 | 68,0.6,7.9,I,128,190,No,2
389 | 68,-0.3,7.5,II,125,139,No,4
390 | 68,1.8,9.8,I,146,221,Yes,1
391 | 68,1.1,8.5,I,132,175,Yes,4
392 | 68,0.2,8,II,131,206,Yes,5
393 | 69,0.2,7.3,II,130,177,Yes,4
394 | 69,1.2,8.6,I,142,122,No,2
395 | 69,-1.4,6.1,II,110,157,Yes,1
396 | 69,0.1,7.8,II,126,137,No,2
397 | 69,-1.3,5.7,II,113,199,No,5
398 | 69,1,8.3,I,136,183,Yes,3
399 | 69,1.2,8.4,I,133,113,Yes,4
400 | 69,1.4,8.5,I,144,212,Yes,5
401 | 70,0.5,7.6,II,126,141,Yes,3
402 | 70,0.3,7.3,I,126,230,No,1
403 | 70,0.8,8.6,I,136,190,Yes,5
404 | 70,0.5,8.1,II,130,122,No,4
405 | 70,0,7.5,II,122,126,No,4
406 | 71,1.1,8.6,I,139,177,Yes,2
407 | 71,-0.6,6.9,II,116,212,Yes,3
408 | 71,-0.8,7,II,114,228,Yes,3
409 | 71,0.5,8.5,II,130,135,Yes,3
410 | 71,0.6,8.2,I,129,201,No,4
411 | 71,-1.1,6.2,I,118,206,Yes,4
412 | 71,0.2,8,I,131,168,No,5
413 | 72,-0.3,7.5,II,126,137,No,4
414 | 72,-0.5,6.5,I,118,155,Yes,5
415 | 72,0.2,7.4,I,129,117,Yes,3
416 | 72,0.7,8.5,II,130,188,Yes,2
417 | 72,-0.1,7.8,II,122,197,Yes,5
418 | 72,1,8.5,I,137,115,No,1
419 | 72,-1.6,5.4,II,106,152,Yes,2
420 | 72,-1.4,5.9,II,109,124,No,1
421 | 72,-0.5,6.7,II,122,210,No,1
422 | 73,-0.6,7.4,I,120,122,Yes,1
423 | 73,-1,6.9,I,113,139,Yes,5
424 | 73,-1.6,5.7,I,107,133,Yes,4
425 | 73,0.7,8.4,II,132,217,Yes,3
426 | 73,0.3,7.6,I,132,117,No,5
427 | 73,-2,5.9,I,103,170,Yes,4
428 | 73,1.3,8.5,II,143,194,No,4
429 | 73,-0.3,7.1,II,117,221,No,1
430 | 73,0.7,8.3,I,135,157,Yes,4
431 | 73,-0.4,7,II,121,217,Yes,3
432 | 73,0.6,7.7,II,133,217,Yes,3
433 | 74,-1,6.3,II,114,230,Yes,4
434 | 74,-1.5,5.7,I,106,181,No,5
435 | 74,0.5,7.6,I,134,144,Yes,2
436 | 74,-0.2,7.5,I,121,214,No,1
437 | 74,1.1,8.8,II,132,190,No,1
438 | 74,0.9,8.6,I,130,206,Yes,4
439 | 74,1.1,8.3,I,134,188,Yes,3
440 | 74,-1.7,5.6,II,109,232,Yes,1
441 | 75,0.7,8.2,II,137,157,Yes,2
442 | 75,0.1,8.1,I,126,223,No,5
443 | 75,-0.7,7.1,II,116,152,Yes,3
444 | 75,-1.8,5.9,II,108,141,No,2
445 | 75,0.5,7.7,II,134,111,No,4
446 | 75,1.2,8.5,I,137,128,Yes,1
447 | 75,1.7,9.2,I,140,179,Yes,2
448 | 75,-0.9,6.8,I,119,139,Yes,5
449 | 76,-2.4,4.7,II,98,119,Yes,1
450 | 76,0.2,7.4,I,126,181,No,2
451 | 76,0.5,8.4,I,134,175,Yes,1
452 | 76,0.1,7.7,II,129,146,No,3
453 | 76,0.3,8,II,128,155,No,2
454 | 76,-1.3,5.7,I,112,113,Yes,3
455 | 76,-0.2,7.3,II,125,161,Yes,3
456 | 77,-0.7,6.6,II,122,230,Yes,1
457 | 77,0.6,7.9,II,134,144,Yes,1
458 | 77,-0.3,6.9,I,118,159,No,5
459 | 77,-1,6.8,I,118,221,Yes,5
460 | 78,-0.6,7.3,I,114,148,No,4
461 | 78,-1.8,5.2,II,102,201,No,1
462 | 78,0,7.6,I,130,170,Yes,3
463 | 78,0.3,7.9,I,132,130,Yes,4
464 | 78,-0.5,7.4,II,123,179,Yes,3
465 | 78,-1.8,5.9,II,104,206,Yes,2
466 | 78,-0.6,6.7,I,115,164,No,3
467 | 79,0.3,7.3,II,132,214,Yes,3
468 | 79,-0.2,7.5,II,119,157,No,1
469 | 79,-1.2,6.1,II,118,203,No,5
470 | 79,0.6,7.6,I,130,219,Yes,5
471 | 79,-1,6.7,I,111,137,Yes,2
472 | 80,-0.4,7.4,I,122,113,Yes,2
473 | 80,-2.1,5,II,103,135,Yes,1
474 | 80,-0.1,7,I,124,175,No,2
475 | 80,0.7,8.5,I,128,228,Yes,3
476 | 80,-0.9,6.6,II,112,228,No,1
477 | 81,-0.5,6.7,II,123,217,No,2
478 | 81,0.1,7.2,I,125,181,Yes,4
479 | 81,-1.1,5.9,II,113,188,Yes,2
480 | 81,-1.5,5.6,II,115,225,Yes,2
481 | 81,-0.7,6.6,II,117,192,Yes,3
482 | 82,0.5,8.2,I,132,179,No,1
483 | 82,1,8.8,I,138,157,Yes,2
484 | 82,1.9,9.8,I,140,170,No,5
485 | 82,-0.8,6.4,II,113,186,No,3
486 | 82,-0.6,7.4,I,116,124,Yes,2
487 | 82,0,7.3,I,123,124,Yes,5
488 | 82,-1,6.1,II,112,172,Yes,5
489 | 83,-1,6.8,II,118,155,Yes,1
490 | 83,-2,5.9,I,105,122,No,4
491 | 83,-2.3,5.1,II,97,212,Yes,3
492 | 83,0,7.1,II,124,214,No,4
493 | 83,0.5,8,II,130,203,No,4
494 | 83,-0.8,7,I,120,161,Yes,2
495 | 84,0.5,8.5,I,129,190,Yes,4
496 | 84,-0.4,7.4,I,120,230,Yes,4
497 | 84,0,7.5,II,121,161,Yes,4
498 | 84,-0.6,6.9,I,120,124,No,3
499 | 84,0.4,8.4,I,129,203,Yes,2
500 | 85,0.8,8.3,II,133,137,No,3
501 | 85,0.5,7.9,II,134,172,No,4
502 |
--------------------------------------------------------------------------------
/04 Visualizing data.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "04 Visualizing data"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = TRUE)
18 | setwd(getwd()) # Spreadsheet file and R markdown file in same folder
19 | ```
20 |
21 | 
22 |
23 | ## Introduction
24 |
25 | Summary statistics provide us with a single value, or at least a couple of values, that give us a good impression of our data. It is true, though, that a _picture is worth a thousand words_. Many presentations, posters, and published papers contain plots (graphs, figures). They can convey a large amount of data and provide us with a deeper knowledge of our data.
26 |
27 | Base R can create publication-ready plots. There are also many packages available for creating plots, the most common being the `ggplot2` library. In this lecture, though, we will use the newer `plotly` library. It is built for the modern web and allows us to create plots for interactive content such as web pages, as well as static images ready for submission to your favorite journal.
28 |
29 | The choice of plot depends greatly on the type of data that we wish to visualize. We will start this lecture by looking at plots for categorical data. During the lecture, we will first use the R base plotting tools, but quickly introduce the `plotly` library. Make sure that it is installed using the _Packages_ tab in the bottom-right panel of RStudio.
30 |
31 | ## Plots for categorical data
32 |
33 | #### Creating simulated data
34 |
35 | We start off by creating a nominal categorical variable, containing $100$ data point values, randomly selected from a sample space of four diseases. We do this using the `sample()` function and remember to set the `replace =` argument to `TRUE`. We store these $100$ values as an instance of the vector type and call it `diseases`.
36 |
37 | ```{r Creating a nominal categorical data}
38 | set.seed(1234) # Seeding the pseudo-random number generator
39 | diseases <- sample(c("Pneumonia", "COAD", "Bronchitis", "Asthma"),
40 | 100,
41 | replace = TRUE)
42 | ```
43 |
44 | We can take a look at how many times each of the sample space elements appear in our randomly selected dataset. We use the `table()` function to achieve this.
45 |
46 | ```{r Counting the number of each sample space element}
47 | table(diseases)
48 | ```
49 |
50 | To extract a vector containing only the data point values (the count of each sample space element), we can use the `as.numeric()` function, passing `table(diseases)` as the argument.
51 |
52 | ```{r Extracting only the list of values}
53 | as.numeric(table(diseases))
54 | ```
55 |
56 | If we simply want a list of the sample space elements we use the `names()` function.
57 |
58 | ```{r Extracting the sample space elements}
59 | names(table(diseases))
60 | ```
61 |
62 | Using the `table()` function, extracting the count values with the `as.numeric()` function, and extracting the sample space elements with the `names()` function provides us with all the data required to create a bar chart, ideal for nominal categorical variables.
63 |
64 | ### Simple bar chart
65 |
66 | A bar chart is a great choice to display a count of a nominal categorical variable. The built-in R `barplot()` function creates a simple, austere plot. Look at the code chunk below.
67 |
68 | ```{r Creating a simple bar plot}
69 | barplot(as.numeric(table(diseases)),
70 | names.arg = names(table(diseases)),
71 | main = "Presenting diseases",
72 | xlab = "Pulmonary conditions",
73 | ylab = "Count",
74 | las = 1)
75 | ```
76 |
77 | You will note that we have several arguments. Technically, we really only need the names of each bar and the counts. We use the `as.numeric()` function to extract the count of each of the sample space elements in our dataset. Next up is a keyword argument called `names.arg =`. With this argument we can give a name to each bar, which appears at the bottom of that bar. Once again, we use the `names()` function to extract the data from a table of our `diseases` variable. Note that we initially created an atomic vector with $100$ values. Using the `table()` function allows for the creation of a sample space list and a count.
78 |
79 | The `main =` argument allows us to enter a title for our plot. Since the title is a string, the name is placed in quotation marks. The `xlab =` and `ylab =` keyword arguments similarly allow us to add titles for the `x` and `y` axes. The `las = 1` argument turns the default vertically oriented _y_-axis numbers upright. Leave the keyword argument out to see the difference.
80 |
81 | Now let's recreate the plot using the `plotly` library, just to introduce the library and to show the differences from the built-in bar plot. Plotly is imported using the `library()` function.
82 |
83 | ```{r Importing Plotly, message=FALSE, warning=FALSE}
84 | library(plotly)
85 | ```
86 |
87 | Take a look at the code chunk below. Our aim is to recreate the plot above.
88 |
89 | ```{r Plotly bar chart}
90 | p1 <- plot_ly(x = names(table(diseases)), # A list of the elements of the categorical variable
91 | y = as.numeric(table(diseases)), # The list of counts
92 | name = "Presenting diseases", # A title for the plot
93 | type = "bar") # The plot type
94 | p1
95 | ```
96 |
97 | Running this code in RStudio creates an interactive plot. Hover over the elements to see more information pop up. There is also a toolbar that appears above the plot (top-right) that allows us to save a static image to our computer drive and perform a number of other convenient tasks, such as zooming and panning.
98 |
99 | We can add a title and axes labels using the `layout()` function. To do this, we use the pipe symbol. With the `layout()` function we can add a main title using the `title =` argument. The `xaxis =` and `yaxis =` arguments can contain lists of arguments using the `list()` function. Below we add a `title =` list argument and in the case of the _y_-axis, we also turn off the default dark black line that runs across the _x_-axis at $y=0$.
100 |
101 | ```{r Bar chart with title and axes labels}
102 | p2 <- plot_ly(x = names(table(diseases)),
103 | y = as.numeric(table(diseases)),
104 | name = "Diseases",
105 | type = "bar") %>%
106 | layout(title = "Presenting diseases",
107 | xaxis = list(title = "Pulmonary conditions"),
108 | yaxis = list(title = "Count",
109 | zeroline = FALSE)) # Turns off the thick black line on the x-axis
110 | p2
111 | ```
112 |
113 | Our plot is now similar to the original, but interactive, with a lot more features.
114 |
115 | #### Importing a spreadsheet file
116 |
117 | Instead of the random dataset that we created above, let's import a spreadsheet file and use the `dplyr` library to extract the data that we require.
118 |
119 | ```{r Importing the readr and dplyr libraries, message=FALSE, warning=FALSE}
120 | library(readr)
121 | library(dplyr)
122 | ```
123 |
124 | ```{r Importing the ProjectData spreadsheet file, message=FALSE}
125 | df <- read_csv("ProjectData.csv")
126 | ```
127 |
128 | There are eight variables with $500$ records (rows).
129 |
130 | ```{r Dimension of the dataset}
131 | dim(df)
132 | ```
133 |
134 | To extract the names of the eight variables, we use the `names()` function.
135 |
136 | ```{r Statistical variables}
137 | names(df)
138 | ```
139 |
140 | The `Group` variable represents the group into which each patient was randomized, with the sample space being `I` and `II`. We can imagine this being a trial, with patients in group I receiving a placebo drug (the control group) and patients in group II receiving the experimental drug.
141 |
142 | Let's use the `dplyr` library to count the number of each group.
143 |
144 | ```{r Counting the number of patients in each group}
145 | df %>% group_by(Group) %>%
146 | summarise(group_count = n())
147 | ```
148 |
149 | We can save this information as a small tibble.
150 |
151 | ```{r Saving the group counts}
152 | group_size <- df %>% group_by(Group) %>%
153 | summarise(group_count = n())
154 | ```
155 |
156 | We note $251$ patients in group I and $249$ patients in group II. Since this is a nominal categorical variable, we can use a bar chart to represent the data.
157 |
158 | ```{r Bar plot of groups}
159 | p3 <- plot_ly(x = group_size$Group,
160 | y = group_size$group_count,
161 | name = "Group count",
162 | type = "bar") %>%
163 | layout(title = "Number of patients in each group",
164 | xaxis = list(title = "Group"),
165 | yaxis = list(title = "Count",
166 | zeroline = FALSE))
167 | p3
168 | ```
169 |
170 | As an exercise, let's look at the `Survey` variable. In this simulated experiment it is meant to represent a choice given by a patient in answer to a survey question. The sample space as entered into the spreadsheet is `1,2,3,4,5`. This could represent something like _Totally disagree_, _Disagree_, _Neither agree nor disagree_, and so on. It might also be a pain scale, with `1` being minimal pain and `5` being maximum pain. The point is that it represents an ordinal categorical variable. Let's look at the counts and save them as a tibble.
171 |
172 | ```{r Saving the Survey count}
173 | survey_choice <- df %>% group_by(Survey) %>%
174 | summarise(survey_count = n())
175 | ```
176 |
177 | Now for a bar plot of the data.
178 |
179 | ```{r Bar plot of the Survey variable}
180 | p4 <- plot_ly(x = survey_choice$Survey,
181 | y = survey_choice$survey_count,
182 | name = "Survey count",
183 | type = "bar") %>%
184 | layout(title = "Number of each choice selected",
185 | xaxis = list(title = "Answer"),
186 | yaxis = list(title = "Count",
187 | zeroline = FALSE))
188 | p4
189 | ```
190 |
191 | ## Plots for numerical data
192 |
193 | ### Histograms
194 |
195 | Histograms are similar to bar charts in that they give a count of data point values. In this case, though, the variable under consideration is numerical. To create counts, we often _bin the values_ of the numerical variable. This is a fairly easy concept to understand. Imagine our sample space for a numerical variable ranges from $1$ to $10$. Imagine also that decimal values are allowed. Depending on the size of the data and the number of decimal places in each observation, we might have only one or two instances of the same value. We can remedy this situation by partitioning the values, i.e. all values less than two, then all values greater than or equal to two but smaller than four, then all values greater than or equal to four but less than six, and so on. We can then count the number of instances in each bin and display this as a histogram.
196 |
197 | Let's consider the `Age` variable in our spreadsheet. Here are the minimum and maximum ages.
198 |
199 | ```{r}
200 | df %>% select(Age) %>%
201 | summarise(min_age = min(Age),
202 | max_age = max(Age))
203 | ```
204 |
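To make the idea of binning concrete, here is a minimal sketch (using the inbuilt `cut()` and `table()` functions, and assuming the `df` tibble imported above) that partitions the ages into five equally wide bins and counts the patients in each.

```{r Manually binning the Age variable}
# A rough illustration of binning: cut the ages into five equally wide
# intervals and count how many patients fall into each bin
age_bins <- cut(df$Age, breaks = 5)
table(age_bins)
```

The `plotly` histogram below does this kind of binning for us automatically.
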
205 | ```{r}
206 | p5 <- plot_ly() %>%
207 | add_histogram(x = df$Age,
208 | name = "Age") %>%
209 | layout(title = "Age distribution",
210 | xaxis = list(title = "Age"),
211 | yaxis = list(title = "Counts",
212 | zeroline = FALSE))
213 | p5
214 | ```
215 |
216 | When we run this code chunk in RStudio, we get a histogram. Note that it differs from a bar plot in that the _bars_ are next to each other, representing the fact that the data is continuous numerical and not categorical. Hovering over each bar also shows the bin limits and how many patients fell into that particular bin.
217 |
218 | Instead of absolute counts, we can also use the `histnorm =` argument to give a relative frequency count (fraction of $1.0$).
219 |
220 | ```{r Histogram with relative frequency}
221 | p5 <- plot_ly() %>%
222 | add_histogram(x = df$Age,
223 | name = "Age",
224 | histnorm = "probability") %>%
225 | layout(title = "Age distribution",
226 | xaxis = list(title = "Age"),
227 | yaxis = list(title = "Relative frequency",
228 | zeroline = FALSE))
229 | p5
230 | ```
231 |
232 | There are many ways to _code_ for the same result. This can be confusing at times, but it also allows for great flexibility. The code chunk below is an example of recreating our original absolute-count histogram, but with different syntax. Take a look, noting the tilde symbol and the use of the `type =` argument as opposed to the `add_histogram()` function we used before.
233 |
234 | ```{r}
235 | p6 <- plot_ly(df, x = ~Age, type = "histogram") %>%
236 | layout(title = "Age distribution",
237 | xaxis = list(title = "Age"),
238 | yaxis = list(title = "Count",
239 | zeroline = FALSE))
240 | p6
241 | ```
242 |
243 | ### Box-and-whisker plot
244 |
245 | The ubiquitous box-and-whisker plot portrays the spread of a numerical variable. It is often used in conjunction with a split in the data along a categorical variable. In the case of our dataset, we might consider the `Age` variable for patients in groups I and II.
246 |
247 | The center box indicates three values. The middle line is the median, and the lower and upper bounds are the first and third quartile values. By default, the whiskers extend to $1.5$ times the interquartile range below the first quartile and above the third quartile. These values are called the _lower_ and the _upper fences_. Values beyond them are considered statistical outliers. If there are no outliers, the whiskers indicate the minimum and maximum values for the variable.
248 |
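As a quick check of these definitions (a sketch, assuming the `df` tibble imported above), we can calculate the quartiles and fences for the `Age` variable ourselves.

```{r Calculating the quartiles and fences for Age}
# First and third quartiles and the interquartile range (IQR) of Age
q1 <- unname(quantile(df$Age, 0.25))
q3 <- unname(quantile(df$Age, 0.75))
iqr <- IQR(df$Age)
# The fences sit 1.5 times the IQR below and above the quartiles
c(lower_fence = q1 - 1.5 * iqr, upper_fence = q3 + 1.5 * iqr)
```
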
249 | Let's take a look at the age distribution of each group of patients.
250 |
251 | ```{r Box plot of age of each group}
252 | p7 <- plot_ly(df,
253 | color = ~Group,
254 | y = ~Age,
255 | type = "box",
256 | colors = "BrBG",
257 | boxpoints = "all",
258 | jitter = 0.3) %>%
259 | layout(title = "Age distribution of each group",
260 | xaxis = list(title = "Group"),
261 | yaxis = list(title = "Age",
262 | zeroline = FALSE))
263 | p7
264 | ```
265 |
266 | The boxes, whiskers, and markers can be colored individually. To do this, though, we need to create individual lists of values from our data. In the code chunk below, we prepare a monochrome plot that is suitable for submission to a journal. We create separate lists for the ages of each group and create the boxes individually.
267 |
268 | ```{r Creating age lists for each box}
269 | age_I <- df %>% filter(Group == "I") %>% select(Age)
270 | age_II <- df %>% filter(Group == "II") %>% select(Age)
271 | ```
272 |
273 | ```{r Monochrome box plot of ages for each group}
274 | p9 <- plot_ly() %>%
275 | add_boxplot(y = age_I$Age,
276 | fillcolor = "rgb(100, 100, 100)", # Dark grey
277 | line = list(color = "rgb(20, 20, 20)"), # Very dark grey
278 | name = "Group I") %>%
279 | add_boxplot(y = age_II$Age,
280 | fillcolor = "rgb(170, 170, 170)", # Lighter grey
281 | line = list(color = "rgb(20, 20, 20)"),
282 | name = "Group II") %>%
283 | layout(title = "Age distribution of each group",
284 | xaxis = list(title = "Group"),
285 | yaxis = list(title = "Age",
286 | zeroline = FALSE))
287 | p9
288 | ```
289 |
290 | The `CRP` variable contains some outliers. There are many arguments to manipulate the look of these markers. Below we create tibbles for `CRP` values for each of the groups and show some of the arguments. Most notably we change the shape to diamonds and we color the background of these outliers the same as the boxes.
291 |
292 | ```{r Creating crp lists for each box}
293 | crp_I <- df %>% filter(Group == "I") %>% select(CRP)
294 | crp_II <- df %>% filter(Group == "II") %>% select(CRP)
295 | ```
296 |
297 | ```{r Monochrome box plot of crp for each group}
298 | p10 <- plot_ly() %>%
299 | add_boxplot(y = crp_I$CRP,
300 | fillcolor = "rgb(100, 100, 100)", # Dark grey
301 | line = list(color = "rgb(20, 20, 20)"), # Very dark grey
302 | marker = list(symbol = "diamond",
303 | size = 10,
304 | color = "rgb(100, 100, 100)",
305 | line = list(color = "rgb(20, 20, 20)",
306 | width = 2)),
307 | name = "Group I") %>%
308 | add_boxplot(y = crp_II$CRP,
309 | fillcolor = "rgb(170, 170, 170)", # Lighter grey
310 | line = list(color = "rgb(20, 20, 20)"),
311 | marker = list(symbol = "diamond",
312 | size = 10,
313 | color = "rgb(170, 170, 170)",
314 | line = list(color = "rgb(20, 20, 20)",
315 | width = 2)),
316 | name = "Group II") %>%
317 | layout(title = "CRP distribution of each group",
318 | xaxis = list(title = "Group"),
319 | yaxis = list(title = "c-reactive protein",
320 | zeroline = FALSE))
321 | p10
322 | ```
323 |
324 |
325 | ### Scatter plots
326 |
327 |
--------------------------------------------------------------------------------
/05_Sampling_distributions.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sampling distributions"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = TRUE)
18 | setwd(getwd())
19 | ```
20 |
21 | 
22 |
23 | ## Preamble
24 |
25 | In statistics our aim is to discover information about a population or differences between populations. Populations are enormous and in most cases, we cannot get data from each subject in a population. Instead, we pick representative samples from a population or populations and describe or compare those subjects only. This saves time and money and makes discovery of the information we seek possible. Inferential statistics is all about the analysis of our sample data and inferring those results onto the populations.
26 |
27 | Just as data point values from samples and populations come in patterns known as distributions, so do statistical values such as the mean, differences in means, standard deviations, and so on. These distributions are known as _sampling distributions_.
28 |
29 | Sampling distributions form the basis of inferential statistics. They allow us to express common statistical entities such as confidence levels and _p_ values. To see this connection, we need to look at the incredible _Central Limit Theorem_ (CLT).
30 |
31 | ## The central limit theorem
32 |
33 | The CLT is one of the most beautiful theorems in all of statistics. In this section, we will create a population, draw samples from it, and in the process develop a deep and meaningful understanding of inferential statistics.
34 |
35 | Let's consider then a population of $2000$ subjects. This is a small population, but to keep our calculations simple, it will suffice. The `runif()` function allows us to choose random data point values between a stated minimum and maximum, following a uniform distribution. Each value in the domain has an equal chance of being selected at every turn during the process of selecting our $2000$ values. We can draw a histogram to show the distribution of our population variable.
36 |
37 | ```{r Two-thousand values from a discrete uniform distribution}
38 | set.seed(123) # Seed the pseudo-random number generator for reproducible results
39 | population <- runif(2000,
40 | min = 11,
41 |                     max = 20) # Choose 2000 random values between 11 and 20, each equally likely
42 | # Create a histogram with counts shown above each bin
43 | hist(population,
44 | breaks = 10,
45 | main = "Population variable",
46 | xlab = "Variable",
47 | ylab = "Count",
48 | labels = TRUE,
49 | xlim = c(11, 20),
50 | ylim = c(0, 250))
51 | ```
52 |
53 | The histogram shows that the counts in each of the bins are roughly equal, as we would expect from a uniform distribution.
54 |
55 | Now we consider doing some research on our population. We select $30$ subjects at random and calculate the mean of the variable data point values.
56 |
57 | ```{r Our first experiment}
58 | set.seed(123)
59 | mean(sample(population,
60 | size = 30,
61 | replace = TRUE)) # Calculate the mean of 30 random subjects
62 | ```
63 |
64 | We note a mean of $15.9$. Consider the following, though: _What if we started our experiment a week later_? We would get a different random sample of $30$ subjects. We can simulate this by seeding the pseudo-random number generator with a different value. (Remember that we are only using the `set.seed()` function to get reproducible results and by omitting it, we would get different values each time we run the code.)
65 |
66 | ```{r Repeating our experiment}
67 | set.seed(321)
68 | mean(sample(population,
69 | size = 30,
70 | replace = TRUE))
71 | ```
72 |
73 | Now we have a mean of $15.6$. What if we ran our experiment $10$ times, recording the mean each time? Well, we would have $10$ _random values_ of a statistical variable. Yes indeed, a statistic such as the mean can be a variable!
74 |
75 | In the code chunk below, we use a `for` loop to run our experiment $10$ times, appending the mean of each run of $30$ subjects to an initially empty computer variable called `sample_10`. Then we create a histogram of the $10$ values. The actual R code is not of interest here and you need not follow it. Rather, focus on understanding the statistical concepts.
76 |
77 | ```{r Repeat study 10 times}
78 | sample_10 <- c() # Creating an empty vector
79 | set.seed(123)
80 | for (i in 1:10){ # Looping though the experiment 10 times
81 | sample_10 <- append(sample_10, mean(sample(population, size = 30, replace = TRUE)))}
82 | hist(sample_10,
83 | main = "Ten repeat studies",
84 | xlab = "Mean",
85 | ylab = "Count")
86 | ```
87 |
88 | Suddenly the distribution does not seem uniform! Let's repeat our experiment $100$ times. Each time we select $30$ random subjects and calculate a mean, adding it to the vector object `sample_100`.
89 |
90 | ```{r Repeat study 100 times}
91 | sample_100 <- c()
92 | set.seed(123)
93 | for (i in 1:100){
94 | sample_100 <- append(sample_100, mean(sample(population, size = 30, replace = TRUE)))}
95 | hist(sample_100,
96 | main = "One-hundred repeat studies",
97 | xlab = "Mean",
98 | ylab = "Count")
99 | ```
100 |
101 | Now this is interesting. Here goes a thousand repeats.
102 |
103 | ```{r Repeat study 1000 times}
104 | sample_1000 <- c()
105 | set.seed(123)
106 | for (i in 1:1000){
107 | sample_1000 <- append(sample_1000, mean(sample(population, size = 30, replace = TRUE)))}
108 | hist(sample_1000,
109 | main = "One-thousand repeat studies",
110 | xlab = "Mean",
111 | ylab = "Count",
112 | ylim = c(0, 400),
113 | labels = TRUE)
114 | ```
115 |
116 | What we have here is the CLT in action. A statistic (the mean in this example) tends to a normal distribution as we increase the number of times we randomly calculate it! This is profound. From our $1000$ repeats, we see that by chance we would only see a mean of $17$ or more `r 1 / 1000 *100`% of the time, or a mean of less than $14.5$ `r 21 / 1000 *100`% of the time. If we express these as fractions, they would be `r 1 / 1000` and `r 21 / 1000`. Letting the cat out of the bag, these are probabilities, leading us to _p_ values.
117 |
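Those fractions can also be checked directly from the `sample_1000` vector (a quick sketch; the exact counts depend on the seed used above).

```{r Counting extreme sample means}
sum(sample_1000 >= 17)    # number of the 1000 sample means of 17 or more
sum(sample_1000 < 14.5)   # number of the 1000 sample means below 14.5
mean(sample_1000 >= 17)   # the same counts expressed as fractions
mean(sample_1000 < 14.5)
```
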
118 | In reality, we cannot repeat an experiment $1000$ times. We would do it just once, i.e. getting one particular mean with a specific probability. That once could be any one of the $1000$ experiments (or more) that we created above.
119 |
120 | ## The deity experiment
121 |
122 | Let's take the big step and put all of this in perspective. This will allow a thorough understanding of inferential statistics.
123 |
124 | Below, we simulate a new research project. We set it up, though, such that we know all the values for a specific variable. In this experiment, there are only $8000$ subjects in the whole population. Imagine some rare disease, or a laboratory experiment in which there are only $8000$ genetically specific individuals.
125 |
126 | In this experiment, we take two roles. In the first, we are a researcher. We think that the $8000$ subjects are distinct for a specific variable and actually make up two populations. We want to know if there is a statistically significant difference between the two populations for the variable under consideration.
127 |
128 | We also take the role of an imaginary deity. This deity knows that there is no difference between the two populations. We simulate this by creating two vectors, each containing $4000$ random values drawn with the same parameters.
129 |
130 | ```{r Creating two similar populations}
131 | set.seed(123)
132 | group_A_population <- sample(30:70,
133 | size = 4000,
134 | replace = TRUE)
135 | group_B_population <- sample(30:70,
136 | size = 4000,
137 | replace = TRUE)
138 | ```
139 |
140 | Now we run an experiment $1,000$ times. Each time we take a random $30$ subjects from each population and calculate the mean of our variable for each group. We subtract the two means, so as to capture $1,000$ differences in means. We will then have a sampling distribution of $1,000$ differences in means. As you can guess by now, the CLT guarantees a normal distribution.
141 |
142 | ```{r Drawing random samples 10000 times}
143 | set.seed(123)
144 | sample_1000 <- c()
145 | for (i in 1:1000){
146 | sample_1000 <- append(sample_1000,
147 | mean(sample(group_A_population, size = 30, replace = TRUE)) -
148 | mean(sample(group_B_population, size = 30, replace = TRUE)))
149 | }
150 | hist(sample_1000,
151 | main = "Deity experiment",
152 | xlab = "Difference between means",
153 | ylab = "Count",
154 | ylim = c(0, 300),
155 | labels = TRUE)
156 | ```
157 |
158 | This is, once again, profound! Remember that in real life, we would only do our research experiment once. Our difference in means would be one of the $1,000$ (and more) above. Theoretically, the CLT describes what happens as the number of experiments approaches infinity, but $1,000$ repeats already bring the point home.
159 |
160 | Imagine then that our one experiment was one that found a difference of $9.5$. From the approximation, we can deduce that this would only occur about `r 5 / 1000` of the time. We would express this as a statistically significant result and claim that there is a difference between the two populations. Remember, though, that as deity we set this up so that there never was a difference. This, though, is how inferential statistics works. It is based on calculations given that there is no difference between populations. As humans, we decide on an arbitrary cut-off, say $0.05$, beyond which we claim a difference. There might very well be a difference, but that is not the way inferential statistics works. It is based on theoretical sampling distributions assuming NO DIFFERENCE.
161 |
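We can verify that fraction directly from our simulated sampling distribution (a sketch; here `sample_1000` holds the $1,000$ differences in means from the chunk above).

```{r Counting extreme differences in means}
sum(abs(sample_1000) >= 9.5)  # differences in means of at least 9.5 in size
mean(abs(sample_1000) >= 9.5) # expressed as a fraction of the 1000 experiments
```
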
162 | From our deity experiment it was easy to recreate a study many, many times over. In real life, we only get to do our study once (with others perhaps replicating it later). How then do we know what the histogram above should look like, given that we do not have access to all the repeat studies?
163 |
164 | First of all, it is important to note that in mathematical statistics, we deal with a smooth curve and not a frequency distribution (histogram) as above. There are actual equations that draw smooth curves given our data. These equations take only information from our one study to draw these curves, called distribution curves. Statisticians have developed fantastical equations for any number of sampling distributions.
165 |
166 | ## The _z_ distribution
167 |
168 | This is the quintessential distribution and is, in fact, the normal distribution. For its use as a sampling distribution, i.e. to investigate the differences in means for a numerical variable between two groups, we need to know the population standard deviation for this difference.
169 |
170 | Imagine, then, that we know it to be $1$. With a mean of $0$, this results in the standard normal distribution shown in the plot below.
171 |
172 | ```{r Normal distribution}
173 | curve(dnorm(x,
174 | 0,
175 |             1),
176 | from = -3, to = 3,
177 | lwd = 2,
178 | main = "Normal distribution",
179 | ylab = "Distribution",
180 | xlab = "z statistic")
181 | ```
182 |
183 |
184 | Any given difference must be converted to a _z_ score (also known as a _z_ statistic). This is shown in equation (1).
185 |
186 | $$ z = \frac{x - \mu}{\sigma} \tag{1}$$
187 |
188 | If we ran an experiment and found a difference in means of $-1.2$, we can convert this into a _z_ score using equation (1).
189 |
190 | ```{r Calculating a z score}
191 | z = (-1.2 - 0) / 1
192 | z
193 | ```
194 |
195 | We can show this on our plot of the normal distribution.
196 |
197 | ```{r z distribution with z score of -1.2}
198 | curve(dnorm(x,
199 | 0,
200 | 1),
201 | from = -3, to = 3,
202 | lwd = 2,
203 | main = "z distribution",
204 | ylab = "Distribution",
205 | xlab = "z score")
206 | abline(v = -1.2,
207 | col = "red")
208 | ```
209 |
210 | This is _how many standard deviations_ the result is from the mean. Our question then revolves around how likely it was to find such a difference, i.e. the area under the curve to the left of the vertical line indicating $-1.2$ standard deviations away from the mean (or a _z_ score of $-1.2$). We can use the `pnorm()` function to calculate this.
211 |
212 | ```{r p value for our difference}
213 | pnorm(-1.2, # Difference in means
214 | 0, # Mean
215 | 1) # Standard deviation
216 | ```
217 |
218 | We note an $11.5$% chance of a difference of $-1.2$ or less.
219 |
220 | For an example where we know that the standard deviation of the difference in means is $3$, we can ask what the probability is of a difference of $4.5$ or more. Remember that we can have $4.5$ or $-4.5$, depending on which group we subtract from which.
221 |
222 | ```{r z score for the new problem}
223 | z = (4.5 - 0) / 3
224 | z
225 | ```
226 |
227 | Since our research question involves only a difference (a two-tailed alternative hypothesis - see the next chapter), we need to mirror this _z_ score on the other side of the bell-shaped curve as well, i.e. at $-1.5$. We also multiply the result by $2$ to get a _p_ value for our research question (see the following plot).
228 |
229 | ```{r p value for our new difference}
230 | pnorm(-1.5, # The z score for our difference in means
231 |       0, # Mean
232 |       1) * 2 # Standard deviation of 1; multiplied by 2 for the two-tailed hypothesis
233 | ```
234 |
235 | We visualize this in the plot below.
236 |
237 | ```{r z score for a two tailed hypothesis}
238 | curve(dnorm(x,
239 | 0,
240 |             1),
241 |       from = -3, to = 3,
242 | lwd = 2,
243 | main = "z distribution",
244 | ylab = "Distribution",
245 | xlab = "z score")
246 | abline(v = -1.5,
247 | col = "red")
248 | abline(v = 1.5,
249 | col = "red")
250 | ```
251 |
252 | The plot is very informative. For a known difference in population means of $0$, with a standard deviation of this difference of $3$, we see that a difference-in-means finding of $-4.5$ results in a _z_ score of $-1.5$ (i.e. $1.5$ standard deviations below the mean). Since we are interested in a difference at least this _large_ in either direction, we duplicate the line on the right-hand side, i.e. at a _z_ score of $1.5$. The probability of finding a difference in means of at least $4.5$ (positive or negative) is about $0.13$, which is the _p_ value we calculated above. This is the area under the curve to the left of the $-1.5$ line added to the area under the curve to the right of the $1.5$ line.
253 |
254 | ## The _t_ distribution
255 |
256 | One of the most common sampling distributions is the _t_ distribution. If we knew the standard deviation for a variable in a whole population (the standard deviation of the difference in means for two groups, for instance), we could use the _z_ distribution, as mentioned.
257 |
258 | Since we hardly ever know the standard deviation for a variable in a whole population, we make do with the _t_ distribution. This sampling distribution requires only one parameter from our data and that is the sample size.
259 |
260 | We use the _t_ distribution when comparing a numerical variable between two groups. It forms the basis for the famous Student's _t_ test, developed by William Gosset while working at the Guinness brewery. The parameter we take from our sample is the sample size minus the number of groups. This is called the _degrees of freedom_ (DOF). Since we are comparing two groups, with sample sizes $n_1$ and $n_2$, we will have a DOF value of $n_1 + n_2 - 2$.
261 |
262 | The _t_ distribution appears very much bell-shaped. For larger sample sizes, this distribution actually approximates the _z_ distribution.
263 |
264 | In the graph below, we see a dark line representing a DOF value of $30$. Considering the mean at $0$, we see successive curves for DOF values of $1, 2, 3, 5$, and $10$ (from the bottom up).
265 |
266 | ```{r The t distribution}
267 | curve(dt(x, df = 30),
268 | from = -3, to = 3,
269 | lwd = 3,
270 | ylab = "Distribution",
271 | xlab = "t statistic")
272 | ind <- c(1, 2, 3, 5, 10)
273 | for (i in ind) curve(dt(x, df = i), -3, 3, add = TRUE)
274 | ```
275 |
276 | ## The $\chi^2$ distribution
277 |
278 | Another common sampling distribution is the $\chi^2$ distribution. It also takes one parameter from our sample, being the DOF. It is calculated a bit differently, though.
279 |
280 | We commonly use the $\chi^2$ distribution when dealing with categorical variables. In the graph below, we see a dark line representing a DOF value of $1$. Successive curves have DOF values of $2, 3, 5$, and $10$.
281 |
282 | ```{r The chi-squared distribution}
283 | curve(dchisq(x,
284 | df = 1),
285 | from = 0,
286 | to = 10,
287 | lwd = 3,
288 | ylab = "Distribution",
289 | xlab = "Statistic")
290 | ind <- c(2, 3, 5, 10)
291 | for (i in ind) curve(dchisq(x,
292 | df = i),
293 | 0,
294 | 10,
295 | add = TRUE)
296 | ```
297 |
298 | ## Conclusion
299 |
300 | Sampling distributions are the bedrock of inferential statistics and we use them to calculate our _p_ values for reporting in a journal article.
301 |
302 | We will see these sampling distributions in action when we conduct _t_ and $\chi^2$ tests. Their meaning will become crystal clear then.
--------------------------------------------------------------------------------
/06_A_parametric_comparison_of_means.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "A parametric comparison of means"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | 
17 |
18 | ```{r setup, include=FALSE}
19 | knitr::opts_chunk$set(echo = TRUE)
20 | setwd(getwd())
21 | ```
22 |
23 | ## Preamble
24 |
25 | The time has come for us to do some real inferential statistics. In this tutorial, we will look at comparing the means for a numerical variable between two or more groups. It is important to understand that the groups are made up by splitting the subjects of a research project according to the sample space of one of the categorical variables in the dataset. Note that this _splitting into groups_ can also be done by a numerical variable, but this variable will have to be converted to a categorical variable. An example might be _age_. We can group all patients younger than $60$ years of age as belonging to a group called _young_ and everyone else as belonging to a group called _not young_. This necessitates a new variable in which the sample space will be these two new data point values.
26 |
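As a minimal sketch of that kind of recoding (using a small made-up vector of ages, not our dataset):

```{r Recoding a numerical variable into groups}
ages <- c(45, 62, 58, 71, 39) # A few made-up ages
age_group <- ifelse(ages < 60, "young", "not young") # New categorical variable
table(age_group)
```
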
27 | We will stick to the more common case of using one of the categorical variables to create two or more groups. It is the difference in means for a numerical variable that is our interest in this chapter.
28 |
29 | The most common test to compare the means for a numerical variable between two groups is Student's _t_ test. When comparing more than two groups, the most common test is analysis of variance (ANOVA).
30 |
31 | All the tests in this chapter are examples of parametric tests. Parametric tests are based on assumptions made about the data. These assumptions must be met before we can actually use these tests. Fortunately, this is so in most cases. In the chapter on non-parametric tests, we will actually look at these assumptions and learn about specific statistical tests associated with them. If those tests show that the assumptions do not hold, we cannot use the common parametric tests shown in this chapter. Instead, we use the mentioned non-parametric tests.
32 |
33 | ## Student's _t_ test
34 |
35 | As the quintessential example of a parametric comparison of two means, we will conduct Student's _t_ test. This test compares the means of a numerical variable between two groups. It is one of the most common tests in inferential statistics.
36 |
37 | Let's then consider our original dataset that we have been using all along. We used the `setwd(getwd())` statement at the beginning of this R markdown file and saved it in the same directory or folder as the spreadsheet file that we will use.
38 |
39 | ```{r Importing libraries, message=FALSE, warning=FALSE}
40 | library(dplyr)
41 | library(car) # For Levene's test
42 | ```
43 |
44 | ```{r Importing the data, message=FALSE, warning=FALSE}
45 | df <- read.csv("ProjectData.csv") # Using the built-in read.csv function
46 | ```
47 |
48 | We remember that there are two values in the sample space of the `Group` variable. We can use the `table()` function to count the occurrences of each.
49 |
50 | ```{r Counting the categorical variable Group}
51 | table(df$Group) # Count the occurrences of each of the sample space values
52 | ```
53 |
54 | To conduct Student's _t_ test, we will use the `t.test()` function. We need to pass two vectors containing the numerical variable under consideration as arguments. There are a few ways to go about this. Below, we create two such objects. Let's compare the ages between the two groups. We use appropriately named computer variables and the `filter()` and `select()` functions from the `dplyr` library to do this. The added `pull()` function converts the extracted single-column tibble object to a vector of values.
55 |
56 | ```{r Creating two list objects to hold the ages}
57 | # Selecting only the Age column values but splitting those by the sample space
58 | # of the Group column
59 | group_I_age <- pull(df %>% filter(Group == "I") %>%
60 | select(Age)) # Only patients that have a data point value of I in the Group column
61 | group_II_age <- pull(df %>% filter(Group == "II") %>%
62 | select(Age)) # Only patients that have a data point value of II in the Group column
63 | ```
64 |
65 | Now for the simple _t_ test. We pass the two vectors (lists of age values) as arguments.
66 |
67 | ```{r Student t test comparing ages between the two groups}
68 | t.test(group_I_age, # Age values of patients in group I
69 | group_II_age, # Age values of patients in group II
70 | var.equal = TRUE) # Stating that we are assuming equal variances
71 | ```
72 |
73 | We note a _t statistic_ of $0.557$, a degrees of freedom (DOF) value of $498$, and a _p_ value of $0.58$ (rounded). We also note a confidence interval around the difference in the two means (more on this later) and the two actual means. In between there is a statement about an alternative hypothesis.
74 |
75 | The _p_ value is more than $0.05$ and we would state that there is not a significant difference in age between patients in group I and group II. The two mean values are also easy to interpret.
76 |
77 | We need to spend some time on the other information, though.
78 |
79 | ## Hypothesis testing
80 |
81 | The goal of inferential statistics is to infer the results from a sample to a larger population. To do this, we need to formulate a research question. In our simple case this would be: _"Is there a difference in age between patients in the two groups?"_ This is indeed a proper question, since we can translate it back to the variables for which we have collected data point values in our sample of patients. We have a numerical variable `Age` and a categorical variable `Group`. We create two groups from the latter as it has a sample space of two elements, i.e. `I` and `II`.
82 |
83 | As good scientists we set a __null hypothesis__, which we accept as the truth unless gathered evidence shows the contrary. This null hypothesis always assumes that our research question returns a _not true_ value. In this case: _"No, there is no difference between the ages of the two groups."_ More specifically, we state that the difference in means is $0$.
84 |
85 | Now, we need to specify an __alternative hypothesis__. This is the hypothesis that we __accept__ when evidence shows that we can __reject__ the null hypothesis. This is what happens when we state that there is a _statistically significant difference between the (mean) ages of the two groups_.
86 |
87 | This decision is based on a _p_ value and an arbitrary cut-off, known as the $\alpha$ value. It is customary to set this at $0.05$, such that a _p_ value of less than $0.05$ leads us to reject our null hypothesis and accept our alternative hypothesis. When the _p_ value is not less than $0.05$, we fail to reject our null hypothesis and state that there is no significant difference between the (mean) ages of the two groups.
88 |
89 | ## Degrees of freedom and the _t_ distribution
90 |
91 | The DOF value in a _t_ test is simply the sample size minus two (because we have two groups). Since we had a sample size of $500$, we have a DOF value of $498$.
92 |
93 | In the previous chapter we looked at the _t_ distribution. This is a sampling distribution and _simulates_ all the possible differences in means that we could find if we could do our research over and over again. Since we cannot, the _t_ distribution uses the DOF value to create a theoretical distribution. Let's look at this distribution with a DOF of $498$.
94 |
95 | ```{r}
96 | curve(dt(x, df = 498),
97 | from = -3, to = 3,
98 | lwd = 2,
99 | main = "t distribution for 498 degrees of freedom",
100 | ylab = "Distribution",
101 | xlab = "t statistic")
102 | ```
103 |
104 | The difference in means for our two sets of ages was $0.585$. This is converted to a _t_ statistic with the use of a pooled variance for the whole sample. The _t_ distribution above is a sampling distribution of all possible _t_ statistic values (based only on the DOF value). As an aside, if we knew the standard deviation of the variable in the whole population, we would not need the _t_ distribution, but would instead use the _z_ distribution (from the normal distribution).
105 |
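As a sketch of that conversion (assuming the `group_I_age` and `group_II_age` vectors created above), we can compute the pooled variance and the resulting _t_ statistic by hand; it should match the `t.test()` output.

```{r Computing the t statistic by hand}
n1 <- length(group_I_age)
n2 <- length(group_II_age)
# Pooled variance across the two groups
pooled_var <- ((n1 - 1) * var(group_I_age) + (n2 - 1) * var(group_II_age)) /
  (n1 + n2 - 2)
# Difference in means divided by its standard error
(mean(group_I_age) - mean(group_II_age)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
```
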
106 | The _t_ statistic was $0.557$. Let's place this on our plot.
107 |
108 | ```{r }
109 | curve(dt(x, df = 498),
110 | from = -3, to = 3,
111 | lwd = 2,
112 | main = "t distribution for 498 degrees of freedom and the t statistic",
113 | ylab = "Distribution",
114 | xlab = "t statistic")
115 | abline(v = 0.557)
116 | ```
117 |
118 | Of particular importance here are our null and alternative hypotheses. We stated a null hypothesis of no difference in ages and an alternative hypothesis of a difference. We did not state that group I patients would be older or younger than those in group II. The difference in means would also be either positive or negative based on which mean we subtracted from which. This is termed _a two-tailed alternative hypothesis_ and is the proper standard in hypothesis testing.
119 |
120 | This all means that we have to replicate the _t_ statistic on the other side of the graph.
121 |
122 | ```{r Out two-tailed hypothesis}
123 | curve(dt(x, df = 498),
124 | from = -3, to = 3,
125 | lwd = 2,
126 | main = "t distribution for 498 degrees of freedom and the two-tailed t statistic",
127 | ylab = "Distribution",
128 | xlab = "t statistic")
129 | abline(v = 0.557)
130 | abline(v = -0.557)
131 | ```
132 |
133 | The area under the _t_ distribution curve (for whatever DOF value) is $1$. The areas under the curve from negative infinity to our left-hand side _t_ statistic and from our right-hand side _t_ statistic to positive infinity are calculated. Their sum is the two-tailed _p_ value.
134 |
135 | If we could reasonably convince our audience that one mean would be higher than the other, we could calculate a one-tailed _p_ value. Depending on how we stated this (which group was higher), we would either have the area under the curve to the left or right of the _t_ statistic as our ultimate _p_ value.
136 |
137 | As a bit of fun, we can add two vertical lines that would represent an area of $0.05$ (that is $0.025$ on either side). For $498$ DOF, those _t_ statistics (known as _critical values_) would be about $-1.96$ and $1.96$.
138 |
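Those critical values can be looked up with the `qt()` quantile function (a quick check):

```{r Critical t values from the qt function}
qt(c(0.025, 0.975), df = 498) # t statistics cutting off 2.5% in each tail
```
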
139 | ```{r Indicating the critical t values}
140 | curve(dt(x, df = 498),
141 | from = -3, to = 3,
142 | lwd = 2,
143 | main = "Critical t statistic values for a DOF value of 498",
144 | ylab = "Distribution",
145 | xlab = "t statistic")
146 | abline(v = 0.557)
147 | abline(v = -0.557)
148 | abline(v = -1.96,
149 | col = "red")
150 | abline(v = 1.96,
151 | col = "red")
152 | ```
153 |
154 | It is now clear that our _t_ statistic corresponds to an area under the curve far larger than that of the critical _t_ statistics (shown in red), which represent an $\alpha$ value of $0.05$.
155 |
156 | We can use the `pt()` function to calculate the area under the curve to the left of the left-hand side black line and to the right of the right-hand side black line (representing _t_ statistics of $-0.557$ and $0.557$).
157 |
158 | ```{r Two tailed hypothesis p value of our t statistic}
159 | pt(-0.557, # Use the left-hand side t statistic as the area is calculated from negative infinity
160 | 498) * 2 # Multiply by 2 for two-tailed hypothesis
161 | ```
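
If we had instead stated a one-tailed (right-sided) alternative hypothesis, we would use only a single tail area and not double it. A minimal sketch, using our _t_ statistic of $0.557$:

```{r One-tailed p value sketch}
pt(0.557,
   498,
   lower.tail = FALSE) # Area to the right of the t statistic only; no doubling
```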
162 |
163 | A _t_ statistic of $-1.96$ (or $1.96$) represents an area under the curve of about $0.025$ (each). These are the red lines in the plot above.
164 |
165 | ```{r t statistic for an alpha value of 0.025}
166 | pt(-1.96,
167 | 498) * 2
168 | ```
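
Conversely, the `qt()` function (the quantile function of the _t_ distribution) returns the critical values for chosen tail areas. This is where the approximate $\pm 1.96$ used above comes from.

```{r Critical t values from qt}
qt(c(0.025, 0.975),
   498) # Approximately -1.96 and 1.96
```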
169 |
170 |
171 | ## The confidence interval
172 |
173 | A confidence interval in this setting refers to the difference in means. Remember that we have taken a sample of $500$ subjects from a population. Through hypothesis testing, we actually want to know if these are two different populations (groups I and II might have been chosen such that we suspect some difference, i.e. receiving different treatments).
174 |
175 | Since we are dealing with inferential statistics, we want to infer this difference in means to the underlying populations. Because we only have a sample from each, we know that the sample difference will not be exactly the population difference. We can, however, make an informed estimate of where the population difference in means lies.
176 |
177 | To do this we set a confidence level, typically $95$%. This gives us a symmetric value below and above our difference in means: a confidence interval. This interval does not state that we are $95$% confident that the true difference in means falls between the two values! It means that if we repeated the study $100$ times (each time with different interval values), then in $95$% of them the true population difference would fall between the interval values. We do not know whether our current interval is one of the correct $95$% or the incorrect $5$%.
178 |
179 | To calculate the confidence interval for the difference in means, we construct a _t_ distribution centered on our current difference in means (for the given DOF value) and find the values that cut off $2.5$% of the curve on either side.
180 |
181 | The `t.test()` function does this for us. In our example, comparing the ages of patients in groups I and II, the `t.test()` results show lower and upper bounds for the $95$% confidence interval of $-1.48$ and $2.65$ (rounded). The mean of these two values is $0.58$. This is also the actual difference in means between the two groups, as it should be: the confidence interval falls symmetrically around the mean difference.
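
As a rough check (a sketch only, based on the rounded values reported above rather than the raw data), we can reconstruct this interval by hand. The standard error of the difference is implied by the difference in means and the _t_ statistic.

```{r Reconstructing the confidence interval by hand}
se <- 0.585 / 0.557 # Standard error implied by the difference in means and the t statistic
0.585 + c(-1, 1) * qt(0.975, df = 498) * se # Approximately -1.48 and 2.65
```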
182 |
183 | If we write a report or a paper, we could state: *"The difference in average ages between patients in groups I and II was $0.58$ (95% CI $-1.48$ to $2.65$, p value $0.57$)."*
184 |
185 | We would interpret this as a difference that is not statistically significant, and we also have a range for the difference that we can infer onto our populations. We note, though, that we cannot be sure that this particular study is one of the $95$% of studies we could do (all with different confidence intervals) that contain the actual difference between the two populations.
186 |
187 | ## Comparing the means of two groups with unequal variances
188 |
189 | Student's _t_ test, as used above, assumes a near-equal variance for the numerical variable in the two groups. Let's take a look at the variances in age.
190 |
191 | ```{r Variance in age for our two groups}
192 | df %>% group_by(Group) %>%
193 | summarise(VAR = var(Age))
194 | ```
195 |
196 | For the units in which age is measured (years, and in the case of the variance, $\text{years}^2$), this is not a big difference.
197 |
198 | There is a test that can investigate this, though, and it is Levene's test.
199 |
200 | The `car` library contains the `leveneTest()` function. We can use a formula as below to investigate the differences in variances. The formula uses a tilde symbol and is pretty intuitive.
201 |
202 | ```{r Levene test for equal variances}
203 | leveneTest(df$Age ~ df$Group) # Requires the car library to be loaded, i.e. library(car)
204 | ```
205 |
206 | Since our _p_ value for this test was not less than $0.05$, we do not reject the null hypothesis which is that there is no difference in the variances.
207 |
208 | If there were, we would still use the `t.test()` function, but with the keyword argument and value `var.equal = FALSE`, as sketched below.
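
A minimal sketch of what that call might look like for our age comparison (this is Welch's test, which does not assume equal variances):

```{r Welch test not assuming equal variances}
t.test(df$Age ~ df$Group,
       var.equal = FALSE) # Welch's test; this is also the default for t.test()
```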
209 |
210 | ## Paired samples
211 |
212 | Our dataset does not contain statistical variables that we might use for paired samples. An example would have been the `CRP` variable, had we measured it before and after an intervention. When this is done, there is a dependence between the two sets of values that we could investigate with a _t_ test comparing the means of the before and after values. Be careful, as we are no longer creating two groups by way of a categorical variable. If we did want to compare two groups, we might instead calculate the difference between the before and after values for each subject and compare the means of these differences between the two groups.
213 |
214 | Here, though, we are looking at matched pairs of values. In this case we would still use the `t.test()` function, but with the `paired = TRUE` keyword argument.
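
As a sketch only (the `crp_before` and `crp_after` variables below are hypothetical and do not exist in our dataset), such a call might look like this:

```{r Paired t test sketch, eval=FALSE}
# Hypothetical before and after measurements taken on the same patients
t.test(crp_before,
       crp_after,
       paired = TRUE)
```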
215 |
216 | ## Comparing more than two means
217 |
218 | If we want to compare more than two means, we have to use analysis of variance (ANOVA). This is a fundamental test using the _F_ distribution and is actually the basis for many statistical analyses. Here we will simply use it to compare the means of more than two groups.
219 |
220 | Since we have five elements in the sample space of the `Survey` categorical variable, let's compare the mean ages between all five groups. First, we make sure that the `Survey` statistical variable is seen as categorical. The `factor()` function achieves this. We also specify the actual elements in the order that we please, through the `levels =` argument, and give them descriptive names with the `labels =` argument.
221 |
222 | ```{r Converting Survey to a categorical variable}
223 | df$Survey <- factor(df$Survey,
224 | levels = c(1, 2, 3, 4, 5),
225 | labels = c("Strongly disagree",
226 | "Disagree",
227 | "Neutral",
228 | "Agree",
229 | "Strongly agree"))
230 | ```
231 |
232 | Now we use the `aov()` function to conduct our ANOVA. The tilde symbol is used to build a _formula_. As used below, it states that we wish to compare the `Age` values according to the groups created by the sample space of the `Survey` variable.
233 |
234 | ```{r Using ANOVA to compare the means of the ages of the five groups}
235 | anova_age_survey <- aov(df$Age ~ df$Survey)
236 | summary(anova_age_survey)
237 | ```
238 |
239 | We note a _p_ value of more than $0.05$ and hence we do not reject the null hypothesis, which states that there is no difference in the mean ages of the five groups.
240 |
241 | Only if the _p_ value were less than our chosen $\alpha$ value would we do what is called _post-hoc_ analysis. That is where we try to find out between which of the (more than two) groups, five in this case, the differences actually lie. When the _p_ value is more than the chosen $\alpha$ value, we __DO NOT__ do post-hoc analysis. For the sake of illustration, we will do it here, though.
242 |
243 | There are many post-hoc tests. One of the most often used is _Tukey's test_, which we see in action below.
244 |
245 | ```{r Tukey post hoc test}
246 | TukeyHSD(anova_age_survey)
247 | ```
248 |
249 | We do not find statistically significant differences between any of the pairs. If we had taken the time, as we always should, to investigate the summary statistics and a box-and-whisker plot of this data, it would have been clear that we were unlikely to find statistically significant differences.
250 |
251 | ```{r Summary statistics of ages for each answer group}
252 | df %>% group_by(Survey) %>%
253 | summarise(Ave = mean(Age),
254 | SD = sd(Age),
255 | MIN = min(Age),
256 | MED = median(Age),
257 | MAX = max(Age),
258 | IQR = IQR(Age))
259 | ```
260 |
261 | ```{r Box plot of ages for each answer group}
262 | boxplot(df$Age ~ df$Survey,
263 |         main = "Age distribution for each answer to the survey question",
264 | xlab = "Survey question answer",
265 | ylab = "Age")
266 | ```
267 |
268 | ## Conclusion
269 |
270 | Comparisons of means are some of the most commonly used statistical tests. While they are easy to implement, we will see in the next chapter that they are based on a number of assumptions which must be investigated before they are used.
--------------------------------------------------------------------------------
/04_Inbuilt_R_plots.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "The built in R plots"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
15 |
16 | 
17 |
18 | ```{r setup, include=FALSE}
19 | knitr::opts_chunk$set(echo = TRUE)
20 | setwd(getwd())
21 | ```
22 |
23 | ## Preamble
24 |
25 | Summarizing data through the use of descriptive statistics is greatly enhanced by visualizing the data. Plots, figures, and graphs provide for rich insight into the knowledge hidden in the data. Specific graphs are tailored to specific variable types and it is important to consider this.
26 |
27 | The base R language can create a great number of plots. These can be modified to create publication ready figures. In this chapter we will look at some of the most commonly used plots.
28 |
29 | There is also a rich set of libraries available that expand on the creation of graphs. It is well worthwhile to explore these on your own. They include the ubiquitous `ggplot2` and the interactive `plotly` libraries.
30 |
31 | ## Data import
32 |
33 | We are going to use the `readr` and `dplyr` libraries in this chapter.
34 |
35 | ```{r Library import, message=FALSE, warning=FALSE}
36 | library(readr)
37 | library(dplyr)
38 | ```
39 |
40 | Below, we import the now familiar `ProjectData.csv` file. We will use the data in this spreadsheet to create our plots.
41 |
42 | ```{r Importing the ProjectData spreadsheet file, message=FALSE}
43 | df <- read_csv("ProjectData.csv")
44 | ```
45 |
46 | Just as a reminder, we explore the variables in our dataset using the `names()` function. Pay particular attention to the type of data each variable holds, as this guides which plots we use to visualize it.
47 |
48 | ```{r Variables}
49 | names(df)
50 | ```
51 |
52 | ## Strip plots
53 |
54 | Strip plots are also called dot plots. We will use strip plots to introduce the way in which plots are created in R. They represent the data point values for a variable along a single axis. It is more common for these plots to visualize discrete numerical variables, but they can also be used for continuous numerical variables and for nominal categorical variables that are expressed as numerical values. The latter are better represented by bar plots, though.
55 |
56 | Below, we create a strip plot to visualize the `Age` variable as an example of discrete data (since we only captured integer values).
57 |
58 | ```{r Strip plot of age}
59 | stripchart(df$Age)
60 | ```
61 |
62 | Our first impression is that this is not a very useful plot. We have $500$ data point values and they all seem to be plotted on top of each other. There is also no title and no _x_-axis label. Let's start by adding some jitter to the dots to try and visualize all of them. If there are a large number of data point values, this might not solve the problem.
63 |
64 | ```{r Strip plot with jitter}
65 | stripchart(df$Age,
66 | method = "jitter")
67 | ```
68 | This is slightly better but, to be honest, a box-and-whisker plot would probably be the default plot for numerical data such as this. Let's carry on, though, and add a title and an _x_-axis label with the `main =` and the `xlab =` arguments.
69 |
70 | ```{r Strip plot with titles}
71 | stripchart(df$Age,
72 | method = "jitter",
73 | main = "Age of patient cohort",
74 | xlab = "Ages (in years)")
75 | ```
76 |
77 | The `method = "stack"` argument creates nice _skyscrapers_ instead of the jitter used above. This might be a better option. Let's see.
78 |
79 | ```{r Strip plot with the stack option}
80 | stripchart(df$Age,
81 | method = "stack",
82 | main = "Age of patient cohort",
83 | xlab = "Ages (in years)")
84 | ```
85 | In essence this is very much a bar plot, though, so let's take a look.
86 |
87 | ## Bar plot
88 |
89 | Bar plots give an indication of the number of times each element in the sample space of a variable occurs. As such they are excellent for categorical variables, but can also be used for discrete numerical variables. Since both categorical variables and discrete numerical values are, well, _discrete_, the bars have gaps between them to indicate that each of the items in the sample space is distinct from the others. They do not form a continuum as continuous numerical variables do.
90 |
91 | The height of each bar indicates the number of times an element from the sample space of a variable appears in the data. There are a number of options for bar plots. Let's take a look at these and also introduce some new options for plots in R.
92 |
93 | ### Simple bar plot
94 |
95 | A simple bar plot shows vertical bars, representing the count of each unique sample space element for the discrete variable.
96 |
97 | ```{r Simple bar plot}
98 | barplot(table(df$Group),
99 | main = "Number of patients in each group",
100 | xlab = "Group",
101 | ylab = "Count",
102 | las = 1) # Turns the y-axis values upright
103 | ```
104 |
105 | ### Horizontal bar plot
106 |
107 | By setting the `horiz =` keyword argument to `TRUE`, we get horizontal bars.
108 |
109 | ```{r Horizontal bar plot}
110 | barplot(table(df$Survey),
111 | main = "Survey question",
112 | ylab = "Survey question answers",
113 | xlab = "Count",
114 | horiz = TRUE,
115 | las = 1)
116 | ```
117 |
118 | We captured the ordinal categorical data point values for `Survey` as the numerical values $1, 2, 3, 4$, and $5$. In fact, they represent the sample space values _Strongly disagree, Disagree, Neither, Agree_, and _Strongly agree_. We can add these labels using the `names.arg =` keyword argument.
119 |
120 | ```{r Bar plot with labels}
121 | barplot(table(df$Survey),
122 | main = "Survey question",
123 | xlab = "Survey question answers",
124 | ylab = "Count",
125 | names.arg = c("Strongly disagree",
126 | "Disagree",
127 | "Neither",
128 | "Agree",
129 | "Strongly agree"),
130 | las = 1)
131 | ```
132 |
133 | ### Patterned bar plot
134 |
135 | Instead of a solid color, we can add a fine pattern to fill the bars. Below, we stipulate the border color for the bars, use grey as the fill color, and set a pattern with a low density.
136 |
137 | ```{r Patterned horizontal bar plot}
138 | barplot(table(df$Survey),
139 | main = "Survey question",
140 | ylab = "Survey question answers",
141 | xlab = "Count",
142 | horiz = TRUE,
143 | border = "black",
144 | col = "grey",
145 | density = 20,
146 | las = 1)
147 | ```
148 |
149 | ### Colored bar plot
150 |
151 | Although most journals require monochrome figures, R does not limit our ability to add color. Below, we specify the color of each of the five bars. Many colors in R have specific names, most of which are as we would expect.
152 |
153 |
154 | ```{r Colored bar plot}
155 | barplot(table(df$Survey),
156 | main = "Survey question",
157 | ylab = "Survey question answers",
158 | xlab = "Count",
159 | horiz = TRUE,
160 | border = "black",
161 | col = c("red", "grey", "grey", "grey", "grey"),
162 | density = 20,
163 | las = 1)
164 | ```
165 |
166 | A common form of bar charts is the stacked bar chart. We can use it to show the counts of a discrete variable for more than one categorical group.
167 |
168 | ```{r Stacked bar plot}
169 | barplot(table(df$Group, df$Survey),
170 | main = "Survey question per group",
171 | xlab = "Survey question answers",
172 | ylab = "Count",
173 | border = "black",
174 | col = c("black", "grey"),
175 | legend = rownames(table(df$Group, df$Survey)),
176 | density = 20,
177 | las = 1)
178 | ```
179 |
180 | ### Grouped bar plot
181 |
182 | For those that find stacked charts a bit uninformative (which can happen when the counts are very close), we can group the bars.
183 |
184 | ```{r Patterned bar plot in groups}
185 | barplot(table(df$Group,
186 | df$Survey),
187 | main = "Survey question per group",
188 | xlab = "Survey question answers",
189 | ylab = "Count",
190 | border = "black",
191 | col = c("black", "grey"),
192 | legend = rownames(table(df$Group, df$Survey)),
193 | density = c(20, 40),
194 | beside = TRUE,
195 | las = 1)
196 | ```
197 |
198 | ## Histogram
199 |
200 | The histogram gives a count of the number of times a value, from a continuous numerical variable, appears. Since these variables are continuous, though, we have to create _bins_, i.e. lower and upper bounds between which the occurrences are counted. Let's take a look at an example to make this clear. We use the `hist()` function to accomplish this.
201 |
202 | ```{r Histogram of patient age with density pattern}
203 | hist(df$Age,
204 | main = "Histogram of patient ages",
205 | xlab = "Age (in years)",
206 | ylab = "Count",
207 | density = 10, # Density pattern
208 | las = 1)
209 | ```
210 |
211 | It seems as if the default plot created bins in five-year intervals, so that we ended up with $10$ bins. We can use the `breaks =` argument to change this behavior. The value that we pass to this argument suggests the number of bins (R may adjust it slightly to get neat boundaries).
212 |
213 | ```{r Histogram of ages with five bins}
214 | hist(df$Age,
215 | breaks = 5,
216 | main = "Histogram of patient ages",
217 | xlab = "Age (in years)",
218 | ylab = "Count",
219 | las = 1)
220 | ```
221 |
222 | Five breaks were formed, allowing for the creation of six bins.
223 |
224 | We can add a count to each bar with the `labels = TRUE` argument. Sometimes the tops of the numbers are cut off. To alleviate this problem, we create a bit more space by increasing the $y$-axis range. Below, the `ylim =` keyword argument is set to the range `0` to `90`.
225 |
226 | ```{r Histogram with labels}
227 | hist(df$Age,
228 | labels = TRUE, # Adding labels
229 | main = "Histogram of patient ages",
230 | xlab = "Age (in years)",
231 | ylab = "Count",
232 | ylim = c(0, 90),
233 | las = 1)
234 | ```
235 |
236 | Now we have $10$-year bins. We can actually pass a vector of values to determine the bin boundaries ourselves. We include the minimum and maximum values as the outer bounds. Note that we need not create bins of equal width.
237 |
238 | ```{r Histogram with specified bin boundaries}
239 | hist(df$Age,
240 | breaks = c(min(df$Age),50, 70, 80, max(df$Age)),
241 | main = "Histogram of patient ages",
242 | xlab = "Age (in years)",
243 | ylab = "Relative frequency")
244 | ```
245 |
246 | Under these circumstances, we no longer have a count, but rather a relative frequency (fractional count). This can also be forced with the `freq = FALSE` argument.
247 |
248 | You may wonder in which bin the border cases are included. In the plot above, for example, we need to know whether a patient who is exactly $50$ years of age is counted in the left-most bin or in the larger (broader) bin to its right. By default the bins are closed on their upper bound (`right = TRUE`), so this patient is counted in the left-most bin. Setting the `right = FALSE` argument changes this behavior so that border values are counted in the bin above.
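
A minimal sketch of switching this behavior for the same bin boundaries as above:

```{r Histogram with left-closed bins}
hist(df$Age,
     breaks = c(min(df$Age), 50, 70, 80, max(df$Age)),
     right = FALSE, # Bins are now closed on their lower bound, so border values move up a bin
     main = "Histogram of patient ages",
     xlab = "Age (in years)")
```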
249 |
250 | ```{r Histogram for more than one group}
251 | transparent_dark_grey = rgb(0.2, 0.2, 0.2, 0.5) # Specifying the RGBA values for the color
252 | transparent_light_grey = rgb(0.8, 0.8, 0.8, 0.5)
253 | hist((df %>% filter(Group == "I") %>% select(Age))$Age,
254 | col = transparent_dark_grey,
255 | main = "Age distribution for each group",
256 | xlab = "Age",
257 | ylab = "Count")
258 | hist((df %>% filter(Group == "II") %>% select(Age))$Age,
259 | col = transparent_light_grey,
260 | add = TRUE)
261 | legend("topright",
262 | legend = c("Group I", "Group II"),
263 | col = c(transparent_dark_grey, transparent_light_grey),
264 | pt.cex = 2,
265 | pch = 15)
266 | ```
267 |
268 | ## Density plots
269 |
270 | A density plot uses a mathematical function to _smooth out_ the histogram. It is a great visual aid for looking at the shape of the distribution of the data point values. Below, we note that our patient ages are only somewhat normally distributed.
271 |
272 | ```{r A simple density plot}
273 | plot(density(df$Age),
274 | main = "Distribution of patient ages",
275 | xlab = "Age (in years)",
276 | ylab = "Density",
277 | lwd = 2)
278 | ```
279 |
280 | ## Box and whisker plots
281 |
282 | These are the most common plots used to show the distribution of numerical variables for one or more categorical or discrete groups. In the code chunk below, we look at the age distribution of each of the two groups.
283 |
284 | ```{r Box plot of ages per group}
285 | boxplot(df$Age ~ df$Group,
286 | col = c(transparent_dark_grey, transparent_light_grey),
287 | boxwex = 0.4, # Width of boxes as a fraction
288 | main = "Age per group",
289 | xlab = "Group",
290 | ylab = "Age")
291 | legend("topright",
292 | legend = c("Group I", "Group II"),
293 | col = c(transparent_dark_grey, transparent_light_grey),
294 | pt.cex = 2,
295 | pch = 15)
296 | ```
297 |
298 | The dark middle bar indicates the value of the median (or second quartile). The upper and lower edges of the boxes are the values for the third and first quartiles. If there are no statistical outliers, the whiskers stretch out to the maximum (top) and minimum (bottom) values. Values that are more than $1.5$ times the interquartile range above the third quartile or below the first quartile are indicated as separate dots beyond the whiskers; these are the statistical _outliers_, and the whiskers then only extend to the most extreme values within this limit.
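
As a quick sketch, these whisker limits (often called fences) can be computed directly from the quartiles and the interquartile range:

```{r Outlier fences for age}
q <- quantile(df$Age, c(0.25, 0.75)) # First and third quartiles
iqr <- IQR(df$Age)
c(lower = q[[1]] - 1.5 * iqr,
  upper = q[[2]] + 1.5 * iqr) # Values beyond these fences plot as separate dots
```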
299 |
300 | ## Scatter plots
301 |
302 | Scatter plots are used to show the correlation between two numerical variables across the subjects in a dataset. Each dot represents one subject, with the value of one variable on the $x$-axis and the value of the other variable on the $y$-axis.
303 |
304 | ```{r Age vs weight}
305 | plot(df$Age, # Independent variable (x-axis)
306 | df$Weight, # Dependent variable (y-axis)
307 | main = "Patient age vs weight",
308 | xlab = "Age (in years)",
309 | ylab = "Weight (in lbs)")
310 | ```
311 |
312 | In the chapter on correlation and linear models, we will learn how to draw a line as a model for our data. We add this with the `abline()` function below. It is the `lm()` function, used as its argument, that does the actual calculation. More on this in that upcoming chapter.
313 |
314 |
315 | ```{r Adding a regression line}
316 | plot(df$Age, # Independent variable (x-axis)
317 | df$Weight, # Dependent variable (y-axis)
318 | main = "Patient age vs weight",
319 | xlab = "Age (in years)",
320 | ylab = "Weight (in lbs)")
321 | abline(lm(df$Weight ~ df$Age))
322 | ```
323 |
324 | There are a great number of arguments that we can use to control the look of a plot. Have a look at the code comments in the plot below to get a glimpse of the power of plotting in R.
325 |
326 | ```{r Controlling the axes}
327 | plot(df$Age, # Independent variable (x-axis)
328 | df$Weight, # Dependent variable (y-axis)
329 | main = "Patient age vs weight",
330 | xlab = "Age (in years)",
331 | ylab = "Weight (in lbs)",
332 | axes = FALSE)
333 | # x-axis
334 | # Small tick labels
335 | par(tcl = 0.1) # Tick length of +0.1 (protruding into plot)
336 | axis(1,
337 | at = seq(30, 90, by = 1), # From 30 to 90, stepsize 1
338 | labels = FALSE) # Don't add number labels
339 | # Slightly taller tick marks every 5 steps
340 | par(tcl = 0.2)
341 | axis(1,
342 | at = seq(30, 90, by = 5),
343 | labels = FALSE)
344 | # Major tick marks every 10 years
345 | par(tcl = -0.5)
346 | axis(1,
347 | at = seq(30, 90, by = 10))
348 | # y-axis
349 | # Small tick marks every 1 lb
350 | par(tcl = 0.1)
351 | axis(2,
352 | at = seq(110, 230, by = 2),
353 | labels = FALSE)
354 | par(tcl = -0.5)
355 | axis(2,
356 | at = seq(110, 230, by = 20))
357 | ```
358 |
359 | ## Multiple plots
360 |
361 | At times it makes sense to create more than one plot in a figure. R allows for fine control over the placement of the different plots.
362 |
363 | In the code below, we use a single row with two columns for our layout. We plot both a bar plot and a pie plot of the same data.
364 |
365 | A quick note on pie plots: pie plots are not good at providing accurate visual information, and hence were not given a section of their own in this tutorial. Even the help section states the following: _Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data._
366 |
367 | ```{r Combining two plots in a row}
368 | grey_five <- c(rgb(0.1, 0.1, 0.1),
369 | rgb(0.3, 0.3, 0.3),
370 | rgb(0.5, 0.5, 0.5),
371 | rgb(0.7, 0.7, 0.7),
372 | rgb(0.9, 0.9, 0.9)) # Specifying five shades of grey
373 | par(mfrow = c(1, 2)) # One row, two columns
374 | barplot(table(df$Survey),
375 | main = "Bar plots",
376 | col = grey_five)
377 | pie(table(df$Survey),
378 | main = "Pie plots are bad",
379 | radius = 1,
380 | col = grey_five)
381 | ```
382 |
383 | The graphical parameter `par(fig =)` lets us control the location of a figure precisely in a plot.
384 |
385 | We need to provide the coordinates in a normalized form, i.e. as `c(x1, x2, y1, y2)`. For example, the whole plot area would be `c(0, 1, 0, 1)` with `(x1, y1) = (0, 0)` being the lower-left corner and `(x2, y2) = (1, 1)` being the upper-right corner.
386 |
387 | Note: we have used parameters `cex` to decrease the size of labels and `mai` to define margins.
388 |
389 | ```{r Specifying exact positions}
390 | # make labels and margins smaller
391 | par(cex = 0.7, mai = c(0.1, 0.1, 0.2, 0.1))
392 | # define area for the histogram
393 | par(fig = c(0.1, 0.7, 0.3, 0.9))
394 | hist(df$Age,
395 | main = "Three plots of patient age")
396 | # define area for the boxplot
397 | par(fig = c(0.8 ,1 ,0 ,1 ),
398 | new = TRUE)
399 | boxplot(df$Age)
400 | # define area for the stripchart
401 | par(fig = c(0.1, 0.67, 0.1, 0.25),
402 | new = TRUE)
403 | stripchart(df$Age,
404 | method = "jitter")
405 | ```
406 |
407 | ## Saving plots for your reports
408 |
409 | Most manuscripts include plots as figures. Plots created in R can be saved to your computer drive. Below we use the `png()` function to save portable network graphics files. This is a common file type for pictures, with a small file size and good quality.
410 |
411 | The `png()` function allows us to specify a picture size (width and height). The default unit is pixels. R also provides for saving in other file types. Look up `bmp`, `jpeg`, and `tiff` in the _Help_ tab on the bottom right panel in RStudio.
412 |
413 | Once the code for saving the file is complete, remember to use the `dev.off()` function.
414 |
415 | ```{r Saving a file}
416 | png(file = "Fig 1.png", # Directory or folder set by setwd(getwd())
417 | width = 600,
418 | height = 600)
419 | set.seed(1234)
420 | boxplot(rchisq(100, df = 2),
421 | col = "gold",
422 | boxwex = 0.4,
423 | main = "c-Reactive Protein",
424 | ylab = "CRP")
425 | dev.off() # Search for dev.off in the Help tab
426 | ```
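
As a sketch of one of those alternative file types, saving a figure as a JPEG looks very similar (the file name below is just an example):

```{r Saving a JPEG file, eval=FALSE}
jpeg(filename = "Fig example.jpg", # Example file name; saved in the working directory
     width = 600,
     height = 600)
hist(df$Age,
     main = "Histogram of patient ages")
dev.off()
```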
427 |
428 | ## Conclusion
429 |
430 | Whole textbooks can be written on plotting in R. It is best to explore all the plots that are available. Fortunately, there is a lot of information in the Help section in RStudio. Frequently consult this section to learn more about all the different arguments that can be used to modify your plots.
431 |
432 | Plots tell a visual story and help us understand our data. Use them liberally in your data analysis.
433 |
434 |
--------------------------------------------------------------------------------
/02_Working_with_data.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "02 Working with data"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | number_sections: no
7 | toc: yes
8 | word_document:
9 | toc: yes
10 | ---
11 |
12 |
17 |
18 | 
19 |
20 | ```{r setup, include=FALSE}
21 | knitr::opts_chunk$set(echo = TRUE)
22 | #setwd(getwd())
23 | ```
24 |
25 | ## Preamble
26 |
27 | In this lecture, we will learn how to import a spreadsheet file containing data for a simulated research project. Once imported, we can extract specific parts of our data to analyze it and thereby answer our research questions.
28 |
29 | In order to extract this data, we need to learn about _addressing_ or _indexing_. This concept simply refers to an address given to each data point value in a dataset. The same happens in a Microsoft Excel file, where every cell is named for its column and row value, i.e. _B7_, referring to the data in the cell in column _B_ and in row _7_.
30 |
31 | Along this exciting journey, we are going to learn about the setting of a working directory, the R library (package) ecosystem, indexing, importing spreadsheet files, and manipulating the data in the file using the _verbs_ that are available in one of the libraries that we are going to import, namely `dplyr`.
32 |
33 | ## Setting the working directory
34 |
35 | RStudio has a very powerful project structure. For now, though, it is easier to simply save an R markdown file (this document) and a spreadsheet file in the same directory (folder) on your computer drive.
36 |
37 | Once you create an R markdown file, hit the Save button (or go to File > Save on the menu bar) to save the file with an appropriate name, in an appropriate directory (folder). Then make sure that the spreadsheet file is in the same directory (folder).
38 |
39 | We can use code to find the address on our computer drive where the file was actually stored. The `getwd()` function stands for _get working directory_. Note that you have to save the R markdown file first, before calling this function. Below, we create a code chunk by hitting __CTRL ALT I__ ( __CMD OPT I__ on a Mac). Remember to get into the habit of writing a unique, short, and descriptive name for the chunk following the `{r}`. Also leave some code comments using hash-tags.
40 |
41 | ```{r Getting the current working directory}
42 | # Looking for the current working directory on my computer drive
43 | getwd()
44 | ```
45 |
46 | The directory (folder) that contains your file is shown. Note that even on a Windows computer, forward slashes are used. If you are familiar with Windows, you know that it uses backslashes to show the directory structure.
47 |
48 | Now, we need to tell R to actually use this directory (folder) as its active directory (folder) when searching for the spreadsheet file that we might want to import. This is why saving the two files together is so important. If the spreadsheet file is in another directory (folder), you have to type the complete directory (folder) path to that file. That is just too much work! To set the active directory (folder) to the current directory (folder), we use the `setwd()` function and place `getwd()` as its argument. An argument is the information that a function uses to _do its job_.
49 |
50 | ```{r Set the active working directory, message=FALSE, warning=FALSE}
51 | # The keyword (function) is setwd() and the argument is another function called getwd()
52 | setwd(getwd())
53 | ```
54 |
55 | Before we get to importing and working with spreadsheet files, let's create some simulated data so that we can look at indexing (or addressing).
56 |
57 | ## Creating computer variables to work with
58 |
59 | First, we will create a computer variable called `age` to hold $20$ random values between $15$ and $85$ (inclusive). The `sample()` function draws from a discrete uniform distribution, where every value in the sample space of the statistical variable (i.e. $15, 16, 17, \ldots, 85$) has an equal likelihood of being selected.
60 |
61 | ```{r Creating an age variable}
62 | set.seed(123) # So that we all get the same pseudorandom numbers
63 | age <- sample(15:85, # The range to select from
64 | 20, # The number of values required
65 |               replace = TRUE) # Replace means that a selected number remains available for the next draws
66 | ```
67 |
68 | Now that we have a computer variable called `age`, we can introduce the `length()` function. It counts how many elements there are in the object that we pass as its argument.
69 |
70 | ```{r How many data point values are there}
71 | length(age)
72 | ```
73 |
74 | Indeed, because we asked for $20$ elements when we created our vector object, we get an answer of `20`.
75 |
76 | We can also print all $20$ values to the screen by just calling the computer variable.
77 |
78 | ```{r Show all the values}
79 | age
80 | ```
81 |
82 | ## Using indexing (addressing) to select specific values
83 |
84 | You will notice that the `age` variable appears in the Environment tab on the top right-hand side of the screen (if you left the Panel settings in RStudio at its default values). It states `int [1:20]` and then starts listing the actual values. From this, we can guess that every element's address is the number $1$ through $20$. This is indeed so. Below, we use square brackets (indicating indexing) and ask for the value with an index (address) of `1`.
85 |
86 | ```{r Show the first value}
87 | age[1]
88 | ```
89 |
90 | When we look at the values we printed out above, we see that $35$ was indeed the first value. Remember that this will only be true for you if you also used `set.seed(123)`. Omitting the `set.seed()` function will result in different values being generated each time the code chunk is executed. A different integer value as the seed will also result in a different set of values. Let's have a look at the fifth value, which should be $81$.
91 |
92 | ```{r Show the fifth value}
93 | age[5]
94 | ```
95 |
96 | We can view a range of values by using the range symbol, `:`. Let's get the first five values, i.e. `1:5`.
97 |
98 | ```{r Values 1 through 5}
99 | age[1:5]
100 | ```
101 |
102 | If we only want values one and five, we pass these values as an atomic vector, using the `c()` function as address.
103 |
104 | ```{r Values 1 and 5}
105 | age[c(1, 5)]
106 | ```
107 |
108 | Now we consider something a bit more interesting. What about selecting values based on a rule? In the code chunk below, we only ask for values that are equal to or larger than $50$. Note that we use the computer variable name twice!
109 |
110 | ```{r Only values equal to or large than 50}
111 | age[age >= 50]
112 | ```
113 |
114 | We can specify more than one rule. Below, we ask for all values that are less than $30$ __or__ more than $50$. The symbol for or is the straight, vertical line, which is above the enter key on my keyboard.
115 |
116 | ```{r Vales less than 30 or larger than 50}
117 | age[age < 30 | age > 50]
118 | ```
119 |
120 | Note that every value returned is either below $30$ or larger than $50$. This code actually runs through all of the $20$ values in our set, one by one. For each one it asks the question: "Is this value less than $30$ or larger than $50$?". This is referred to as a logical operation or Boolean logic. If the answer is yes, a logical value of `TRUE` is returned and the value is included in the final result. If not, a logical value of `FALSE` is returned and the value is not included.
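
We can see these logical values for ourselves by running just the comparison (shown here for the first five values only):

```{r The underlying logical values}
age[1:5] < 30 | age[1:5] > 50 # TRUE where a value is below 30 or above 50
```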
121 |
122 | ## Importing a library
123 |
124 | Packages or libraries are written by the huge community of developers of the R language from all over the world. They kindly donate their skills and time to make our lives easier! Libraries are installed in the Packages tab on the bottom right panel. Simply type the name of the package in the search box after hitting the Install button. Allow time for the package to be installed from the internet and answer _Yes_ if any question is asked in a pop-up box.
125 |
126 | After installation, we have to import the library for use in our current R markdown file. So, go ahead and first install the `readr` and the `dplyr` libraries. Below, we then import them (individually) using the `library()` function.
127 |
128 | Note that after the name of the chunk, is a comma and then `message=FALSE` and then another comma and the `warning=FALSE`. This can be added manually, or by deselecting the buttons after hitting the little gear icon of a chunk. The effect is that messages and warnings that are sometimes displayed on importing libraries are not shown. In most cases, these are superfluous.
129 |
130 | ```{r Importing readr and dplyr, message=FALSE, warning=FALSE}
131 | library(readr)
132 | library(dplyr)
133 | ```
134 |
135 | The `dplyr` library contains all the function which we will use to extract data from our imported spreadsheet file.
136 |
137 | ## Importing a data file
138 |
139 | The `readr` library contains a function called `read_csv()`. There is also a `read.csv()` function in base R (available without importing the `readr` library). The `read_csv()` function is more useful at times, though. It creates an object called a _tibble_. A tibble has some advantages over a traditional _data frame_, which is the result of importing a file with the `read.csv()` function. Tibbles print nicely to most screens and they import text columns as, well, text (character) variables, as opposed to factors. More about this in future tutorials.
140 |
141 | Note that the spreadsheet file is in `.csv` or _comma separated values_ format. This is much more generic than Microsoft's `.xlsx` (Excel) format. When you are in Excel, simply save the file as a `.csv` file. It strips away all the bells-and-whistles added by Microsoft that really just clutter things up. If given a choice, always select __MS-DOS csv__ when saving a spreadsheet file.
142 |
143 | In the chunk below, we import the `ProjectData.csv` file. Because it is in the same directory (folder) as the R markdown file, we can just refer to it by name, without using a long computer drive address, thanks to the `setwd(getwd())` call above. We import the spreadsheet file into the computer variable `df`.
144 |
145 | ```{r Importing a csv file, message=FALSE}
146 | df <- read_csv("ProjectData.csv")
147 | ```
148 |
149 | ## Exploring the data
150 |
151 | Note that there is now a `df` variable in the Environment tab on the top right of the interface. To the far right of it is a little spreadsheet file icon. Click it to open a spreadsheet-like view of the imported data. You can also write `View(df)` in a code chunk to achieve the same result. A tab will open next to the current file's tab at the top of the left-hand, upper panel.
152 |
153 | The first thing to do is to look at the size of our data, i.e. how many rows of data (how many subjects in our study) and how many variables (columns). We use the `dim()` function, short for dimensions, to do this.
154 |
155 | ```{r Size of the data}
156 | dim(df)
157 | ```
158 |
159 | We can view the statistical variable names (in the columns in row 1), using the `names()` function.
160 |
161 | ```{r Statistical variables}
162 | names(df)
163 | ```
164 |
165 | We see a list of the eight variables. Note how every variable name is descriptive of the data that it represents. Note also that there are no _illegal_ characters in the variable names, such as spaces, parentheses, and so on. (See chapter 01.)
166 |
167 | To see the values in a variable (each row in that column), we use the dollar `$` notation. Let's look at the chunk below. We do not want to list all $500$ values, so we use indexing (with square brackets) to show the first five rows of the _Age_ statistical variable.
168 |
169 | ```{r First 5 age value}
170 | df$Age[1:5]
171 | ```
172 |
173 | What about the first five _Difference_ values?
174 |
175 | ```{r First 5 difference values}
176 | df$Difference[1:5]
177 | ```
178 |
179 | Now the first five _Group_ values.
180 |
181 | ```{r First 5 group values}
182 | df$Group[1:5]
183 | ```
184 |
185 | Have a look at the first five values for each of the variables. Can you identify the variable type?
186 |
187 | The `head()` function displays the first six rows of all the variables. Don't use it too often. If you have many variables, printing to the screen will be difficult.
188 |
189 | ```{r First six rows}
190 | head(df)
191 | ```
192 |
193 | We are not restricted to the first six rows, but can actually specify the number of rows with an argument.
194 |
195 | ```{r First 10 data point values of all the variables}
196 | head(df,
197 | 10) # Placing each argument on its own line makes code more readable
198 | ```
199 |
200 | ## The dplyr library
201 |
202 | The `dplyr` library provides some of the most useful functions to have been added to R. They follow a convention that is slightly different from base R and is referred to as the _tidyverse_, or sometimes (against his will) as the _Hadley-verse_, after its inventor, Hadley Wickham.
203 |
204 | The tidyverse set of libraries allows for the use of the pipe operator, `%>%`. It is a frightful thing when you first start using it, but an absolutely powerful delight when you get used to it.
205 |
206 | The functions in `dplyr` are called _verbs_ and for good reason. Let's start by looking at the `filter()` function. Note that when referring to `dplyr`, the terms __function__ and __verb__ will be used interchangeably.
207 |
208 | ### filter
209 |
210 | Our first function in `dplyr` is `filter()`. As with indexing (addressing), it allows us to provide a recipe for selecting data. So, let's look down the _Age_ column and return only the rows with values greater than $80$. You will note that the first argument is the actual computer variable that holds our spreadsheet file. The second argument references the statistical variable (column name) and the recipe (using Boolean logic). Note that we don't use the `$` notation with these verbs.
211 |
212 | ```{r Filter only Age data point values larger than 80}
213 | filter(df, Age > 80)
214 | ```
215 |
216 | We see the rows of the dataset returned, but only for those patients who are older than $80$.
217 |
218 | Before we carry on, let's plunge right into using the pipe operator. The sooner we rip that plaster off, the better. Are you ready? Here comes the pain.
219 |
220 | ```{r Using the pipe operator to do the same as above}
221 | df %>% filter(Age > 80)
222 | ```
223 |
224 | It's actually quite easy. We take out the first argument, which is the computer variable name, and put it before the pipe. The pipe actually just says: "Take what is to the left of me and pass it as the first argument to the function that is to my right. Thank you very much for your cooperation." By the way, the keyboard shortcut for the pipe operator is __SHFT CTRL M__ ( __SHFT CMD M__ on a Mac).
225 |
226 | Let's try that again and select only patients that have a _I_ marked as their _Group_ data point value.
227 |
228 | ```{r Filtering only the Group I patients}
229 | df %>% filter(Group == "I")
230 | ```
231 |
232 | We use the double equal sign, `==` in the case of Boolean logic. As mentioned before, it goes row by row and checks whether the value is indeed _I_. If so, an answer of `TRUE` is returned and that row is included.
233 |
234 | Now let's get fancy and provide two recipes. Patients older than $80$ __and__ who are in group I. The symbol for _and_ is `&`. Each recipe is placed in its own set of parentheses.
235 |
236 | ```{r Age over 80 in group I}
237 | df %>% filter((Age > 80) & (Group == "I"))
238 | ```
239 |
240 | Great, let's go for another verb.
241 |
242 | ### select
243 |
244 | The `select()` function allows us to return only the statistical variables (columns) in which we are interested. Let's only get the _Age_ variable.
245 |
246 | ```{r Select only the Age variable}
247 | df %>% select(Age)
248 | ```
249 |
250 | Great. Now for something very fancy. It showcases the power of the pipe operator. We want only patients who are older than $80$ and who are in group I and then we only want to see the _SideEffects_ and the _Survey_ variables.
251 |
252 | ```{r Age over 80 in group I showing only the side effects and survey variables}
253 | df %>% filter((Age > 80) & (Group == "I")) %>%
254 | select(SideEffects, Survey)
255 | ```
256 |
257 | Whoa!
258 |
259 | In the first tutorial, we created a computer variable by giving it an appropriate name and then we either manually entered some values using the `c()` function or we created random values using the `sample()` or the `rnorm()` functions. By using the `filter()` and the `select()` functions, we can create similar computer variables to hold lists of values that we can then pass as arguments to the statistical tests that we want to conduct.
260 |
261 | So, let's set up a very simple research question. We want to divide our patients into those that are in group I and those that are in group II. We then want to create a list of the `CRP` values for each group. Let's name the first list of CRP values `group_I_CRP` and the group II equivalent `group_II_CRP`. We will combine our two functions, `filter()` and `select()`, to do this.
262 |
263 | ```{r Creating a tibble of group I patient CRP value}
264 | group_I_CRP <- df %>%
265 | filter(Group == "I") %>%
266 | select(CRP)
267 | ```
268 |
269 | Actually, we just have a new tibble with a single variable called `CRP`. We can convert the single variable into a vector (list of elements) using the `as.vector()` function and passing the tibble with the dollar sign, indicating the CRP column, as argument.
270 |
271 | ```{r Creating a vector}
272 | group_I_CRP <- as.vector(group_I_CRP$CRP)
273 | ```
274 |
275 | Let's do the same for group II patients.
276 |
277 | ```{r Creating a list of group II patient CRP values}
278 | group_II_CRP <- df %>%
279 | filter(Group == "II") %>%
280 | select(CRP)
281 | group_II_CRP <- as.vector(group_II_CRP$CRP)
282 | ```
283 |
284 | Another way to go about this is simply to create two tibbles. This is especially handy if we have a lot of research questions pertaining to comparisons between the two groups (created from the `Group` variable). Let's do just this and create a `group_I` and a `group_II` tibble.
285 |
286 | ```{r Creating two tibbles}
287 | group_I <- df %>%
288 | filter(Group == "I")
289 | group_II <- df %>%
290 | filter(Group == "II")
291 | ```
292 |
293 | If you now look at the Environment tab on the top right, you will see three tibbles and three computer variables that are simply vectors.
294 |
295 | ### mutate
296 |
297 | The `mutate()` function allows us to create new variables (columns) in the dataset. If we look at the first six entries for the `Weight` column, we note that the values seem to be in pounds.
298 |
299 | ```{r The first six weight data point values}
300 | head(df$Weight)
301 | ```
302 |
303 | Let's create a new variable (column) which contains the weight of each subject in kilograms. There are about $2.21$ pounds in each kilogram, so we have to divide each value by $2.21$. We do this through the concept of _broadcasting_. Below is a simple example of a vector of five values, each of which we multiply by $5$ by using R's ability to _broadcast_.
304 |
305 | ```{r Broadcasting}
306 | 5 * c(1, 2, 3, 4, 5)
307 | ```
308 |
309 | Note how the multiplication by $5$ is _broadcasted_ to each element in the vector object.
310 |
311 | Here goes the `mutate()` function.
312 |
313 | ```{r Creating a new variable called Weight_kg}
314 | df %>% mutate(Weight_kg = Weight / 2.21)
315 | ```
316 |
317 | This added the new variable, which we called `Weight_kg`. The change is not permanent, though.
318 |
319 | ```{r The change is not permanent}
320 | names(df)
321 | ```
322 |
323 | The easiest way to make the change permanent is to overwrite the original computer variable.
324 |
325 | ```{r Making the change permanent}
326 | # Returning the names and types of the column headers (variables)
327 | df <- df %>% mutate(Weight_kg = Weight / 2.21)
328 | names(df)
329 | ```
330 |
331 | One more important use for the `mutate()` function is the creation of a new categorical variable by _binning_ a numerical variable. With _binning_, we refer to creating lower and upper bounds and assigning a sample space element for a categorical variable for each numerical data point value.
332 |
333 | To do this, let's first look at the `pull()` function. It extracts a column (or selected data point values in a column) as a list of values.
334 |
335 | ```{r Extracting }
336 | crp <- pull(df %>% select(CRP))
337 | ```
338 |
339 | We can now, for instance, decide to create three bins by dividing the range of the data point values for the `CRP` variable into three intervals of equal width (the number of values falling into each bin will generally differ). To do this, we use the `cut()` function.
340 |
341 | ```{r Creating three bins}
342 | cut(crp,
343 | 3)
344 | ```
345 |
346 | Now we note a categorical data point value for each of the $500$ numerical `CRP` values. Because we did not provide three sample space names, we note (at the bottom) three _Levels_ names. Levels indicate that this is a categorical variable and state the elements in the sample space. The `(` and `]` notation is mathematical in nature: the former indicates an open (excluded) boundary and the latter a closed (included) boundary. So the first element covers values from $4.49$ to $6.53$, with the latter included, i.e. if a data point value is exactly $6.53$, it will be included in that bin.
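
As a quick sketch, we can count how many of the $500$ values fall into each of the three bins with the `table()` function:

```{r Counting the values in each bin}
table(cut(crp,
          3)) # Number of CRP values in each interval
```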
347 |
348 | Let's define our data point values for the sample space, perhaps `Low`, `Medium`, and `High`. We also use the `mutate()` function to create a new variable. We will call it `CRP_group`. To make the change to the tibble permanent, we overwrite the old tibble by simply stating `df <-`.
349 |
350 | ```{r Creating a new categorical variable}
351 | df <- df %>% mutate(CRP_group = cut(pull(df %>% select(CRP)),
352 | 3,
353 | labels = c("Low", "Medium", "High")))
354 | ```
355 |
356 | We can use the `table()` function to count the elements in the sample space of our new variable.
357 |
358 | ```{r Counting the Low Medium and High data point values}
359 | table(df$CRP_group)
360 | ```
361 |
362 | ## Conclusion
363 |
364 | For many research projects, data collection is done in the form of a spreadsheet. Even if database software using the structured query language (SQL) or other database frameworks is used, it is common to extract the data as a spreadsheet. R and the appropriate libraries make for a powerful tool in manipulating this data for analysis.
365 |
366 | The `filter()` and `select()` functions are the two most common verbs that we will use in our data analysis. There are more verbs in the `dplyr` library. We will meet them in future tutorials.
--------------------------------------------------------------------------------