├── Course_Data.zip ├── README.md ├── _config.yml ├── alternative ├── LM_Cheatsheet.html ├── LM_Cheatsheet.md ├── LM_Cheatsheet.pdf ├── Welcome_elearning_lin_mod.pdf └── timetable.md ├── anova+.Rmd ├── anova+.html ├── anova.Rmd ├── anova.html ├── anova.pdf ├── cheat_sheet.pdf ├── conclusion.pdf ├── data ├── Assay.txt ├── Bronchitis.csv ├── OscillationIndex.txt ├── amess.csv ├── clinicalTrials.txt ├── crab.csv ├── diet.csv ├── genotypes.txt ├── globalBreastCancerRisk.csv ├── lactoferrin.csv ├── myocardialinfarction.csv ├── pollution.csv ├── protein-expression.csv ├── students.csv └── treatments.txt ├── glm+.Rmd ├── glm+.html ├── glm.Rmd ├── glm.html ├── glm.pdf ├── gml.html ├── images ├── examplePlots.png └── plot-char.png ├── index.md ├── install.R ├── logos ├── CRUK_CI_logo.png ├── LMB_logo.png ├── LMB_logo_small.png └── Logos.txt ├── multiple_regression+.Rmd ├── multiple_regression+.html ├── multiple_regression.Rmd ├── multiple_regression.html ├── multiple_regression.pdf ├── r-recap.Rmd ├── r-recap.nb.html ├── simple_regression+.Rmd ├── simple_regression+.html ├── simple_regression.Rmd ├── simple_regression.html ├── simple_regression.pdf ├── time_series.pdf ├── time_series_analysis.Rmd ├── time_series_analysis.html └── timetable.pdf /Course_Data.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/Course_Data.zip -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to Linear Modelling with R 2 | 3 | ## Description 4 | 5 | The course will cover ANOVA, linear regression and some extensions. It will be a mixture of lectures and hands-on time using RStudio to analyse data. 6 | 7 | 8 | # Aims: During this course you will learn about: 9 | 10 | - ANOVA 11 | - Simple and multiple regression 12 | - Generalised Linear Models 13 | - Introduction to more advanced topics, like non-linear models and time series. 14 | 15 | # Objectives: After this course you should be able to 16 | 17 | - Realise the connection between t-tests, ANOVA and linear regression 18 | - Fit a linear regression 19 | - Check if the assumptions of linear regression are met by the data and what to do if they are not 20 | - Know when linear regression is not appropriate and have an idea of which alternative method might be appropriate 21 | - Know when you need to seek help with analysis as the data structure is too complex for the methods taught 22 | 23 | # Pre-requisites 24 | 25 | This course assumes basic knowledge of statistics and use of R, which would be obtained from our Introductory Statistics Course and an "Introduction to R for Solving Biological Problems" run at the Genetics department (or equivalent). 
26 | 
27 | - [Introduction to Solving Biological Problems with R](http://cambiotraining.github.io/r-intro/)
28 | - [Introduction to Statistical Analysis](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/)
29 | 
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-slate
2 | google_analytics: UA-63148050-14
3 | 
--------------------------------------------------------------------------------
/alternative/LM_Cheatsheet.md:
--------------------------------------------------------------------------------
1 | ---
2 | output:
3 |   html_document: default
4 |   pdf_document: default
5 | ---
6 | ![](logos/CRUK_CI_logo.png)![](logos/LMB_logo_small.png)
7 | ## Linear Modelling with R Course Cheatsheet
8 | 
9 | | **ANOVA** | Notes |
10 | | --- | --- |
11 | | aov() | linear model with categorical predictors |
12 | | oneway.test() | (heteroscedastic) linear model with a categorical predictor |
13 | | kruskal.test() | rank-based linear model with a categorical predictor |
14 | | t.test() | (homoscedastic or heteroscedastic) linear model with a binary predictor |
15 | | tapply() | apply a function to subsets of a vector defined by a grouping factor |
16 | | qqnorm() | normal quantile-quantile plot |
17 | | shapiro.test() | test of normality |
18 | | bartlett.test() | test of equality of variance between groups |
19 | 
20 | | **Simple Regression** | Notes |
21 | | --- | --- |
22 | | cor() | correlation between 2 variables |
23 | | cor.test() | test for (linear or rank) association between 2 variables |
24 | | residuals() | extract residuals from a model fit |
25 | | lm() | linear model fit |
26 | 
27 | | **Multiple Regression** | Notes |
28 | | --- | --- |
29 | | AIC() | Akaike's information criterion for a fitted model |
30 | | stepAIC() | AIC-based stepwise model selection |
31 | | nls() | non-linear least squares |
32 | 
33 | | **Generalised Linear Models** | Notes |
34 | | --- | --- |
35 | | install.packages() | install add-on R packages |
36 | | glm() | generalised linear model fit |
37 | | gamlss() | generalised linear and additive model fit |
38 | | anova() | comparison of nested models |
39 | | chisq.test() | Pearson's chi-square test |
40 | | prop.test() | test of equality of proportions |
41 | 
42 | | **Time Series and Non-Linear Models** | Notes |
43 | | --- | --- |
44 | | acf() | auto-correlation function |
45 | | pacf() | partial auto-correlation function |
46 | | arima() | ARIMA modelling of time series |
47 | 
48 | 
49 | 
50 | 
51 | 
52 | 
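As a quick illustration of how a few of the ANOVA entries above fit together, here is a minimal sketch (not part of the original cheatsheet) using a made-up data frame `dat` with a numeric response `y` and a three-level factor `g`:

```r
## hypothetical example data: numeric response y, grouping factor g with 3 levels
dat <- data.frame(y = rnorm(30), g = gl(3, 10, labels = c("A", "B", "C")))

tapply(dat$y, dat$g, mean)         # group means
fit <- aov(y ~ g, data = dat)      # one-way ANOVA (linear model with a categorical predictor)
summary(fit)                       # F test for differences between the group means
qqnorm(residuals(fit))             # normal quantile-quantile plot of the residuals
shapiro.test(residuals(fit))       # test of normality of the residuals
bartlett.test(y ~ g, data = dat)   # test of equality of variance between groups
```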
--------------------------------------------------------------------------------
/alternative/timetable.md:
--------------------------------------------------------------------------------
1 | ### [http://tinyurl.com/linear-models-r](http://tinyurl.com/linear-models-r)
2 | ## Introduction to Linear Models with R - Course Schedule
3 | 
4 | | Time | Topic |
5 | | ------------- | ------------- |
6 | | 09.45 - 10.15 | Welcome & Introduction to RStudio and Markdown (Mark) |
7 | | 10.15 - 11.30 | ANOVA (Dominique) {10 min coffee break at 11am*} |
8 | | 11.30 - 13.00 | Simple Regression (Rob) |
9 | | 13.00 - 14.00 | Lunch Break |
10 | | 14.00 - 15.15 | Multiple Regression (Rob) |
11 | | 15.15 - 15.30 | { 15 min Tea break* } |
12 | | 15.30 - 16.45 | GLMs (Dominique) |
13 | | 16.45 - 17.15 | Time-series (Rob) |
14 | | 17.15 - 17.25 | Conclusion |
15 | 
16 | *Coffee, Tea, Water & Cookies provided
17 | 
18 | ![CRUK Cambridge Institute](logos/CRUK_CI_logo.png)
19 | ![MRC Laboratory of Molecular Biology](logos/LMB_logo_small.png)
20 | 
21 | 
--------------------------------------------------------------------------------
/anova+.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "ANOVA with R: analysis of the *diet* dataset"
3 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri"
4 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
5 | output:
6 |   html_document:
7 |     theme: united
8 |     highlight: tango
9 |     code_folding: show
10 |     toc: true
11 |     toc_depth: 2
12 |     toc_float: true
13 |     fig_width: 8
14 |     fig_height: 6
15 | ---
16 | 
17 | 
18 | 
19 | 
20 | 
21 | 
22 | ```{r message = FALSE, warning = FALSE, echo = FALSE}
23 | # change working directory: should be the directory containing the Markdown files:
24 | # setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/20210514/Practicals/")
25 | 
26 | ```
27 | 
28 | 
29 | A full version of the dataset *diet* may be found online on the University of Sheffield website.
30 | 
31 | A slightly modified version is stored in the data file data/diet.csv. The data set contains information on 76 people who undertook one of three diets (referred to as diet _A_, _B_ and _C_). There is background information such as age, gender, and height. The aim of the study was to see which diet was best for losing weight.
32 | 
33 | 
34 | # Section 1: Data import and descriptive analysis
35 | 
36 | Let's start by
37 | 
38 | * importing the data set *diet* with the function `read.csv()`
39 | * defining a new column *weight.loss*, corresponding to the difference between the initial and final weights (columns `initial.weight` and `final.weight` of the dataset)
40 | * displaying _weight loss_ per _diet type_ (column `diet.type`) by means of a boxplot.
41 | 
42 | 
43 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
44 | diet = read.csv("data/diet.csv",row.names=1)
45 | diet$weight.loss = diet$initial.weight - diet$final.weight
46 | diet$diet.type = factor(diet$diet.type,levels=c("A","B","C"))
47 | diet$gender = factor(diet$gender,levels=c("Female","Male"))
48 | boxplot(weight.loss~diet.type,data=diet,col="light gray",
49 |         ylab = "Weight loss (kg)", xlab = "Diet type")
50 | abline(h=0,col="blue")
51 | ```
52 | 
53 | # Section 2: ANOVA
54 | 
55 | Let's
56 | 
57 | * perform Fisher's, Welch's and Kruskal-Wallis one-way ANOVA, respectively by means of the functions `aov()`, `oneway.test()` and `kruskal.test()`,
58 | * display and analyse the results: use the function `summary()` to display the results of an R object of class `aov`, and the function `print()` otherwise.
59 | 
60 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
61 | diet.fisher = aov(weight.loss~diet.type,data=diet)
62 | diet.welch = oneway.test(weight.loss~diet.type,data=diet)
63 | diet.kruskal = kruskal.test(weight.loss~diet.type,data=diet)
64 | 
65 | summary(diet.fisher)
66 | print(diet.welch)
67 | print(diet.kruskal)
68 | ```
69 | 
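The three functions return objects of different classes, so their p-values live in different places. As a small added sketch (not part of the original practical, reusing the objects created in the chunk above), they can be collected for a side-by-side comparison:

```{r}
# collect the p-values of the three tests fitted above
c(fisher  = summary(diet.fisher)[[1]][["Pr(>F)"]][1],
  welch   = diet.welch$p.value,
  kruskal = diet.kruskal$p.value)
```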
70 | Note that, when the interest lies in the difference between two means, Fisher's ANOVA (function `aov()`) and Student's t-test (function `t.test()` with argument `var.equal` set to `TRUE`) lead to the same results.
71 | Let's check this by comparing the mean weight losses of *Diet A* and *Diet C*.
72 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
73 | summary(aov(weight.loss~diet.type,data=diet[diet$diet.type!="B",]))
74 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="B",],var.equal = TRUE)
75 | ```
76 | 
77 | 
78 | # Section 3: Model check
79 | 
80 | Let's first
81 | 
82 | * define the Fisher and Welch residuals by subtracting the mean of each group from the weight loss of the corresponding participants,
83 | * define the Kruskal-Wallis residuals by subtracting the median of each group from the weight loss of the corresponding participants.
84 | 
85 | The mean or median of each group may be obtained by means of the function `tapply()`, which applies a function (like `mean` or `median`) to a vector separately for each level of a grouping factor.
86 | 
87 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
88 | # mean and median weight loss per group:
89 | mean_group = tapply(diet$weight.loss,diet$diet.type,mean)
90 | median_group = tapply(diet$weight.loss,diet$diet.type,median)
91 | mean_group
92 | median_group
93 | # residuals:
94 | diet$resid.mean = (diet$weight.loss - mean_group[as.numeric(diet$diet.type)])
95 | diet$resid.median = (diet$weight.loss - median_group[as.numeric(diet$diet.type)])
96 | diet[1:10,]
97 | ```
98 | 
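As an added cross-check (a sketch, not part of the original material): the mean-based residuals computed by hand above should agree, up to numerical precision, with the residuals extracted from the one-way fit `diet.fisher` via `residuals()`:

```{r}
# residuals() on the one-way aov fit reproduces the hand-computed mean residuals
all.equal(as.numeric(residuals(diet.fisher)), as.numeric(diet$resid.mean))
```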
99 | Then, let's
100 | 
101 | * display a boxplot of the residuals per group to assess whether (i) the variances per group are similar and (ii) normality of the residuals per group seems credible,
102 | * display a QQ-plot of the residuals of the mean model to assess whether normality of the residuals seems credible.
103 | 
104 | 
105 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
106 | par(mfrow=c(1,2),mar=c(4.5,4.5,2,0))
107 | #
108 | boxplot(resid.mean~diet.type,data=diet,main="Residual boxplot per group",col="light gray",xlab="Diet type",ylab="Residuals")
109 | abline(h=0,col="blue")
110 | #
111 | col_group = rainbow(nlevels(diet$diet.type))
112 | qqnorm(diet$resid.mean,col=col_group[as.numeric(diet$diet.type)])
113 | qqline(diet$resid.mean)
114 | legend("top",legend=levels(diet$diet.type),col=col_group,pch=21,ncol=3,box.lwd=NA)
115 | ```
116 | 
117 | Finally, let's
118 | 
119 | * perform Shapiro's test to assess whether there is enough evidence that the residuals are not normally distributed (by means of the function `shapiro.test()`),
120 | * perform Bartlett's test to assess whether there is enough evidence that the variances of the residuals differ between groups (by means of the function `bartlett.test()`).
121 | 
122 | 
123 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
124 | shapiro.test(diet$resid.mean)
125 | bartlett.test(diet$resid.mean~as.numeric(diet$diet.type))
126 | ```
127 | 
128 | 
129 | # Section 4: Multiple comparisons
130 | 
131 | Let's
132 | 
133 | * perform a Tukey HSD test to identify which group pair(s) have different means (by means of the function `TukeyHSD()`),
134 | * compare the size of the Tukey HSD confidence interval for the difference in mean weight loss between *Diet A* and *Diet B* with the one obtained by means of a Student's t-test (function `t.test()` with argument `var.equal` set to `TRUE`).
135 | 
136 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
137 | plot(TukeyHSD(diet.fisher))
138 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="C",],var.equal = TRUE)
139 | ```
140 | 
141 | # Section 5: Two-way ANOVA
142 | 
143 | Let's
144 | 
145 | * perform a two-way ANOVA to assess whether the mean weight loss differs across the levels of the factors _Diet_ and/or _Gender_,
146 | * compare the output of the function `aov()` with that of the function `lm()`.
147 | 
148 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
149 | diet.fisher = aov(weight.loss~diet.type*gender,data=diet)
150 | summary(diet.fisher)
151 | 
152 | anova(lm(weight.loss~diet.type*gender,data=diet))
153 | ```
154 | 
155 | 
156 | # Section 6: Practicals
157 | 
158 | Analyse the following two datasets using a suitable analysis:
159 | 
160 | ## (i) *amess.csv*
161 | The data for this exercise are to be found in *amess.csv*. The data are the red cell folate levels in three groups of cardiac bypass patients given different levels of nitrous oxide (N2O) and oxygen (O2) ventilation. (There is a reference to the source of these data in Altman, Practical Statistics for Medical Research, p. 208.)
162 | The treatments are
163 | 
164 | * 50% N2O and 50% O2 continuously for 24 hours
165 | * 50% N2O and 50% O2 during the operation
166 | * No N2O but 35-50% O2 continuously for 24 hours
167 | 
168 | 
169 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
170 | amess = read.csv("data/amess.csv")
171 | amess$treatmnt = as.factor(amess$treatmnt)
172 | boxplot(folate~treatmnt,data=amess,col="light gray")
173 | 
174 | # residuals:
175 | mean_treatmnt = tapply(amess$folate,amess$treatmnt,mean)
176 | amess$resid.mean = (amess$folate - mean_treatmnt[as.numeric(amess$treatmnt)])
177 | 
178 | #
179 | bartlett.test(amess$resid.mean~as.numeric(amess$treatmnt))
180 | 
181 | par(mfrow=c(1,2),mar=c(4.5,4.5,2,0))
182 | #
183 | boxplot(resid.mean~treatmnt,data=amess,main="Residual boxplot per group",col="light gray",xlab="Treatment type",ylab="Residuals")
184 | abline(h=0,col="blue")
185 | #
186 | col_group = rainbow(nlevels(amess$treatmnt))
187 | qqnorm(amess$resid.mean,col=col_group[as.numeric(amess$treatmnt)])
188 | qqline(amess$resid.mean)
189 | legend("top",legend=levels(amess$treatmnt),col=col_group,pch=21,ncol=3,box.lwd=NA)
190 | 
191 | 
192 | # Welch's and Fisher's one-way ANOVA:
193 | amess.welch = oneway.test(folate~treatmnt,data=amess)
194 | print(amess.welch)
195 | 
196 | amess.aov = aov(folate~treatmnt,data=amess)
197 | summary(amess.aov)
198 | 
199 | ```
200 | 
201 | 
202 | 
203 | 
204 | 
205 | 
206 | ## (ii) *globalBreastCancerRisk.csv*
207 | 
208 | The file *globalBreastCancerRisk.csv* gives the number of new cases of breast cancer (per population of 10,000) in various countries around the world, along with various health and lifestyle risk factors.
209 | 210 | Let’s suppose we are initially interested in whether the number of breast cancer cases is significantly different in different regions of the world. 211 | 212 | Visualise the distribution of breast cancer incidence in each continent. Check how many observations belong to each group (continent). Are there any groups that you would consider removing/grouping before performing the analysis ? 213 | 214 | 215 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 216 | breastcancer = read.csv("data/globalBreastCancerRisk.csv",row.names=1) 217 | breastcancer$continent = factor(breastcancer$continent) 218 | 219 | boxplot(NewCasesOfBreastCancerIn2002~continent,data=breastcancer,col="light gray") 220 | table(breastcancer$continent) 221 | breastcancer$continent2 = as.character(breastcancer$continent) 222 | breastcancer$continent2[breastcancer$continent2=="Oceania"] = "Asia" 223 | breastcancer$continent2 = factor(breastcancer$continent2) 224 | table(breastcancer$continent2) 225 | 226 | par(mfrow=c(1,2)) 227 | boxplot(NewCasesOfBreastCancerIn2002~continent2,data=breastcancer,col="light gray", 228 | main="original scale") 229 | boxplot(log(NewCasesOfBreastCancerIn2002)~continent2,data=breastcancer, 230 | main="log scale",col="light gray") 231 | 232 | 233 | # resid: 234 | mean_continent2 = tapply(breastcancer$NewCasesOfBreastCancerIn2002,breastcancer$continent2,mean,na.rm=TRUE) 235 | breastcancer$resid.mean = (breastcancer$NewCasesOfBreastCancerIn2002 - mean_continent2[as.numeric(breastcancer$continent2)]) 236 | 237 | bartlett.test(breastcancer$resid.mean~as.numeric(breastcancer$continent2)) 238 | shapiro.test(breastcancer$resid.mean) 239 | # clear red flags ! 240 | 241 | par(mfrow=c(1,2),mar=c(4.5,4.5,2,0)) 242 | # 243 | boxplot(resid.mean~continent2,data=breastcancer,main="Residual boxplot per group",col="light gray",xlab="Treatment type",ylab="Residuals") 244 | abline(h=0,col="blue") 245 | # 246 | col_group = rainbow(nlevels(breastcancer$continent2)) 247 | qqnorm(breastcancer$resid.mean,col=col_group[as.numeric(breastcancer$continent2)]) 248 | qqline(breastcancer$resid.mean) 249 | legend("top",legend=levels(breastcancer$continent2),col=col_group,pch=21,ncol=3,box.lwd=NA) 250 | 251 | 252 | # kruskal-wallis: 253 | breastcancer.kruskal = kruskal.test(NewCasesOfBreastCancerIn2002~continent2,data=breastcancer) 254 | breastcancer.kruskal 255 | ``` 256 | 257 | 258 | 259 | 260 | 261 | 262 | -------------------------------------------------------------------------------- /anova.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | --- 4 | title: "ANOVA with R: analysis of the *diet* dataset" 5 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri" 6 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 7 | output: 8 | html_document: 9 | theme: united 10 | highlight: tango 11 | code_folding: show 12 | toc: true 13 | toc_depth: 2 14 | toc_float: true 15 | fig_width: 8 16 | fig_height: 6 17 | --- 18 | 19 | 20 | 21 | 22 | 23 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 24 | # change working directory: should be the directory containg the Markdown files: 25 | # setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/20210514/Practicals/") 26 | 27 | ``` 28 | 29 | 30 | A full version of the dataset *diet* may be found online on the U. of Sheffield website . 31 | 32 | A slightly modified version is available in the data file is stored under data/diet.csv. 
The data set contains information on 76 people who undertook one of three diets (referred to as diet _A_, _B_ and _C_). There is background information such as age, gender, and height. The aim of the study was to see which diet was best for losing weight.
33 | 
34 | 
35 | # Section 1: Data import and descriptive analysis
36 | 
37 | Let's start by
38 | 
39 | * importing the data set *diet* with the function `read.csv()`
40 | * defining a new column *weight.loss*, corresponding to the difference between the initial and final weights (columns `initial.weight` and `final.weight` of the dataset)
41 | * displaying _weight loss_ per _diet type_ (column `diet.type`) by means of a boxplot.
42 | 
43 | 
44 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
45 | diet = read.csv("data/diet.csv",row.names=1)
46 | diet$weight.loss = diet$initial.weight - diet$final.weight
47 | diet$diet.type = factor(diet$diet.type,levels=c("A","B","C"))
48 | diet$gender = factor(diet$gender,levels=c("Female","Male"))
49 | boxplot(weight.loss~diet.type,data=diet,col="light gray",
50 |         ylab = "Weight loss (kg)", xlab = "Diet type")
51 | abline(h=0,col="blue")
52 | ```
53 | 
54 | # Section 2: ANOVA
55 | 
56 | Let's
57 | 
58 | * perform Fisher's, Welch's and Kruskal-Wallis one-way ANOVA, respectively by means of the functions `aov()`, `oneway.test()` and `kruskal.test()`,
59 | * display and analyse the results: use the function `summary()` to display the results of an R object of class `aov`, and the function `print()` otherwise.
60 | 
61 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
62 | diet.fisher = aov(weight.loss~diet.type,data=diet)
63 | diet.welch = oneway.test(weight.loss~diet.type,data=diet)
64 | diet.kruskal = kruskal.test(weight.loss~diet.type,data=diet)
65 | 
66 | summary(diet.fisher)
67 | print(diet.welch)
68 | print(diet.kruskal)
69 | ```
70 | 
71 | Note that, when the interest lies in the difference between two means, Fisher's ANOVA (function `aov()`) and Student's t-test (function `t.test()` with argument `var.equal` set to `TRUE`) lead to the same results.
72 | Let's check this by comparing the mean weight losses of *Diet A* and *Diet C*.
73 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
74 | summary(aov(weight.loss~diet.type,data=diet[diet$diet.type!="B",]))
75 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="B",],var.equal = TRUE)
76 | ```
77 | 
78 | 
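As an extra sanity check (a sketch added here with hypothetical object names, not part of the original material): for a two-group comparison the one-way ANOVA F statistic is simply the square of the pooled-variance t statistic, so the two tests give identical p-values.

```{r}
# F statistic of the two-group ANOVA equals the squared pooled-variance t statistic
AC       <- diet[diet$diet.type != "B", ]
fit.AC   <- aov(weight.loss ~ diet.type, data = AC)
ttest.AC <- t.test(weight.loss ~ diet.type, data = AC, var.equal = TRUE)
c(F = summary(fit.AC)[[1]][["F value"]][1], t.squared = unname(ttest.AC$statistic)^2)
```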
79 | # Section 3: Model check
80 | 
81 | Let's first
82 | 
83 | * define the Fisher and Welch residuals by subtracting the mean of each group from the weight loss of the corresponding participants,
84 | * define the Kruskal-Wallis residuals by subtracting the median of each group from the weight loss of the corresponding participants.
85 | 
86 | The mean or median of each group may be obtained by means of the function `tapply()`, which applies a function (like `mean` or `median`) to a vector separately for each level of a grouping factor.
87 | 
88 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
89 | # mean and median weight loss per group:
90 | mean_group = tapply(diet$weight.loss,diet$diet.type,mean)
91 | median_group = tapply(diet$weight.loss,diet$diet.type,median)
92 | mean_group
93 | median_group
94 | # residuals:
95 | diet$resid.mean = (diet$weight.loss - mean_group[as.numeric(diet$diet.type)])
96 | diet$resid.median = (diet$weight.loss - median_group[as.numeric(diet$diet.type)])
97 | diet[1:10,]
98 | ```
99 | 
100 | Then, let's
101 | 
102 | * display a boxplot of the residuals per group to assess whether (i) the variances per group are similar and (ii) normality of the residuals per group seems credible,
103 | * display a QQ-plot of the residuals of the mean model to assess whether normality of the residuals seems credible.
104 | 
105 | 
106 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
107 | par(mfrow=c(1,2),mar=c(4.5,4.5,2,0))
108 | #
109 | boxplot(resid.mean~diet.type,data=diet,main="Residual boxplot per group",col="light gray",xlab="Diet type",ylab="Residuals")
110 | abline(h=0,col="blue")
111 | #
112 | col_group = rainbow(nlevels(diet$diet.type))
113 | qqnorm(diet$resid.mean,col=col_group[as.numeric(diet$diet.type)])
114 | qqline(diet$resid.mean)
115 | legend("top",legend=levels(diet$diet.type),col=col_group,pch=21,ncol=3,box.lwd=NA)
116 | ```
117 | 
118 | Finally, let's
119 | 
120 | * perform Shapiro's test to assess whether there is enough evidence that the residuals are not normally distributed (by means of the function `shapiro.test()`),
121 | * perform Bartlett's test to assess whether there is enough evidence that the variances of the residuals differ between groups (by means of the function `bartlett.test()`).
122 | 
123 | 
124 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
125 | shapiro.test(diet$resid.mean)
126 | bartlett.test(diet$resid.mean~as.numeric(diet$diet.type))
127 | ```
128 | 
129 | 
130 | # Section 4: Multiple comparisons
131 | 
132 | Let's
133 | 
134 | * perform a Tukey HSD test to identify which group pair(s) have different means (by means of the function `TukeyHSD()`),
135 | * compare the size of the Tukey HSD confidence interval for the difference in mean weight loss between *Diet A* and *Diet B* with the one obtained by means of a Student's t-test (function `t.test()` with argument `var.equal` set to `TRUE`).
136 | 
137 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
138 | plot(TukeyHSD(diet.fisher))
139 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="C",],var.equal = TRUE)
140 | ```
141 | 
142 | # Section 5: Two-way ANOVA
143 | 
144 | Let's
145 | 
146 | * perform a two-way ANOVA to assess whether the mean weight loss differs across the levels of the factors _Diet_ and/or _Gender_ (see the note on the `*` formula shorthand below),
147 | * compare the output of the function `aov()` with that of the function `lm()`.
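One point worth making explicit before the chunk that follows (an added note with hypothetical object names, not part of the original material): in R model formulas `a*b` is shorthand for `a + b + a:b`, i.e. both main effects plus their interaction, so the two specifications fit exactly the same model.

```{r}
# diet.type*gender is shorthand for diet.type + gender + diet.type:gender
fit.star <- aov(weight.loss ~ diet.type*gender, data = diet)
fit.full <- aov(weight.loss ~ diet.type + gender + diet.type:gender, data = diet)
all.equal(coef(fit.star), coef(fit.full))
```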
148 | 149 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 150 | diet.fisher = aov(weight.loss~diet.type*gender,data=diet) 151 | summary(diet.fisher) 152 | 153 | anova(lm(weight.loss~diet.type*gender,data=diet)) 154 | ``` 155 | 156 | 157 | # Section 5: Practicals 158 | 159 | Analyse the two following datasets with the suitable analysis: 160 | 161 | ## (i) *amess.csv* 162 | The data for this exercise are to be found in *amess.csv*. The data are the red cell folate levels in three groups of cardiac bypass patients given different levels of nitrous oxide (N2O) and oxygen (O2) ventilation. (There is a reference to the source of this data in Altman, Practical Statistics for Medical Research, p. 208.) 163 | The treatments are 164 | 165 | * 50% N2O and 50% O2 continuously for 24 hours 166 | * 50% N2O and 50% O2 during the operation 167 | * No N2O but 35-50% O2 continuously for 24 hours 168 | 169 | ## (ii) *globalBreastCancerRisk.csv* 170 | 171 | The file *globalBreastCancerRisk.csv* gives the number of new cases of Breast Cancer (per population of 10,000) in various countries around the world, along with various health and lifestyle risk factors. 172 | 173 | Let’s suppose we are initially interested in whether the number of breast cancer cases is significantly different in different regions of the world. 174 | 175 | Visualise the distribution of breast cancer incidence in each continent. Check how many observations belong to each group (continent). Are there any groups that you would consider removing/grouping before performing the analysis ? 176 | 177 | 178 | 179 | 180 | 181 | -------------------------------------------------------------------------------- /anova.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/anova.pdf -------------------------------------------------------------------------------- /cheat_sheet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/cheat_sheet.pdf -------------------------------------------------------------------------------- /conclusion.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/conclusion.pdf -------------------------------------------------------------------------------- /data/Assay.txt: -------------------------------------------------------------------------------- 1 | micrograms Optical Density 2 | 1.0 0.040 3 | 2.0 0.059 4 | 3.0 0.083 5 | 4.0 0.102 6 | 5.0 0.123 7 | 6.0 0.139 8 | 7.0 0.160 9 | Unknown 1 0.067 10 | Unknown 2 0.073 11 | Unknown 3 0.098 -------------------------------------------------------------------------------- /data/Bronchitis.csv: -------------------------------------------------------------------------------- 1 | bron,cigs,poll 2 | 0,5.15,67.1 3 | 1,0,66.9 4 | 0,2.5,66.7 5 | 0,1.75,65.8 6 | 0,6.75,64.4 7 | 0,0,64.4 8 | 1,0,65.1 9 | 1,9.5,66.2 10 | 0,0,65.9 11 | 0,0.75,67.1 12 | 0,5.25,67.9 13 | 1,8,68.1 14 | 1,5.15,67 15 | 1,30,66.3 16 | 0,0,65.7 17 | 0,0,65.2 18 | 0,5.25,64.2 19 | 0,10.05,64.6 20 | 0,0,63.5 21 | 1,3.4,63 22 | 0,0,62.7 23 | 0,0.55,62.7 24 | 1,9.5,62.1 25 | 1,12.5,63.7 26 | 0,0,63.1 27 | 0,3.4,63 28 | 0,2.2,62.7 29 | 
0,6.7,63.1 30 | 0,1.1,62.4 31 | 0,1.8,64.4 32 | 0,0,64.2 33 | 1,3.6,64.2 34 | 0,1.6,63 35 | 0,6.2,62.2 36 | 0,14.75,62.3 37 | 0,0.35,63.7 38 | 1,13.75,63.8 39 | 0,0,63.1 40 | 1,7.5,62.7 41 | 0,1,62.9 42 | 0,0,62.5 43 | 1,14.8,61.7 44 | 1,3.5,61.6 45 | 0,0,61.6 46 | 0,0,61.4 47 | 0,0.25,61.4 48 | 0,1.55,62 49 | 1,0,61.8 50 | 0,0,60.9 51 | 0,5.9,60.8 52 | 0,16.45,60.6 53 | 0,2.65,62.9 54 | 1,12.5,62.6 55 | 0,0,62.1 56 | 0,14.55,61.7 57 | 1,11,61 58 | 1,6.75,62.7 59 | 0,0,62.7 60 | 1,0,61.7 61 | 0,1.75,60.9 62 | 0,2.4,60.6 63 | 0,10.05,60.4 64 | 1,12.75,61.7 65 | 0,0,61.9 66 | 0,5,61.3 67 | 0,0.6,60.7 68 | 0,0,60.8 69 | 0,0.85,60.5 70 | 0,0.9,59.7 71 | 0,0,59.5 72 | 1,8.75,59.6 73 | 0,0.8,59.1 74 | 1,6.6,59.4 75 | 0,1,58.5 76 | 0,0,60 77 | 1,8.15,59.8 78 | 0,0,59.7 79 | 1,5,59.4 80 | 0,2.55,59.2 81 | 0,1.2,58.6 82 | 0,0,60.8 83 | 1,11.25,60.4 84 | 0,0,60.2 85 | 0,2,60 86 | 0,1.9,59.4 87 | 0,0.45,59.8 88 | 1,0,59.7 89 | 0,0,59 90 | 1,6.9,59 91 | 0,2.35,58.6 92 | 0,3.95,59.7 93 | 0,0.6,59.6 94 | 1,15,59.4 95 | 0,0,59.4 96 | 0,0.95,59.4 97 | 0,0,59.3 98 | 0,1.4,54.2 99 | 0,0.5,54 100 | 0,0.6,53.8 101 | 0,0,53.7 102 | 0,2.45,53.7 103 | 0,1.75,53.1 104 | 0,0,54.4 105 | 0,3.1,54.2 106 | 0,10.05,53.9 107 | 0,0.55,53.2 108 | 0,0.85,53.2 109 | 0,1.1,54.9 110 | 0,0,54.9 111 | 0,0,54.5 112 | 0,1.45,54.2 113 | 0,2.05,54.2 114 | 1,10.5,54 115 | 0,0.5,55.8 116 | 1,9.2,55.5 117 | 0,0.55,55.6 118 | 0,0,55.5 119 | 0,0.96,54.9 120 | 0,1,54.6 121 | 0,0,56.9 122 | 0,5.25,56.4 123 | 1,0,55.9 124 | 0,9,55.8 125 | 0,1.6,55.6 126 | 1,10.9,57.6 127 | 0,0,57.7 128 | 0,0,57.6 129 | 0,2.25,57.8 130 | 0,2.65,57.8 131 | 0,0.55,58.4 132 | 0,0,58.2 133 | 1,4.5,58 134 | 0,15,58.1 135 | 0,0,57.9 136 | 0,0,57.3 137 | 0,4.2,58.3 138 | 0,0.55,58.1 139 | 1,10,57.9 140 | 0,0,57.6 141 | 0,7.1,57.3 142 | 0,3.2,57.1 143 | 1,0,58.9 144 | 1,6.8,58.6 145 | 0,0,58.7 146 | 0,0,57.5 147 | 0,2.35,57.2 148 | 0,24.9,58 149 | 0,2.65,57.9 150 | 1,3.7,57.2 151 | 0,17.1,57.3 152 | 0,0,57.5 153 | 0,0.95,57.2 154 | 0,10.05,53.1 155 | 0,1.15,53 156 | 1,18.25,53 157 | 0,10,52.9 158 | 0,0.75,52.6 159 | 0,0,53.1 160 | 0,4.2,53 161 | 0,0.8,52.9 162 | 0,0.55,52.7 163 | 0,0.95,52.6 164 | 0,0,52.1 165 | 0,3.1,54.1 166 | 0,0.8,53.7 167 | 0,1.55,53.1 168 | 0,0.4,53.3 169 | 0,6.2,53 170 | 0,0.6,53 171 | 0,0.4,53.9 172 | 1,7.5,53.7 173 | 0,7.15,53.4 174 | 0,0.25,53.2 175 | 0,3.6,53.4 176 | 0,0.95,53.2 177 | 0,2.8,54.9 178 | 1,20.25,54.9 179 | 0,0.95,54.6 180 | 0,4.25,54.1 181 | 0,4.15,54.2 182 | 0,10,57.4 183 | 0,3.4,57.3 184 | 0,0,57.3 185 | 0,3.6,56.7 186 | 0,0.9,56.5 187 | 0,0,56.8 188 | 0,0,56.6 189 | 1,6.4,56.5 190 | 0,0.95,56.3 191 | 0,1.06,56.3 192 | 0,13.3,56.2 193 | 0,1.1,56.6 194 | 0,17.2,55.9 195 | 0,1.65,56 196 | 1,5,55.8 197 | 0,2.1,55.7 198 | 0,0.6,57 199 | 1,8.25,56.7 200 | 0,0.9,56.4 201 | 0,0,56.5 202 | 1,12.3,55.2 203 | 0,1.15,56.9 204 | 0,2.2,56.7 205 | 0,3.6,56 206 | 1,10,55.5 207 | 0,0.6,55.3 208 | 0,9.5,56.5 209 | 0,0.7,56.3 210 | 1,9,56.1 211 | 0,0,55.9 212 | 0,0.5,55.5 213 | 0,0.9,55.4 214 | -------------------------------------------------------------------------------- /data/OscillationIndex.txt: -------------------------------------------------------------------------------- 1 | Index 2 | -8.3 3 | -4.1 4 | -4.6 5 | 1.8 6 | -11.8 7 | 4.2 8 | -2.7 9 | 2.8 10 | 6.2 11 | -3.9 12 | 3.3 13 | -0.3 14 | 2.3 15 | 4.1 16 | 6.8 17 | 6.9 18 | 6 19 | -0.7 20 | -0.7 21 | 5 22 | -6.3 23 | 8.6 24 | 2.7 25 | -21 26 | -5.9 27 | 4.7 28 | 12.5 29 | -3.8 30 | 6 31 | -5.7 32 | 9.8 33 | 2.1 34 | -5.6 35 | -2.5 36 | -0.4 37 | 2.1 38 | 6.4 39 | 7.9 40 | 3.6 
41 | -5.3 42 | -2.7 43 | -0.2 44 | 0.7 45 | 19.6 46 | 4.7 47 | -1.8 48 | 3.9 49 | -8.2 50 | 2.9 51 | 0.3 52 | -13.5 53 | -0.7 54 | 8.8 55 | -6.2 56 | 4.5 57 | 1.3 58 | 0.3 59 | 2.4 60 | -5.2 61 | 3.3 62 | 1.1 63 | -2.2 64 | -2.1 65 | 5.4 66 | 6.9 67 | 2.7 68 | -4.1 69 | 2.8 70 | 12.8 71 | 14.9 72 | 17.2 73 | 12.4 74 | 7.6 75 | 13.6 76 | 1.7 77 | 12.5 78 | 16.5 79 | 7.2 80 | 9.4 81 | 7.9 82 | -0.4 83 | -1.8 84 | 7.5 85 | -0.3 86 | -8.8 87 | -14.8 88 | -7.8 89 | -9.9 90 | -0.8 91 | -5.2 92 | -10.4 93 | -8.9 94 | -13 95 | -17.2 96 | -14.3 97 | -17.4 98 | -18.8 99 | -18.6 100 | -6.5 101 | -30.7 102 | -10.3 103 | -17 104 | -10.4 105 | -10.4 106 | -5.6 107 | -13 108 | -19.1 109 | -18 110 | -7.7 111 | -20.5 112 | -9.1 113 | -9.9 114 | -13.7 115 | -4.7 116 | -6 117 | -5.2 118 | 5.5 119 | 6.5 120 | -1 121 | 3.9 122 | 8.7 123 | 9.2 124 | -4 125 | 12.5 126 | 8.8 127 | 10.1 128 | 2.6 129 | 11.6 130 | 3.3 131 | -7.4 132 | 2.7 133 | 7.5 134 | 5.8 135 | 9.8 136 | 3.6 137 | -9.9 138 | -8.9 139 | 3.2 140 | 4.1 141 | -5.2 142 | -0.4 143 | -3.9 144 | -8.2 145 | 3.3 146 | 2.9 147 | -8.5 148 | -6.5 149 | 2.9 150 | 4.5 151 | 5.7 152 | 10.8 153 | -6.7 154 | 0.3 155 | 6.5 156 | 3.3 157 | 11.2 158 | 8.7 159 | 2.9 160 | -3.4 161 | 5.4 162 | -3.1 163 | 3.7 164 | -2.7 165 | -8.9 166 | -10 167 | -8.8 168 | -9.5 169 | -4 170 | -15.3 171 | -12.3 172 | -1.5 173 | -6.8 174 | -5.5 175 | -5.2 176 | 9.4 177 | -4.5 178 | -12.2 179 | 1.7 180 | 8.7 181 | 6.9 182 | 11.7 183 | -1.6 184 | 8.7 185 | 3.9 186 | -3.6 187 | -3.7 188 | -4.6 189 | 2.1 190 | 4 191 | -4.6 192 | 0.8 193 | -4 194 | -7.1 195 | 6.6 196 | 4.2 197 | -6.8 198 | -7.9 199 | 1.2 200 | 4.1 201 | 0.6 202 | -4.8 203 | -10.9 204 | -1.6 205 | -4 206 | 2.3 207 | 6 208 | -5.9 209 | 6.4 210 | 4.5 211 | 17 212 | 14.6 213 | 13.8 214 | 7.7 215 | 22.6 216 | 19.6 217 | 11.8 218 | 7 219 | 18 220 | 11.8 221 | 21.7 222 | 12.7 223 | 5.7 224 | -5.5 225 | -7.4 226 | -11.5 227 | -1.8 228 | -12.5 229 | -5.2 230 | -11.2 231 | -12.3 232 | -8.5 233 | -8.3 234 | -8.9 235 | -8.1 236 | 0.2 237 | -6.7 238 | 7.7 239 | 5.8 240 | 4.5 241 | -2.2 242 | -1.8 243 | 3.5 244 | 0.4 245 | -12.9 246 | 1.6 247 | -7.1 248 | -6 249 | -0.8 250 | -25.5 251 | -2.5 252 | -1 253 | -16.1 254 | -13 255 | -0.3 256 | -2.7 257 | -5.8 258 | 5 259 | -5.2 260 | -2.2 261 | 5 262 | 4 263 | -2.5 264 | 3.3 265 | 9.4 266 | 2.3 267 | 2.2 268 | 2.3 269 | 11.5 270 | -5.5 271 | 14.6 272 | 1.2 273 | -5.2 274 | 11.4 275 | 12.8 276 | 16.6 277 | 13.6 278 | 14.6 279 | 16.7 280 | 15 281 | 7.9 282 | 10.8 283 | 12.1 284 | 7.4 285 | 8.7 286 | 16.5 287 | 10 288 | 11.1 289 | 10.6 290 | 1.1 291 | 19.9 292 | 2.3 293 | 8.5 294 | 4.5 295 | -3.2 296 | -2.7 297 | -0.1 298 | -11.5 299 | -1.8 300 | 1.4 301 | -8.2 302 | -9.4 303 | -0.3 304 | -11 305 | -4.3 306 | -17.5 307 | -7.1 308 | -2.2 309 | 1.3 310 | -9.3 311 | -0.4 312 | 3.3 313 | 7.5 314 | -3 315 | -0.3 316 | -4.6 317 | -7.3 318 | -8.9 319 | -15 320 | 7 321 | 4.3 322 | 4 323 | -5.3 324 | -4 325 | -4 326 | 0.5 327 | 4.7 328 | 11.2 329 | 6.9 330 | 0.2 331 | -1.7 332 | 4.5 333 | 7.2 334 | 4.7 335 | -2.5 336 | 4.5 337 | 6.3 338 | 7.6 339 | 0.3 340 | 6.8 341 | 5.9 342 | -3.1 343 | 5.7 344 | -20.5 345 | 7.9 346 | 1.8 347 | -2.5 348 | -0.4 349 | -0.3 350 | 1.1 351 | -4.7 352 | 6.8 353 | 12.5 354 | 16.5 355 | -5.2 356 | -3.1 357 | -0.8 358 | 12.1 359 | 5.1 360 | -0.4 361 | 4.5 362 | 5.2 363 | 10.4 364 | 4.2 365 | 0.3 366 | 8.4 367 | 2.7 368 | 5.5 369 | 7.2 370 | 2.5 371 | -10.2 372 | -2.2 373 | -2.8 374 | -5.9 375 | -14.8 376 | -9.1 377 | -12.9 378 | -4.1 379 | -2.2 380 | 5.5 381 | 1.3 382 | 6.9 383 
| 5.8 384 | 5.1 385 | 14.2 386 | 14 387 | 14.2 388 | 2.3 389 | -4.3 390 | -4.6 391 | 1.2 392 | 2.1 393 | -10.4 394 | -0.4 395 | -10.9 396 | -21 397 | -10.1 398 | -13.5 399 | -11 400 | -16.7 401 | 0.3 402 | -12.7 403 | -4.7 404 | -12.8 405 | -6 406 | -7.8 407 | 0.3 408 | -0.4 409 | 4.5 410 | -1.8 411 | -2.2 412 | 0.4 413 | -4.8 414 | 14.1 415 | 12.6 416 | 6.5 417 | -3.8 418 | -2.6 419 | 4.5 420 | 0.8 421 | 5.7 422 | 5.8 423 | -0.3 424 | -4.6 425 | -6.8 426 | 3.6 427 | 9.1 428 | -3.6 429 | -3 430 | 14.3 431 | 10 432 | 6.3 433 | 0.3 434 | -2.4 435 | -1.6 436 | -3.4 437 | 0.3 438 | -14.2 439 | -7.6 440 | -0.7 441 | -8.2 442 | -5.6 443 | -1.1 444 | -6.4 445 | -4 446 | -10 447 | -11.6 448 | -0.2 449 | 2.3 450 | -10.8 451 | -12.1 452 | 0.7 453 | -4.5 454 | 2.5 455 | 8.6 456 | -5.2 457 | 3.9 458 | 12.8 459 | 11 460 | 18.8 461 | 16.1 462 | 2.1 463 | 15.5 464 | 16.1 465 | 19.6 466 | 9.2 467 | 1.7 468 | 1.4 469 | 14.2 470 | 15.8 471 | 18.6 472 | 6.8 473 | 0.8 474 | 3.1 475 | 7.2 476 | 1.2 477 | -5.2 478 | -24 479 | -10.9 480 | -17.3 481 | -8.2 482 | -14.1 483 | -11 484 | -3.4 485 | -13.4 486 | -3.6 487 | -15 488 | -0.3 489 | -2.3 490 | 3.3 491 | 10 492 | 5.7 493 | 11.8 494 | 13.4 495 | 10.4 496 | 31.5 497 | 15.6 498 | 20.3 499 | 16 500 | 17 501 | 9.4 502 | 10.6 503 | 1.7 504 | 11.1 505 | 6.3 506 | 12.2 507 | 9.2 508 | -1.5 509 | 0.3 510 | -6 511 | 4.7 512 | 9.4 513 | 12.3 514 | 6.2 515 | 12.8 516 | 19.6 517 | 19.7 518 | 22.2 519 | 18.6 520 | 13.1 521 | 17.6 522 | 11.2 523 | 12.6 524 | 10.8 525 | 0.6 526 | 2.5 527 | 0.3 528 | -11.9 529 | -11.3 530 | -12.4 531 | 3.5 532 | 9.3 533 | -20 534 | -4.1 535 | 8.6 536 | -9.4 537 | -8.2 538 | -9.3 539 | -15.8 540 | -13.7 541 | -11.3 542 | -8.8 543 | -12.9 544 | -14.2 545 | -11.4 546 | -3.6 547 | -26.9 548 | -6 549 | -7.4 550 | 15.8 551 | 4.5 552 | 5.1 553 | 2.1 554 | 1.1 555 | -5.3 556 | -2.1 557 | -2.2 558 | -4.6 559 | 6.2 560 | -3.6 561 | -5.2 562 | 4 563 | 4.5 564 | 13.6 565 | -4.6 566 | 1.7 567 | -2.2 568 | -4.6 569 | -8.3 570 | 2.6 571 | 0.3 572 | -8.4 573 | -11.8 574 | -2.6 575 | -3.9 576 | -1.6 577 | 1.5 578 | -4.7 579 | -0.9 580 | -3.4 581 | -2.2 582 | 2.1 583 | -4.2 584 | -15.6 585 | -5.2 586 | 8.4 587 | 12.1 588 | 8.1 589 | 5.1 590 | 6.4 591 | -5.3 592 | 2.3 593 | 3.4 594 | 8.8 595 | -0.2 596 | 0.7 597 | -2.3 598 | -7.1 599 | -17.2 600 | -17.9 601 | -22.2 602 | -20 603 | -20.5 604 | -30 605 | -22.6 606 | -31.4 607 | -35.7 608 | -25.7 609 | -15.5 610 | 5.5 611 | -3.2 612 | -7 613 | 0.9 614 | 9.9 615 | 4.7 616 | -0.8 617 | -1.2 618 | 0.7 619 | 5.2 620 | -6.5 621 | 1.3 622 | 0.3 623 | -8.1 624 | 0.8 625 | 2.1 626 | 2.3 627 | -4.7 628 | 3.6 629 | -2.7 630 | -4.6 631 | 6.2 632 | -2.7 633 | 12.3 634 | 3.3 635 | -8.8 636 | -2.2 637 | 8.2 638 | 0.5 639 | -5.3 640 | -1.5 641 | 0.8 642 | 7.4 643 | -12.1 644 | -0.3 645 | 0.6 646 | -5.6 647 | 8.6 648 | 2 649 | -7 650 | -4.7 651 | 6.6 652 | -13.5 653 | -15 654 | -7 655 | -14 656 | -15.6 657 | -22.1 658 | -19.6 659 | -17.9 660 | -17.3 661 | -13.1 662 | -10.6 663 | -5.3 664 | -1.5 665 | -5.8 666 | -1.7 667 | -6.2 668 | 1.2 669 | -3 670 | 9.9 671 | -3.9 672 | 10.5 673 | 14.2 674 | 18.7 675 | 15.5 676 | 22 677 | 9.5 678 | 12.7 679 | 8.6 680 | 5.5 681 | 16.7 682 | 14.3 683 | 5.8 684 | 8.7 685 | -5.8 686 | 5.8 687 | 7.9 688 | -2.1 689 | -5.8 690 | -1.9 691 | -18.4 692 | -8.2 693 | -0.7 694 | 13.6 695 | 0 696 | 5.2 697 | -4.4 698 | -7.3 699 | -1.2 700 | -5 701 | -3.2 702 | 4.2 703 | -0.2 704 | -10.1 705 | -11.5 706 | -17.9 707 | -5.5 708 | -1.5 709 | -6.8 710 | -16.2 711 | -13.5 712 | -6.9 713 | -18.3 714 | -26 715 | 
-10.3 716 | -22.2 717 | -16.5 718 | 1.3 719 | -11.9 720 | -6.5 721 | 1.7 722 | 1.1 723 | -10.4 724 | -7 725 | -7 726 | -10 727 | -8 728 | -10 729 | -19 730 | -10 731 | -------------------------------------------------------------------------------- /data/amess.csv: -------------------------------------------------------------------------------- 1 | folate,treatmnt 2 | 243,1 3 | 251,1 4 | 275,1 5 | 291,1 6 | 347,1 7 | 354,1 8 | 380,1 9 | 392,1 10 | 206,2 11 | 210,2 12 | 226,2 13 | 249,2 14 | 255,2 15 | 273,2 16 | 285,2 17 | 295,2 18 | 309,2 19 | 241,3 20 | 258,3 21 | 270,3 22 | 293,3 23 | 328,3 24 | -------------------------------------------------------------------------------- /data/clinicalTrials.txt: -------------------------------------------------------------------------------- 1 | Cell.Count Drug.concentration 2 | 01-01 4.30 2882. 3 | 01-02 5.64 4155. 4 | 01-03 5.15 5286. 5 | 02-04 5.50 3765. 6 | 02-05 6.20 2978. 7 | 02-07 3.00 1638. 8 | 02-08 2.90 1631. 9 | 01-09 4.13 2684. 10 | 01-10 4.90 4475. 11 | 02-11 8.40 3540. 12 | 01-12 3.48 1755. 13 | 01-13 3.38 1595. 14 | 02-01 6.10 3259. 15 | 01-02 1.86 808. 16 | 01-04 8.72 3571. 17 | 01-05 4.32 1703. 18 | 02-06 6.80 3984. 19 | 01-07 4.08 2168. 20 | -------------------------------------------------------------------------------- /data/crab.csv: -------------------------------------------------------------------------------- 1 | "Obs","C","S","W","Wt","Sa" 2 | 1,2,3,28.3,3.05,8 3 | 2,3,3,26,2.6,4 4 | 3,3,3,25.6,2.15,0 5 | 4,4,2,21,1.85,0 6 | 5,2,3,29,3,1 7 | 6,1,2,25,2.3,3 8 | 7,4,3,26.2,1.3,0 9 | 8,2,3,24.9,2.1,0 10 | 9,2,1,25.7,2,8 11 | 10,2,3,27.5,3.15,6 12 | 11,1,1,26.1,2.8,5 13 | 12,3,3,28.9,2.8,4 14 | 13,2,1,30.3,3.6,3 15 | 14,2,3,22.9,1.6,4 16 | 15,3,3,26.2,2.3,3 17 | 16,3,3,24.5,2.05,5 18 | 17,2,3,30,3.05,8 19 | 18,2,3,26.2,2.4,3 20 | 19,2,3,25.4,2.25,6 21 | 20,2,3,25.4,2.25,4 22 | 21,4,3,27.5,2.9,0 23 | 22,4,3,27,2.25,3 24 | 23,2,2,24,1.7,0 25 | 24,2,1,28.7,3.2,0 26 | 25,3,3,26.5,1.97,1 27 | 26,2,3,24.5,1.6,1 28 | 27,3,3,27.3,2.9,1 29 | 28,2,3,26.5,2.3,4 30 | 29,2,3,25,2.1,2 31 | 30,3,3,22,1.4,0 32 | 31,1,1,30.2,3.28,2 33 | 32,2,2,25.4,2.3,0 34 | 33,2,1,24.9,2.3,6 35 | 34,4,3,25.8,2.25,10 36 | 35,3,3,27.2,2.4,5 37 | 36,2,3,30.5,3.32,3 38 | 37,4,3,25,2.1,8 39 | 38,2,3,30,3,9 40 | 39,2,1,22.9,1.6,0 41 | 40,2,3,23.9,1.85,2 42 | 41,2,3,26,2.28,3 43 | 42,2,3,25.8,2.2,0 44 | 43,3,3,29,3.28,4 45 | 44,1,1,26.5,2.35,0 46 | 45,3,3,22.5,1.55,0 47 | 46,2,3,23.8,2.1,0 48 | 47,3,3,24.3,2.15,0 49 | 48,2,1,26,2.3,14 50 | 49,4,3,24.7,2.2,0 51 | 50,2,1,22.5,1.6,1 52 | 51,2,3,28.7,3.15,3 53 | 52,1,1,29.3,3.2,4 54 | 53,2,1,26.7,2.7,5 55 | 54,4,3,23.4,1.9,0 56 | 55,1,1,27.7,2.5,6 57 | 56,2,3,28.2,2.6,6 58 | 57,4,3,24.7,2.1,5 59 | 58,2,1,25.7,2,5 60 | 59,2,1,27.8,2.75,0 61 | 60,3,1,27,2.45,3 62 | 61,2,3,29,3.2,10 63 | 62,3,3,25.6,2.8,7 64 | 63,3,3,24.2,1.9,0 65 | 64,3,3,25.7,1.2,0 66 | 65,3,3,23.1,1.65,0 67 | 66,2,3,28.5,3.05,0 68 | 67,2,1,29.7,3.85,5 69 | 68,3,3,23.1,1.55,0 70 | 69,3,3,24.5,2.2,1 71 | 70,2,3,27.5,2.55,1 72 | 71,2,3,26.3,2.4,1 73 | 72,2,3,27.8,3.25,3 74 | 73,2,3,31.9,3.33,2 75 | 74,2,3,25,2.4,5 76 | 75,3,3,26.2,2.22,0 77 | 76,3,3,28.4,3.2,3 78 | 77,1,2,24.5,1.95,6 79 | 78,2,3,27.9,3.05,7 80 | 79,2,2,25,2.25,6 81 | 80,3,3,29,2.92,3 82 | 81,2,1,31.7,3.73,4 83 | 82,2,3,27.6,2.85,4 84 | 83,4,3,24.5,1.9,0 85 | 84,3,3,23.8,1.8,0 86 | 85,2,3,28.2,3.05,8 87 | 86,3,3,24.1,1.8,0 88 | 87,1,1,28,2.62,0 89 | 88,1,1,26,2.3,9 90 | 89,3,2,24.7,1.9,0 91 | 90,2,3,25.8,2.65,0 92 | 91,1,1,27.1,2.95,8 93 | 92,2,3,27.4,2.7,5 94 | 93,3,3,26.7,2.6,2 95 | 
94,2,1,26.8,2.7,5 96 | 95,1,3,25.8,2.6,0 97 | 96,4,3,23.7,1.85,0 98 | 97,2,3,27.9,2.8,6 99 | 98,2,1,30,3.3,5 100 | 99,2,3,25,2.1,4 101 | 100,2,3,27.7,2.9,5 102 | 101,2,3,28.3,3,15 103 | 102,4,3,25.5,2.25,0 104 | 103,2,3,26,2.15,5 105 | 104,2,3,26.2,2.4,0 106 | 105,3,3,23,1.65,1 107 | 106,2,2,22.9,1.6,0 108 | 107,2,3,25.1,2.1,5 109 | 108,3,1,25.9,2.55,4 110 | 109,4,1,25.5,2.75,0 111 | 110,2,1,26.8,2.55,0 112 | 111,2,1,29,2.8,1 113 | 112,3,3,28.5,3,1 114 | 113,2,2,24.7,2.55,4 115 | 114,2,3,29,3.1,1 116 | 115,2,3,27,2.5,6 117 | 116,4,3,23.7,1.8,0 118 | 117,3,3,27,2.5,6 119 | 118,2,3,24.2,1.65,2 120 | 119,4,3,22.5,1.47,4 121 | 120,2,3,25.1,1.8,0 122 | 121,2,3,24.9,2.2,0 123 | 122,2,3,27.5,2.63,6 124 | 123,2,1,24.3,2,0 125 | 124,2,3,29.5,3.02,4 126 | 125,2,3,26.2,2.3,0 127 | 126,2,3,24.7,1.95,4 128 | 127,3,2,29.8,3.5,4 129 | 128,4,3,25.7,2.15,0 130 | 129,3,3,26.2,2.17,2 131 | 130,4,3,27,2.63,0 132 | 131,3,3,24.8,2.1,0 133 | 132,2,1,23.7,1.95,0 134 | 133,2,3,28.2,3.05,11 135 | 134,2,3,25.2,2,1 136 | 135,2,2,23.2,1.95,4 137 | 136,4,3,25.8,2,3 138 | 137,4,3,27.5,2.6,0 139 | 138,2,2,25.7,2,0 140 | 139,2,3,26.8,2.65,0 141 | 140,3,3,27.5,3.1,3 142 | 141,3,1,28.5,3.25,9 143 | 142,2,3,28.5,3,3 144 | 143,1,1,27.4,2.7,6 145 | 144,2,3,27.2,2.7,3 146 | 145,3,3,27.1,2.55,0 147 | 146,2,3,28,2.8,1 148 | 147,2,1,26.5,1.3,0 149 | 148,3,3,23,1.8,0 150 | 149,3,2,26,2.2,3 151 | 150,3,2,24.5,2.25,0 152 | 151,2,3,25.8,2.3,0 153 | 152,4,3,23.5,1.9,0 154 | 153,4,3,26.7,2.45,0 155 | 154,3,3,25.5,2.25,0 156 | 155,2,3,28.2,2.87,1 157 | 156,2,1,25.2,2,1 158 | 157,2,3,25.3,1.9,2 159 | 158,3,3,25.7,2.1,0 160 | 159,4,3,29.3,3.23,12 161 | 160,3,3,23.8,1.8,6 162 | 161,2,3,27.4,2.9,3 163 | 162,2,3,26.2,2.02,2 164 | 163,2,1,28,2.9,4 165 | 164,2,1,28.4,3.1,5 166 | 165,2,1,33.5,5.2,7 167 | 166,2,3,25.8,2.4,0 168 | 167,3,3,24,1.9,10 169 | 168,2,1,23.1,2,0 170 | 169,2,3,28.3,3.2,0 171 | 170,2,3,26.5,2.35,4 172 | 171,2,3,26.5,2.75,7 173 | 172,3,3,26.1,2.75,3 174 | 173,2,2,24.5,2,0 175 | -------------------------------------------------------------------------------- /data/diet.csv: -------------------------------------------------------------------------------- 1 | "id","gender","age","height","diet.type","initial.weight","final.weight" 2 | 1,"Female",22,159,"A",58,54.2 3 | 2,"Female",46,192,"A",60,54 4 | 3,"Female",55,170,"A",64,63.3 5 | 4,"Female",33,171,"A",64,61.1 6 | 5,"Female",50,170,"A",65,62.2 7 | 6,"Female",50,201,"A",66,64 8 | 7,"Female",37,174,"A",67,65 9 | 8,"Female",28,176,"A",69,60.5 10 | 9,"Female",28,165,"A",70,68.1 11 | 10,"Female",45,165,"A",70,66.9 12 | 11,"Female",60,173,"A",72,70.5 13 | 12,"Female",48,156,"A",72,69 14 | 13,"Female",41,163,"A",72,68.4 15 | 14,"Female",37,167,"A",82,81.1 16 | 27,"Female",44,174,"B",58,60.1 17 | 28,"Female",37,172,"B",58,56 18 | 29,"Female",41,165,"B",59,57.3 19 | 30,"Female",43,171,"B",61,56.7 20 | 31,"Female",20,169,"B",62,55 21 | 32,"Female",51,174,"B",63,62.4 22 | 33,"Female",31,163,"B",63,60.3 23 | 34,"Female",54,173,"B",63,59.4 24 | 35,"Female",50,166,"B",65,62 25 | 36,"Female",48,163,"B",66,64 26 | 37,"Female",16,165,"B",68,63.8 27 | 38,"Female",37,167,"B",68,63.3 28 | 39,"Female",30,161,"B",76,72.7 29 | 40,"Female",29,169,"B",77,77.5 30 | 52,"Female",51,165,"C",60,53 31 | 53,"Female",35,169,"C",62,56.4 32 | 54,"Female",21,159,"C",64,60.6 33 | 55,"Female",22,169,"C",65,58.2 34 | 56,"Female",36,160,"C",66,58.2 35 | 57,"Female",20,169,"C",67,61.6 36 | 58,"Female",35,163,"C",67,60.2 37 | 59,"Female",45,155,"C",69,61.8 38 | 60,"Female",58,141,"C",70,63 39 | 
61,"Female",37,170,"C",70,62.7 40 | 62,"Female",31,170,"C",72,71.1 41 | 63,"Female",35,171,"C",72,64.4 42 | 64,"Female",56,171,"C",73,68.9 43 | 65,"Female",48,153,"C",75,68.7 44 | 66,"Female",41,157,"C",76,71 45 | 15,"Male",39,168,"A",71,71.6 46 | 16,"Male",31,158,"A",72,70.9 47 | 17,"Male",40,173,"A",74,69.5 48 | 18,"Male",50,160,"A",78,73.9 49 | 19,"Male",43,162,"A",80,71 50 | 20,"Male",25,165,"A",80,77.6 51 | 21,"Male",52,177,"A",83,79.1 52 | 22,"Male",42,166,"A",85,81.5 53 | 23,"Male",39,166,"A",87,81.9 54 | 24,"Male",40,190,"A",88,84.5 55 | 41,"Male",51,191,"B",71,66.8 56 | 42,"Male",38,199,"B",75,72.6 57 | 43,"Male",54,196,"B",75,69.2 58 | 44,"Male",33,190,"B",76,72.5 59 | 45,"Male",45,160,"B",78,72.7 60 | 46,"Male",37,194,"B",78,76.3 61 | 47,"Male",44,163,"B",79,73.6 62 | 48,"Male",40,171,"B",79,72.9 63 | 49,"Male",37,198,"B",79,71.1 64 | 50,"Male",39,180,"B",80,81.4 65 | 51,"Male",31,182,"B",80,75.7 66 | 67,"Male",36,155,"C",71,68.5 67 | 68,"Male",47,179,"C",73,72.1 68 | 69,"Male",29,166,"C",76,72.5 69 | 70,"Male",37,173,"C",78,77.5 70 | 71,"Male",31,177,"C",78,75.2 71 | 72,"Male",26,179,"C",78,69.4 72 | 73,"Male",40,179,"C",79,74.5 73 | 74,"Male",35,183,"C",83,80.2 74 | 75,"Male",49,177,"C",84,79.9 75 | 76,"Male",28,164,"C",85,79.7 76 | 77,"Male",40,167,"C",87,77.8 77 | 78,"Male",51,175,"C",88,81.9 78 | -------------------------------------------------------------------------------- /data/genotypes.txt: -------------------------------------------------------------------------------- 1 | "AA" "AB" "BB" 2 | "1" 2.51304714898469 6.32886230947307 NA 3 | "2" 6.16876708252332 5.60757586162325 7.63948757433344 4 | "3" 3.18458867678788 8.26959775839993 6.79579919463547 5 | "4" 7.88995995654271 4.27138992776303 7.18864022126384 6 | "5" 5.14639512210706 6.28291661807179 7.48205753600522 7 | "6" NA 7.27477208906786 7.93472451951159 8 | "7" NA 7.18481566745089 9.20833919961227 9 | -------------------------------------------------------------------------------- /data/globalBreastCancerRisk.csv: -------------------------------------------------------------------------------- 1 | country,continent,year,lifeExp,pop,gdpPercap,NewCasesOfBreastCancerIn2002,AlcoholComsumption,BloodPressure,BodyMassIndex,Cholestorol,Smoking 2 | Afghanistan,Asia,2002,42.129,25268405,726.7340548,26.8,0.02,124.2085,20.65274,4.29517,NA 3 | Albania,Europe,2002,75.651,3508512,4604.211737,57.4,6.68,129.0609,25.27082,4.918646,4 4 | Algeria,Africa,2002,70.994,31287142,5288.040382,23.5,0.96,130.4024,25.69948,4.848951,0.3 5 | Angola,Africa,2002,41.003,10866106,2773.287312,23.1,5.4,129.9282,22.26093,4.499115,NA 6 | Argentina,Americas,2002,74.34,38331121,8797.640716,73.9,10,119.6538,26.7046,5.143871,25.4 7 | Australia,Oceania,2002,80.37,19546792,30687.75473,83.2,10.02,120.5113,26.25957,5.326858,21.8 8 | Austria,Europe,2002,78.98,8148312,32417.60769,70.5,13.24,125.8685,24.83051,5.381785,40.1 9 | Bahrain,Asia,2002,74.795,656397,23403.55927,40.2,3.66,132.4395,27.96036,5.204952,2.9 10 | Bangladesh,Asia,2002,62.013,135656790,1136.39043,16.6,0.17,124.5601,19.72414,4.400593,3.8 11 | Belgium,Europe,2002,78.32,10311970,30485.88375,92,10.77,124.8811,25.12181,5.454941,24.1 12 | Benin,Africa,2002,54.406,7026113,1372.877931,28.1,2.15,129.6162,23.14637,4.29748,NA 13 | Bolivia,Americas,2002,63.883,8445134,3413.26269,24.7,5.12,122.9154,26.1348,4.745852,29.2 14 | Bosnia and Herzegovina,Europe,2002,74.09,4165416,6018.975239,58.9,9.63,133.1459,25.96649,4.759707,35.1 15 | 
Botswana,Africa,2002,46.634,1630347,11003.60508,33.4,7.96,132.4681,25.52204,4.72277,NA 16 | Brazil,Americas,2002,71.006,179914212,8131.212843,46,9.16,125.8598,25.36366,4.899372,NA 17 | Bulgaria,Europe,2002,72.14,7661799,7696.777725,46.2,12.44,129.3342,25.24871,5.095784,27.8 18 | Burkina Faso,Africa,2002,50.65,12251209,1037.645221,30.6,6.98,129.2223,21.12089,4.127278,11.2 19 | Burundi,Africa,2002,47.36,7021078,446.4035126,19.5,9.47,131.8824,20.81771,4.170466,NA 20 | Cambodia,Asia,2002,56.752,12926707,896.2260153,21.5,4.77,117.6568,21.08197,4.443956,6.5 21 | Cameroon,Africa,2002,49.856,15929988,1934.011449,29.7,7.57,125.9758,23.99221,4.359665,2.2 22 | Canada,Americas,2002,79.77,31902268,33328.96507,84.3,9.77,120.5054,26.29888,5.263165,18.9 23 | Chad,Africa,2002,50.525,8835739,1156.18186,16.5,4.38,126.8861,21.18393,4.14814,2.6 24 | Chile,Americas,2002,77.86,15497046,10778.78385,43.9,8.55,126.4672,27.13405,5.088821,33.6 25 | China,Asia,2002,72.028,1280400000,3119.280896,18.7,5.91,123.3793,22.67688,4.451926,3.7 26 | Colombia,Americas,2002,71.682,41008227,5755.259962,30.3,6.17,123.3725,25.88257,5.016418,NA 27 | Comoros,Africa,2002,62.974,614382,1075.811558,19.5,0.36,130.3333,22.08826,4.379765,13.5 28 | Costa Rica,Americas,2002,78.123,3834934,7723.447195,30.9,5.55,121.6703,26.29537,4.884666,7.3 29 | Croatia,Europe,2002,74.876,4481020,11628.38895,62.1,15.11,131.4386,24.78217,5.093568,29.1 30 | Cuba,Americas,2002,77.158,11226999,6340.646683,31.2,5.51,125.4713,25.73888,4.708419,28.3 31 | Denmark,Europe,2002,77.18,5374693,32166.50006,88.7,13.37,122.0758,24.74319,5.479227,30.6 32 | Djibouti,Africa,2002,53.373,447416,1908.260867,19.5,1.87,128.7091,23.74615,4.707457,NA 33 | Ecuador,Americas,2002,74.173,12921234,5773.044512,23.5,9.38,123.7925,26.55303,4.837632,5.8 34 | Egypt,Africa,2002,69.806,73312559,4754.604414,24.2,0.37,125.4958,29.20828,4.839036,1.3 35 | El Salvador,Americas,2002,70.734,6353681,5351.568666,13.6,3.61,119.8362,26.8835,4.708999,NA 36 | Equatorial Guinea,Africa,2002,49.348,495627,7703.4959,16.5,6.08,132.0235,23.19161,4.695409,NA 37 | Eritrea,Africa,2002,55.24,4414865,765.3500015,19.5,1.54,123.9032,20.60579,4.252667,1.2 38 | Ethiopia,Africa,2002,50.725,67946797,530.0535319,24.7,4.02,123.9984,20.1839,4.203238,0.9 39 | Finland,Europe,2002,78.37,5193039,28204.59057,84.7,12.52,128.6964,25.41902,5.436867,24.4 40 | France,Europe,2002,79.59,59925035,28926.03234,91.9,13.66,122.8281,24.72699,5.43699,26.7 41 | Gabon,Africa,2002,56.761,1299304,12521.71392,18.2,9.32,132.3095,24.91982,4.942365,NA 42 | Gambia,Africa,2002,58.041,1457766,660.5855997,6.4,3.39,129.5809,23.63134,4.303299,2.9 43 | Germany,Europe,2002,78.67,82350671,30035.80198,79.8,12.81,128.0019,25.57631,5.593814,25.8 44 | Ghana,Africa,2002,58.453,20550751,1111.984578,28.1,2.97,128.9192,23.47061,4.238891,0.8 45 | Greece,Europe,2002,78.256,10603863,22514.2548,51.6,10.75,124.3581,24.73157,4.970351,39.8 46 | Guatemala,Americas,2002,68.978,11178650,4858.347495,25.9,4.03,120.3851,25.94954,4.56032,4.1 47 | Guinea,Africa,2002,53.676,8807818,945.5835837,15.3,0.76,131.2604,21.94696,4.222444,NA 48 | Guinea-Bissau,Africa,2002,45.504,1332459,575.7047176,28.1,3.68,129.6245,22.44935,4.170127,NA 49 | Haiti,Americas,2002,58.137,7607651,1270.364932,4.4,6.61,124.3388,22.67272,4.373766,NA 50 | Honduras,Americas,2002,68.565,6677328,3099.72866,25.9,4.48,122.2106,25.97615,4.593344,3.4 51 | Hungary,Europe,2002,72.59,10083313,14843.93556,63,16.27,129.7824,25.59566,5.189877,33.9 52 | 
Iceland,Europe,2002,80.5,288030,31163.20196,90,6.31,120.6813,25.50528,5.658249,26.6 53 | India,Asia,2002,62.879,1034172547,1746.769454,19.1,2.59,124.2951,21.03814,4.547826,3.8 54 | Indonesia,Asia,2002,68.588,211060000,2873.91287,26.1,0.59,126.4575,22.40431,4.622888,4.5 55 | Iran,Asia,2002,69.451,66907826,9240.761975,17.1,1.02,125.5227,26.13108,5.216626,5.5 56 | Iraq,Asia,2002,57.046,24001816,4390.717312,31.7,0.4,126.3037,27.93361,4.916117,2.5 57 | Ireland,Europe,2002,77.783,3879155,34077.04939,74.9,14.41,126.107,26.19135,5.426795,26 58 | Israel,Asia,2002,79.696,6029529,21905.59514,90.8,2.89,123.6205,26.72752,5.303616,17.9 59 | Italy,Europe,2002,80.24,57926999,27968.09817,74.4,10.68,125.5278,24.76686,5.27189,19.2 60 | Jamaica,Americas,2002,72.047,2664659,6994.774861,43.5,5,124.9047,26.64816,4.742052,9.2 61 | Japan,Asia,2002,82,127065841,28604.5919,32.7,8.03,123.9938,21.92321,5.197184,14.3 62 | Jordan,Asia,2002,71.263,5307470,3844.917194,33,0.71,123.1612,29.10914,5.21362,9.8 63 | Kenya,Africa,2002,50.992,31386842,1287.514732,25.2,4.14,127.8673,22.62218,4.405716,2.2 64 | Kuwait,Asia,2002,76.904,2111561,35110.10566,31.8,0.1,125.9623,30.34041,5.345846,NA 65 | Lebanon,Asia,2002,71.028,3677780,9313.93883,52.5,2.23,129.0933,27.12636,4.99548,7 66 | Lesotho,Africa,2002,44.593,2046772,1275.184575,13.1,5.55,129.5081,26.32388,4.260839,NA 67 | Liberia,Africa,2002,43.753,2814651,531.4823679,18.8,5.06,130.2577,22.69692,4.176063,NA 68 | Libya,Africa,2002,72.737,5368585,9534.677467,23.4,0.11,132.6075,28.51032,4.838883,NA 69 | Madagascar,Africa,2002,57.286,16473477,894.6370822,19.5,1.33,130.8052,20.65751,4.356253,NA 70 | Malawi,Africa,2002,45.009,11824495,665.4231186,10.5,1.74,130.9474,22.25219,4.331791,6.2 71 | Malaysia,Asia,2002,73.044,22662365,10206.97794,30.8,0.82,124.6766,24.86649,5.06707,2.8 72 | Mali,Africa,2002,51.818,10580176,951.4097518,18.2,1.04,127.2195,21.94506,4.195883,2.8 73 | Mauritania,Africa,2002,62.247,2828858,1579.019543,28.1,0.11,129.162,25.43653,4.372733,3.7 74 | Mauritius,Africa,2002,71.954,1200206,9021.815894,31.6,3.72,130.6751,25.57969,4.934729,1.1 75 | Mexico,Americas,2002,74.902,102479927,10742.44053,26.4,8.42,122.6993,27.97481,5.003826,12.4 76 | Mongolia,Asia,2002,65.033,2674234,2140.739323,6.6,3.24,128.8302,25.20751,4.831112,6.5 77 | Morocco,Africa,2002,69.615,31167783,3258.495584,22.5,1.46,126.6631,25.65263,4.749074,0.3 78 | Mozambique,Africa,2002,44.026,18473780,633.6179466,3.9,2.38,133.2081,22.44782,4.385706,3.4 79 | Namibia,Africa,2002,51.479,1972153,4072.324751,24.7,9.62,132.2566,24.39392,4.592049,10.9 80 | Nepal,Asia,2002,61.34,25873917,1057.206311,21.8,2.41,124.8058,20.16195,4.342687,26.4 81 | Netherlands,Europe,2002,78.53,16122830,33724.75778,86.7,10.05,123.8227,25.11372,5.389425,30.3 82 | New Zealand,Oceania,2002,79.11,3908037,23189.80135,91.9,9.62,121.1642,26.72162,5.380087,27.5 83 | Nicaragua,Americas,2002,70.836,5146848,2474.548819,23.9,5.37,122.9739,26.78785,4.589401,NA 84 | Niger,Africa,2002,54.496,11140655,601.0745012,23.3,0.34,132.2249,21.47314,4.089284,NA 85 | Nigeria,Africa,2002,46.608,119901274,1615.286395,31.2,12.28,134.2376,23.24367,4.110205,1.2 86 | Norway,Europe,2002,79.05,4535591,44683.97525,74.8,7.81,127.886,25.32467,5.410416,30.4 87 | Oman,Asia,2002,74.193,2713462,19774.83687,13.2,0.94,128.2915,26.41259,5.110034,1.3 88 | Pakistan,Asia,2002,63.61,153403524,2092.712441,50.1,0.06,126.2059,23.07639,4.591925,6.6 89 | Panama,Americas,2002,74.712,2990875,7356.031934,29,6.85,123.0864,26.76206,4.912917,NA 90 | 
Paraguay,Americas,2002,70.755,5884491,3783.674243,34.4,7.88,123.7607,25.36778,4.846562,14.8 91 | Peru,Americas,2002,69.906,26769436,5909.020073,35.1,6.9,120.7746,25.84887,4.82459,NA 92 | Philippines,Asia,2002,70.303,82995088,2650.921068,46.6,6.38,122.4797,23.00286,4.85752,9.8 93 | Poland,Europe,2002,74.67,38625976,12002.23908,50.3,13.25,130.0524,25.7373,5.206347,27.2 94 | Portugal,Europe,2002,77.29,10433867,19970.90787,55.5,14.55,129.172,25.9666,5.236896,31 95 | Romania,Europe,2002,71.322,22404337,7885.360081,44.3,15.3,129.3499,24.94229,4.968129,24.5 96 | Rwanda,Africa,2002,43.413,7852401,785.6537648,8.8,9.8,132.8907,21.56031,4.285954,NA 97 | Saudi Arabia,Asia,2002,71.626,24501530,19014.54118,24.7,0.25,127.3608,28.75728,4.991571,3.6 98 | Senegal,Africa,2002,61.6,10870037,1519.635262,18.4,0.6,129.346,23.53851,4.330656,1.5 99 | Sierra Leone,Africa,2002,41.012,5359092,699.489713,28.1,9.72,133.424,22.99407,4.110871,NA 100 | Singapore,Asia,2002,78.77,4197776,36023.1054,48.7,1.55,123.2673,23.23321,4.835357,NA 101 | Slovenia,Europe,2002,76.66,2011497,20660.01936,58.9,15.19,130.9412,26.39933,5.274947,21.1 102 | Somalia,Africa,2002,45.936,7753310,882.0818218,19.5,0.5,129.8451,22.18624,4.362204,NA 103 | South Africa,Africa,2002,53.365,44433622,7710.946444,35,9.46,130.0713,28.50839,4.563033,9.1 104 | Spain,Europe,2002,79.78,40152517,24835.47166,50.9,11.62,123.514,26.07576,5.19105,30.9 105 | Sri Lanka,Asia,2002,70.815,19576783,3015.378833,23.6,0.79,124.3827,22.71949,4.627639,2.6 106 | Sudan,Africa,2002,56.369,37090298,1993.398314,22.5,2.56,128.1803,22.50693,4.520633,NA 107 | Swaziland,Africa,2002,43.869,1130269,4128.116943,12.3,5.7,130.7529,27.73696,4.571466,3.2 108 | Sweden,Europe,2002,80.04,8954175,29341.63093,87.8,10.1,125.0731,24.99897,5.18503,24.5 109 | Switzerland,Europe,2002,80.62,7361757,34480.95771,81.7,11.06,121.6552,24.04547,5.349241,22.2 110 | Syria,Asia,2002,73.053,17155814,4090.925331,44.8,1.43,127.6258,28.23422,4.885778,NA 111 | Tanzania,Africa,2002,49.651,34593779,899.0742111,21.1,6.75,128.5271,22.5268,4.250212,4.3 112 | Thailand,Asia,2002,68.564,62806748,5913.187529,16.6,7.08,120.5435,23.9693,5.039856,3.4 113 | Togo,Africa,2002,57.561,4977378,886.2205765,28.1,1.99,129.8265,22.07173,4.169999,NA 114 | Trinidad and Tobago,Americas,2002,68.976,1101832,11460.60023,51.1,6.28,124.3713,27.50818,4.756425,7.6 115 | Tunisia,Africa,2002,73.042,9770575,5722.895655,19.6,1.29,129.5677,27.26516,4.830195,1.9 116 | Turkey,Europe,2002,70.845,67308928,6508.085718,22,2.87,126.0769,27.89841,4.855077,19.2 117 | Uganda,Africa,2002,47.813,24739869,927.7210018,18.3,11.93,132.7281,21.9589,4.288926,3.2 118 | United Kingdom,Europe,2002,78.471,59912431,29478.99919,87.2,13.37,127.7889,26.431,5.45187,34.7 119 | United States,Americas,2002,77.31,287675526,39097.09955,101.1,9.44,119.7817,27.75614,5.286608,21.5 120 | Uruguay,Americas,2002,75.307,3363085,7727.002004,83.1,8.14,124.619,25.88269,5.001928,28 121 | Venezuela,Americas,2002,72.766,24287670,8605.047831,34.3,8.23,124.9664,27.32819,4.802348,27 122 | Vietnam,Asia,2002,73.017,80908147,1764.456677,16.2,3.77,121.4875,20.47612,4.565701,2.5 123 | Zambia,Africa,2002,39.193,10595811,1071.613938,13,3.85,129.7063,22.45268,4.455595,5 124 | Zimbabwe,Africa,2002,39.989,11926563,672.0386227,19,5.08,130.9397,24.65855,4.40654,4.4 125 | -------------------------------------------------------------------------------- /data/lactoferrin.csv: -------------------------------------------------------------------------------- 1 | conc,growth 2 | 1,13.3222079379203 3 | 2,10.8305328089083 
4 | 3,11.1765715064331 5 | 4,8.98395336946397 6 | 5,8.1206954901671 7 | 6,8.04378381984629 8 | 7,5.24744940390419 9 | 8,7.2136631215058 10 | 9,2.27301595493662 11 | 10,2.13181537401631 12 | -------------------------------------------------------------------------------- /data/pollution.csv: -------------------------------------------------------------------------------- 1 | pollution,temp,industry,population,wind,rain,rainy.days 2 | 24,61.5,368,497,9.1,48.34,115 3 | 30,55.6,291,593,8.3,43.11,123 4 | 56,55.9,775,622,9.5,35.89,105 5 | 28,51,137,176,8.7,15.17,89 6 | 14,68.4,136,529,8.8,54.47,116 7 | 46,47.6,44,116,8.8,33.36,135 8 | 9,66.2,641,844,10.9,35.94,78 9 | 35,49.9,1064,1513,10.1,30.96,129 10 | 26,57.8,197,299,7.6,42.59,115 11 | 61,50.4,347,520,9.4,36.22,147 12 | 29,57.3,434,757,9.3,38.98,111 13 | 28,52.3,361,746,9.7,38.74,121 14 | 14,51.5,181,347,10.9,30.18,98 15 | 18,59.4,275,448,7.9,46,119 16 | 17,51.9,454,515,9,12.95,86 17 | 23,54,462,453,7.1,39.04,132 18 | 47,55,625,905,9.6,41.31,111 19 | 13,61,91,132,8.2,48.52,100 20 | 31,55.2,35,71,6.6,40.75,148 21 | 12,56.7,453,716,8.7,20.66,67 22 | 10,70.3,213,582,6,7.05,36 23 | 110,50.6,3344,3369,10.4,34.44,122 24 | 56,49.1,412,158,9,43.37,127 25 | 10,68.9,721,1233,10.8,48.19,103 26 | 69,54.6,1692,1950,9.6,39.93,115 27 | 8,56.6,125,277,12.7,30.58,82 28 | 36,54,80,80,9,40.25,114 29 | 16,45.7,569,717,11.8,29.07,123 30 | 29,51.1,379,531,9.4,38.79,164 31 | 29,43.5,669,744,10.6,25.94,137 32 | 65,49.7,1007,751,10.9,34.99,155 33 | 9,68.3,204,361,8.4,56.77,113 34 | 10,75.5,207,335,9,59.8,128 35 | 26,51.5,266,540,8.6,37.01,134 36 | 31,59.3,96,308,10.6,44.68,116 37 | 10,61.6,337,624,9.2,49.1,105 38 | 11,47.1,391,463,12.4,36.11,166 39 | 14,54.5,381,507,10,37,99 40 | 17,49,104,201,11.2,30.85,103 41 | 11,56.8,46,244,8.9,7.77,58 42 | 94,50,343,179,10.6,42.75,125 43 | -------------------------------------------------------------------------------- /data/protein-expression.csv: -------------------------------------------------------------------------------- 1 | A,B,C,D,E 2 | 0.4,0.26,0.24,1.04,0.74 3 | 1.5,0.47,0.25,2.78,0.99 4 | 0.98,0.42,1.01,0.82,1.26 5 | 0.33,0.64,0.77,1.65,1.5 6 | 0.75,0.32,0.47,0.49,0.3 7 | 1.48,0.65,0.47,0.97,0.34 8 | 1.18,0.43,0.46,1.39,0.77 9 | 0.33,0.67,0.65,3.24,1.94 10 | 1.42,0.43,0.41,1.12,2.62 11 | 2.09,0.7,0.81,2.82,1.42 12 | 1.37,0.79,1.2,1.27,0.73 13 | 1.23,0.89,1.08,1.6,2.09 14 | ,,0.34,1.98,1.52 15 | ,,1.98,9.32,1.67 16 | ,,1.39,2.31,3.4 17 | ,,1.12,4.19,2.16 18 | ,,3.14,1.73,2.31 19 | ,,2.78,5.16,1.32 20 | -------------------------------------------------------------------------------- /data/students.csv: -------------------------------------------------------------------------------- 1 | "day","cases" 2 | 1,6 3 | 2,8 4 | 3,12 5 | 3,9 6 | 4,3 7 | 4,3 8 | 4,11 9 | 6,5 10 | 7,7 11 | 8,3 12 | 8,8 13 | 8,4 14 | 8,6 15 | 12,8 16 | 14,3 17 | 15,6 18 | 17,3 19 | 17,2 20 | 17,2 21 | 18,6 22 | 19,3 23 | 19,7 24 | 20,7 25 | 23,2 26 | 23,2 27 | 23,8 28 | 24,3 29 | 24,6 30 | 25,5 31 | 26,7 32 | 27,6 33 | 28,4 34 | 29,4 35 | 34,3 36 | 36,3 37 | 36,5 38 | 42,3 39 | 42,3 40 | 43,3 41 | 43,5 42 | 44,3 43 | 44,5 44 | 44,6 45 | 44,3 46 | 45,3 47 | 46,3 48 | 48,3 49 | 48,2 50 | 49,3 51 | 49,1 52 | 53,3 53 | 53,3 54 | 53,5 55 | 54,4 56 | 55,4 57 | 56,3 58 | 56,5 59 | 58,4 60 | 60,3 61 | 63,5 62 | 65,3 63 | 67,4 64 | 67,2 65 | 68,3 66 | 71,3 67 | 71,1 68 | 72,3 69 | 72,2 70 | 72,5 71 | 73,4 72 | 74,3 73 | 74,0 74 | 74,3 75 | 75,3 76 | 75,4 77 | 80,0 78 | 81,3 79 | 81,3 80 | 81,4 81 | 81,0 82 | 88,2 83 | 88,2 84 | 90,1 85 | 93,1 86 | 93,2 87 | 
94,0 88 | 95,2 89 | 95,1 90 | 95,1 91 | 96,0 92 | 96,0 93 | 97,1 94 | 98,1 95 | 100,2 96 | 101,2 97 | 102,1 98 | 103,1 99 | 104,1 100 | 105,1 101 | 106,0 102 | 107,0 103 | 108,0 104 | 109,1 105 | 110,1 106 | 111,0 107 | 112,0 108 | 113,0 109 | 114,0 110 | 115,0 111 | -------------------------------------------------------------------------------- /data/treatments.txt: -------------------------------------------------------------------------------- 1 | Control Treatment 1 Treatment 2 Treatment 3 2 | GS 54. 43. 78. 111. 3 | JM 23. 34. 65. 99. 4 | HM 45. 65. 99. 78. 5 | DR 54. 77. 79. 90. 6 | PS 45. 46. 87. 95. -------------------------------------------------------------------------------- /glm+.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "GLM with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | 25 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 26 | # change working directory: should be the directory containg the Markdown files: 27 | # setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/20200310/Practicals/") 28 | 29 | # install gamlss package if needed 30 | # install.packages("gamlss") 31 | ``` 32 | 33 | # Section 1: Logistic regression 34 | 35 | We will analyse the data collected by Jones (Unpublished BSc dissertation, University of Southampton, 1975). The aim of the study was to define if the probability of having Bronchitis is influenced by smoking and/or pollution. 36 | 37 | The data are stored under data/Bronchitis.csv and contains information on 212 participants. 38 | 39 | 40 | ### Section 1.1: importation and descriptive analysis 41 | 42 | Lets starts by 43 | 44 | * importing the data set *Bronchitis* with the function `read.csv()` 45 | * displaying _bron_ (a dichotomous variable which equals 1 for participants having bronchitis and 0 otherwise) as a function of _cigs_, the number of cigarettes smoked daily. 46 | 47 | 48 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 49 | Bronchitis = read.csv("data/Bronchitis.csv",header=TRUE) 50 | plot(Bronchitis$cigs,Bronchitis$bron,col="blue4", 51 | ylab = "Absence/Presense of Bronchitis", xlab = "Daily number of cigarettes") 52 | abline(h=c(0,1),col="light blue") 53 | ``` 54 | 55 | # Section 1.2: Model fit 56 | 57 | Lets 58 | 59 | * fit a logistic model by means the function `glm()` and by means of the function `gamlss()` of the library `gamlss`. 60 | * display and analyse the results of the `glm` function : Use the function `summary()` to display the results of an R object of class `glm`. 
61 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 62 | fit.glm = glm(bron~cigs,data=Bronchitis,family=binomial) 63 | 64 | library(gamlss) 65 | fit.gamlss = gamlss(bron~cigs,data=Bronchitis,family=BI) 66 | 67 | summary(fit.glm) 68 | ``` 69 | 70 | Let's now define the estimated probability of having bronchitis for any number of daily smoked cigarette and display the corresponding logistic curve on a plot: 71 | 72 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 73 | plot(Bronchitis$cigs,Bronchitis$bron,col="blue4", 74 | ylab = "Absence/Presense of Bronchitis", xlab = "Daily number of cigarettes") 75 | abline(h=c(0,1),col="light blue") 76 | 77 | axe.x = seq(0,40,length=1000) 78 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])/(1+exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])) 79 | lines(axe.x,f.x,col="pink2",lwd=2) 80 | ``` 81 | 82 | ## Section 1.3: Model selection 83 | 84 | As for linear models, model selection may be done by means of the function `anova()` used on the glm object of interest. 85 | 86 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 87 | anova(fit.glm,test="LRT") 88 | ``` 89 | 90 | ## Section 1.3: Model check 91 | 92 | Lets assess is the model fit seems satisfactory by means 93 | 94 | * of the analysis of deviance residuals (function `plot()` on an object of class `glm`, 95 | * of the analysis of randomised normalised quantile residuals (function `plot()` on an object of class `gamlss`, 96 | 97 | 98 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 99 | # deviance 100 | par(mfrow=c(2,2),mar=c(3,5,3,0)) 101 | plot(fit.glm) 102 | # randomised normalised quantile residuals 103 | plot(gamlss(bron~cigs,data=Bronchitis,family=BI)) 104 | ``` 105 | 106 | ## Section 1.4: Fun 107 | 108 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 109 | # long format: 110 | long = data.frame(mi = rep(c("MI","No MI"),c(104+189,11037+11034)), 111 | treatment = rep(c("Aspirin","Placebo","Aspirin","Placebo"),c(104,189,11037,11034))) 112 | # short format: 2 by 2 table 113 | table2by2 = table(long$treatment,long$mi) 114 | 115 | # 116 | chisq.test(table2by2) 117 | prop.test(table2by2[,"MI"],apply(table2by2,1,sum)) 118 | summary(glm(I(mi=="MI")~treatment,data=long,family="binomial")) 119 | ``` 120 | 121 | 122 | # Section 2: Poisson regression 123 | 124 | The dataset *students.csv* shows the number of high school students diagnosed with an infectious disease for each day from the initial disease outbreak. 125 | 126 | # Section 2.2: Importation 127 | 128 | Lets 129 | 130 | * import the dataset by means of the function `read.csv()` 131 | * display the daily number of students diagnosed with the disease (variable `cases`) as a function of the days since the outbreak (variable `day`). 132 | 133 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 134 | students = read.csv("data/students.csv",header=TRUE) 135 | plot(students$day,students$cases,col="blue4", 136 | ylab = "Number of diagnosed students", xlab = "Days since initial outbreak") 137 | abline(h=c(0),col="light blue") 138 | ``` 139 | 140 | # Section 2.2: Model fit 141 | 142 | Lets 143 | 144 | * fit a poisson model by means the function `glm()` and by means of the function `gamlss()` of the library `gamlss`. 145 | * display and analyse the results of the `glm` function : Use the function `summary()` to display the results of an R object of class `glm`. 
146 | 147 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 148 | fit.glm = glm(cases~day,data=students,family=poisson) 149 | 150 | library(gamlss) 151 | fit.gamlss = gamlss(cases~day,data=students,,family=PO) 152 | 153 | summary(fit.glm) 154 | ``` 155 | 156 | Let's now define the estimated probability of having bronchitis for any number of daily smoked cigarette and display the corresponding logistic curve on a plot: 157 | 158 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 159 | plot(students$day,students$cases,col="blue4", 160 | ylab = "Number of diagnosed students", xlab = "Days since initial outbreak") 161 | abline(h=c(0),col="light blue") 162 | 163 | axe.x = seq(0,120,length=1000) 164 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2]) 165 | lines(axe.x,f.x,col="pink2",lwd=2) 166 | ``` 167 | 168 | ## Section 2.3: Model selection 169 | 170 | As for linear models, model selection may be done by means of the function `anova()` used on the glm object of interest. 171 | 172 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 173 | anova(fit.glm,test="LRT") 174 | ``` 175 | 176 | ## Section 2.3: Model check 177 | 178 | Lets assess is the model fit seems satisfactory by means 179 | 180 | * of the analysis of deviance residuals (function `plot()` on an object of class `glm`, 181 | * of the analysis of randomised normalised quantile residuals (function `plot()` on an object of class `gamlss`, 182 | 183 | 184 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 185 | # deviance 186 | par(mfrow=c(2,2),mar=c(3,5,3,0)) 187 | plot(fit.glm) 188 | # randomised normalised quantile residuals 189 | plot(fit.gamlss) 190 | ``` 191 | 192 | 193 | 194 | # Section 6: Practicals 195 | 196 | 197 | ### (i) *Bronchitis.csv* 198 | Analyse further the Bronchitis data of Jones (1975) by 199 | 200 | * first investigating if the probability of having bronchitis also depends on _pollution_ (variable `poll`), 201 | * second investigating if there is an interaction between the variables `cigs` and `poll`. 202 | 203 | 204 | Lets plot the data first. 205 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 206 | Bronchitis = read.csv("data/Bronchitis.csv",header=TRUE) 207 | # plot 208 | plot(Bronchitis$poll,Bronchitis$bron,col="blue4", 209 | ylab = "Absence/Presense of Bronchitis", xlab = "Pollution level") 210 | abline(h=c(0,1),col="light blue") 211 | ``` 212 | No obvious relationship between pollution and bronchitis is visible by means of this plot. 213 | 214 | Lets fit a model assuming that the probability of getting bronchitis is a function of the pollution level. 215 | 216 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 217 | # fit1: 218 | fit.glm = glm(bron~poll,data=Bronchitis,family=binomial) 219 | summary(fit.glm) 220 | ``` 221 | 222 | The intercept of the previous fit allows to define the probability of getting bronchitis when the level of pollution equals 0. 
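As a quick aside (not part of the original practical), the intercept reported by `summary()` is on the log-odds scale; the inverse-logit function `plogis()` maps it back to a probability, here the fitted probability of bronchitis at a pollution level of 0:

```{r message = FALSE, warning = FALSE, echo = TRUE}
# fitted probability of bronchitis when poll = 0, implied by fit.glm above;
# plogis(x) is exp(x)/(1+exp(x)), the inverse of the logit link
plogis(coef(fit.glm)["(Intercept)"])
```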
223 | As a zero level of pollution is (i) out of range (ii) not a realistic value, we will 224 | 225 | * create the variable `poll_centered` defined as the pollution level minus the mean so that the intercept corresponds to the probability of getting bronchitis for an average pollution level in Cardiff, 226 | * refit the model 227 | 228 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 229 | # fit2: 230 | Bronchitis$poll_centered = Bronchitis$poll-mean(Bronchitis$poll) 231 | fit.glm = glm(bron~poll_centered,data=Bronchitis,family=binomial) 232 | library(gamlss) 233 | fit.gamlss = gamlss(bron~poll_centered,data=Bronchitis,family=BI) 234 | summary(fit.glm) 235 | ``` 236 | 237 | Lets 238 | 239 | * perform a model check but plotting the randomised quantile residuals of gamlss a few times 240 | * plot the fitted probabilities 241 | 242 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 243 | # model check 244 | plot(gamlss(bron~poll_centered,data=Bronchitis,family=BI)) 245 | plot(gamlss(bron~poll_centered,data=Bronchitis,family=BI)) 246 | plot(gamlss(bron~poll_centered,data=Bronchitis,family=BI)) 247 | 248 | # plot fit 249 | plot(Bronchitis$poll_centered,Bronchitis$bron,col="blue4", 250 | ylab = "Absence/Presense of Bronchitis", xlab = "Pollution level") 251 | abline(h=c(0,1),col="light blue") 252 | axe.x = seq(-10,10,length=1000) 253 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])/(1+exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])) 254 | lines(axe.x,f.x,col="pink2",lwd=2) 255 | ``` 256 | Model check suggests a good fit. Lets finally check if the interaction is significant: 257 | 258 | 259 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 260 | # interaction ? 261 | fit.glm = glm(bron~poll_centered*cigs,data=Bronchitis,family=binomial) 262 | summary(fit.glm) 263 | anova(fit.glm,test="LRT") 264 | ``` 265 | Interaction is not significant. 266 | 267 | 268 | 269 | ### (ii) *myocardialinfarction.csv* 270 | 271 | The file *myocardialinfarction.csv* indicates if a participant had a myocardial infarction attack (variable `infarction`) as well the participant's treatment (variable `treatment`). 272 | 273 | Does _Aspirin_ decrease the probability to have a myocardial infarction attack ? 
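Before any modelling, a simple cross-tabulation already gives a first impression (a minimal sketch; it only assumes the columns `infarction` and `treatment` described above):

```{r message = FALSE, warning = FALSE, echo = TRUE}
# raw counts and row-wise proportions of attacks per treatment group
mi = read.csv("data/myocardialinfarction.csv")
table(mi$treatment, mi$infarction)
prop.table(table(mi$treatment, mi$infarction), margin = 1)
```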
274 | 275 | Lets (i) import the dataset, (ii) change the levels of the factor `treatment` so that `Placebo` corresponds to the reference group, (iii) and finally plot the (sample) probabilities to get an attack by treatment group 276 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 277 | # import 278 | myocardialinfarction = read.csv("data/myocardialinfarction.csv") 279 | # by default, Aspirin is the reference group as the alphabetic order is used 280 | myocardialinfarction$treatment = factor(myocardialinfarction$treatment, 281 | levels=c("Placebo","Aspirin")) 282 | # plot 283 | par(mfrow=c(1,1),mar=c(3,4,3,1)) 284 | pi.group = tapply(myocardialinfarction$infarction=="attack",myocardialinfarction$treatment,mean) 285 | table.group = tapply(myocardialinfarction$infarction=="attack",myocardialinfarction$treatment,table) 286 | temp = barplot(pi.group,plot=FALSE) 287 | barplot(pi.group,ylab="Probability",xlab="", 288 | main = "Probability of myocardial infarction\n per treatment group",names=rep("",2), 289 | cex.axis=.6,axes=FALSE,cex.main=1.4) 290 | axis(2,las=2,cex.axis=.8) 291 | axis(1,temp[,1],names(pi.group),cex.axis=1.25,tick=FALSE) 292 | for(gw in 1:length(pi.group)){ 293 | text(temp[gw],pi.group[gw]/2,.p(table.group[[gw]]["TRUE"]," / ",sum(table.group[[gw]])), 294 | col="red",cex=1.5) 295 | } 296 | ``` 297 | The barplot seems to suggest that the treatment (aspirin) reduces the risk of myocardial infarction. Lets fit a logistic model to assess if this difference is significant. 298 | Note that, in this case (a dichotomous outcome and a dichotomous predictor), a test of equality of proportions or an independence test could also do the job. 299 | With a logistic model, other predictors could easily be added to the model and the beta parameter corresponding to the treatment can be interpreted by means of odd ratios (or relative risk ratios when prevalences are *small*, as we will note at the end of this practical). 300 | 301 | 302 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 303 | # model fit 304 | fit.glm = glm(I(infarction=="attack")~treatment,data=myocardialinfarction,family=binomial) 305 | summary(fit.glm) 306 | # test of equality of (independent) proportions 307 | prop.test(unlist(lapply(table.group,function(x)x[2])),unlist(lapply(table.group,sum))) 308 | # test of independence 309 | chisq.test(matrix(unlist(table.group),ncol=2)) 310 | ``` 311 | The three methods lead to the same conclusion: there is a significant difference between the probabilities of having a myocardial infarction of the two treatment groups. 312 | Note that the two last methods get exactly the same results (as they use the same test X-squared statsitic). 313 | 314 | Lets define the fitted probabilities to get an attack and compare them to the sample probabilities: they should match: 315 | 316 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 317 | # pi_placebo: 318 | pi_placebo = exp(-4.04971)/(1+exp(-4.04971)) 319 | pi_placebo 320 | pi.group[1] 321 | # pi_aspirin 322 | pi_aspirin = exp( -4.04971-0.60544)/(1+exp( -4.04971-0.60544)) 323 | pi_aspirin 324 | pi.group[2] 325 | ``` 326 | 327 | Finally, note that when prevalence are small, the exponential of the logistic regression corresponding to the treatment **may** also be interpreted as relative risk ratios. 
Indeed: 328 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 329 | # interpreation of exp(beta1) when prevalences are small 330 | pi_aspirin/pi_placebo 331 | exp(-0.60544) 332 | c(exp(-0.60544-qnorm(.975)*0.12284),exp(-0.60544+qnorm(.975)*0.12284)) 333 | ``` 334 | Thus, aspirin strongly reduces the risk of myocardial infarction. 335 | 336 | 337 | 338 | 339 | ### (ii) *crabs.csv* 340 | 341 | This data set is derived from Agresti (2007, Table 3.2, pp.76-77). It gives 6 variables for each of 173 female horseshoe crabs: 342 | 343 | * Explanatory variables that are thought to affect this included the female crab’s color (C), spine condition (S), weightweight (Wt) 344 | * C: the crab's colour, 345 | * S: the crab's spine condition, 346 | * Wt: the crab's weight, 347 | * W: the crab's carapace width, 348 | * Sa: the response outcome, i.e., the number of satellites. 349 | 350 | Check if the width of female's back can explain the number of satellites attached by fitting a Poisson regression model with width. 351 | 352 | 353 | Lets import the datasset, fit a poisson loglinear model, plot the fit and perfom a model check : 354 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 355 | crabs = read.csv("data/crab.csv",header=TRUE) 356 | 357 | # plot: 358 | plot(crabs$W,crabs$Sa,col="blue4", 359 | ylab = "Number of satellites", xlab = "width of female's back") 360 | abline(h=c(0),col="light blue") 361 | 362 | 363 | # fit 364 | fit.glm = glm(Sa~W,data=crabs,family=poisson) 365 | library(gamlss) 366 | 367 | 368 | # plot fit: 369 | plot(crabs$W,crabs$Sa,col="blue4", 370 | ylab = "Number of satellites", xlab = "width of female's back") 371 | abline(h=c(0),col="light blue") 372 | 373 | axe.x = seq(15,40,length=1000) 374 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2]) 375 | lines(axe.x,f.x,col="pink2",lwd=2) 376 | # not a great fit... 377 | 378 | # model check 379 | plot(gamlss(Sa~W,data=crabs,family=PO)) 380 | plot(gamlss(Sa~W,data=crabs,family=PO)) 381 | plot(gamlss(Sa~W,data=crabs,family=PO)) 382 | 383 | # confirm lack of fit -> bin the estimates 384 | 385 | # 2 alternative models 386 | plot(gamlss(Sa~W,data=crabs,family=ZIP)) 387 | plot(gamlss(Sa~W,data=crabs,family=NBI)) 388 | # check ?ZIP and ?NBI for detail 389 | ``` 390 | Reasonably, there is a lack of fit> the estimates are not to be trusted. 391 | 392 | 393 | 394 | -------------------------------------------------------------------------------- /glm.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "GLM with R" 3 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri " 4 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 5 | output: 6 | html_document: 7 | theme: united 8 | highlight: tango 9 | code_folding: show 10 | toc: true 11 | toc_depth: 2 12 | toc_float: true 13 | fig_width: 8 14 | fig_height: 6 15 | --- 16 | 17 | 18 | 19 | 20 | 21 | 22 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 23 | # change working directory: should be the directory containg the Markdown files: 24 | # setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/20210514/Practicals/") 25 | 26 | # install gamlss package if needed 27 | # install.packages("gamlss") 28 | ``` 29 | 30 | # Section 1: Logistic regression 31 | 32 | We will analyse the data collected by Jones (Unpublished BSc dissertation, University of Southampton, 1975). The aim of the study was to define if the probability of having Bronchitis is influenced by smoking and/or pollution. 
33 | 34 | The data are stored under data/Bronchitis.csv and contains information on 212 participants. 35 | 36 | 37 | ### Section 1.1: importation and descriptive analysis 38 | 39 | Lets starts by 40 | 41 | * importing the data set *Bronchitis* with the function `read.csv()` 42 | * displaying _bron_ (a dichotomous variable which equals 1 for participants having bronchitis and 0 otherwise) as a function of _cigs_, the number of cigarettes smoked daily. 43 | 44 | 45 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 46 | Bronchitis = read.csv("data/Bronchitis.csv",header=TRUE) 47 | plot(Bronchitis$cigs,Bronchitis$bron,col="blue4", 48 | ylab = "Absence/Presense of Bronchitis", xlab = "Daily number of cigarettes") 49 | abline(h=c(0,1),col="light blue") 50 | ``` 51 | 52 | # Section 1.2: Model fit 53 | 54 | Lets 55 | 56 | * fit a logistic model by means the function `glm()` and by means of the function `gamlss()` of the library `gamlss`. 57 | * display and analyse the results of the `glm` function : Use the function `summary()` to display the results of an R object of class `glm`. 58 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 59 | fit.glm = glm(bron~cigs,data=Bronchitis,family=binomial) 60 | 61 | library(gamlss) 62 | fit.gamlss = gamlss(bron~cigs,data=Bronchitis,family=BI) 63 | 64 | summary(fit.glm) 65 | ``` 66 | 67 | Let's now define the estimated probability of having bronchitis for any number of daily smoked cigarette and display the corresponding logistic curve on a plot: 68 | 69 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 70 | plot(Bronchitis$cigs,Bronchitis$bron,col="blue4", 71 | ylab = "Absence/Presense of Bronchitis", xlab = "Daily number of cigarettes") 72 | abline(h=c(0,1),col="light blue") 73 | 74 | axe.x = seq(0,40,length=1000) 75 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])/(1+exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])) 76 | lines(axe.x,f.x,col="pink2",lwd=2) 77 | ``` 78 | 79 | ## Section 1.3: Model selection 80 | 81 | As for linear models, model selection may be done by means of the function `anova()` used on the glm object of interest. 82 | 83 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 84 | anova(fit.glm,test="LRT") 85 | ``` 86 | 87 | ## Section 1.3: Model check 88 | 89 | Lets assess is the model fit seems satisfactory by means 90 | 91 | * of the analysis of deviance residuals (function `plot()` on an object of class `glm`, 92 | * of the analysis of randomised normalised quantile residuals (function `plot()` on an object of class `gamlss`, 93 | 94 | 95 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 96 | # deviance 97 | par(mfrow=c(2,2),mar=c(3,5,3,0)) 98 | plot(fit.glm) 99 | # randomised normalised quantile residuals 100 | plot(gamlss(bron~cigs,data=Bronchitis,family=BI)) 101 | ``` 102 | 103 | ## Section 1.4: Fun 104 | 105 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 106 | # long format: 107 | long = data.frame(mi = rep(c("MI","No MI"),c(104+189,11037+11034)), 108 | treatment = rep(c("Aspirin","Placebo","Aspirin","Placebo"),c(104,189,11037,11034))) 109 | # short format: 2 by 2 table 110 | table2by2 = table(long$treatment,long$mi) 111 | 112 | # 113 | chisq.test(table2by2) 114 | prop.test(table2by2[,"MI"],apply(table2by2,1,sum)) 115 | summary(glm(I(mi=="MI")~treatment,data=long,family="binomial")) 116 | ``` 117 | 118 | 119 | # Section 2: Poisson regression 120 | 121 | The dataset *students.csv* shows the number of high school students diagnosed with an infectious disease for each day from the initial disease outbreak. 
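Because the response is a count, a Poisson GLM with a log link, where the expected count is `exp(b0 + b1*day)`, is a natural candidate. As an optional sketch (not part of the original material), the Poisson assumption that the variance roughly tracks the mean can be eyeballed by binning the days:

```{r message = FALSE, warning = FALSE, echo = TRUE}
# compare mean and variance of the daily counts within coarse bins of 'day';
# for Poisson-like data the two should be of similar magnitude
students = read.csv("data/students.csv", header = TRUE)
bins = cut(students$day, breaks = 4)
tapply(students$cases, bins, mean)
tapply(students$cases, bins, var)
```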
122 | 123 | # Section 2.2: Importation 124 | 125 | Lets 126 | 127 | * import the dataset by means of the function `read.csv()` 128 | * display the daily number of students diagnosed with the disease (variable `cases`) as a function of the days since the outbreak (variable `day`). 129 | 130 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 131 | students = read.csv("data/students.csv",header=TRUE) 132 | plot(students$day,students$cases,col="blue4", 133 | ylab = "Number of diagnosed students", xlab = "Days since initial outbreak") 134 | abline(h=c(0),col="light blue") 135 | ``` 136 | 137 | # Section 2.2: Model fit 138 | 139 | Lets 140 | 141 | * fit a poisson model by means the function `glm()` and by means of the function `gamlss()` of the library `gamlss`. 142 | * display and analyse the results of the `glm` function : Use the function `summary()` to display the results of an R object of class `glm`. 143 | 144 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 145 | fit.glm = glm(cases~day,data=students,family=poisson) 146 | 147 | library(gamlss) 148 | fit.gamlss = gamlss(cases~day,data=students,,family=PO) 149 | 150 | summary(fit.glm) 151 | ``` 152 | 153 | Let's now define the estimated probability of having bronchitis for any number of daily smoked cigarette and display the corresponding logistic curve on a plot: 154 | 155 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 156 | plot(students$day,students$cases,col="blue4", 157 | ylab = "Number of diagnosed students", xlab = "Days since initial outbreak") 158 | abline(h=c(0),col="light blue") 159 | 160 | axe.x = seq(0,120,length=1000) 161 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2]) 162 | lines(axe.x,f.x,col="pink2",lwd=2) 163 | ``` 164 | 165 | ## Section 2.3: Model selection 166 | 167 | As for linear models, model selection may be done by means of the function `anova()` used on the glm object of interest. 168 | 169 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 170 | anova(fit.glm,test="LRT") 171 | ``` 172 | 173 | ## Section 2.3: Model check 174 | 175 | Lets assess is the model fit seems satisfactory by means 176 | 177 | * of the analysis of deviance residuals (function `plot()` on an object of class `glm`, 178 | * of the analysis of randomised normalised quantile residuals (function `plot()` on an object of class `gamlss`, 179 | 180 | 181 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 182 | # deviance 183 | par(mfrow=c(2,2),mar=c(3,5,3,0)) 184 | plot(fit.glm) 185 | # randomised normalised quantile residuals 186 | plot(fit.gamlss) 187 | ``` 188 | 189 | 190 | 191 | # Section 6: Practicals 192 | 193 | 194 | ### (i) *Bronchitis.csv* 195 | Analyse further the Bronchitis data of Jones (1975) by 196 | 197 | * first investigating if the probability of having bronchitis also depends on _pollution_ (variable `poll`), 198 | * second investigating if there is an interaction between the variables `cigs` and `poll`. 199 | 200 | 201 | ### (ii) *myocardialinfarction.csv* 202 | 203 | The file *myocardialinfarction.csv* indicates if a participant had a myocardial infarction attack (variable `infarction`) as well the participant's treatment (variable `treatment`). 204 | 205 | Does _Aspirin_ decrease the probability to have a myocardial infarction attack ? 206 | 207 | 208 | ### (ii) *crabs.csv* 209 | 210 | This data set is derived from Agresti (2007, Table 3.2, pp.76-77). 
It gives 6 variables for each of 173 female horseshoe crabs: 211 | 212 | * Explanatory variables that are thought to affect this included the female crab’s color (C), spine condition (S), weightweight (Wt) 213 | * C: the crab's colour, 214 | * S: the crab's spine condition, 215 | * Wt: the crab's weight, 216 | * W: the crab's carapace width, 217 | * Sa: the response outcome, i.e., the number of satellites. 218 | 219 | Check if the width of female's back can explain the number of satellites attached by fitting a Poisson regression model with width. 220 | 221 | 222 | -------------------------------------------------------------------------------- /glm.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/glm.pdf -------------------------------------------------------------------------------- /images/examplePlots.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/images/examplePlots.png -------------------------------------------------------------------------------- /images/plot-char.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/images/plot-char.png -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | ## Introduction to Linear Modelling with R 2 | 3 | ## Description 4 | 5 | The course will cover ANOVA, linear regression and some extensions. It will be a mixture of lectures and hands-on time using RStudio to analyse data. 6 | 7 | [Timetable](timetable.pdf) 8 | 9 | # Aims: During this course you will learn about: 10 | 11 | - ANOVA 12 | - Simple and multiple regression 13 | - Generalised Linear Models 14 | - Introduction to more advanced topics, like non-linear models and time series. 15 | 16 | # Objectives: After this course you should be able to 17 | 18 | - Realise the connection between t-tests, ANOVA and linear regression 19 | - Fit a linear regression 20 | - Check if the assumptions of linear regression are met by the data and what to do if they are not 21 | - Know when linear regression is not appropriate and have an idea of which alternative method might be appropriate 22 | - Know when you need to seek help with analysis as the data structure is too complex for the methods taught 23 | 24 | # Course Data 25 | 26 | - Please Download [this zip file](Course_Data.zip) to have all the datasets and R files used in this course 27 | 28 | # Feedback 29 | - After the course, please fill in this [feedback form](https://www.surveymonkey.co.uk/r/LINMODMARCH). Thank you. 30 | 31 | # Other courses 32 | - The CRUK-CI Bioinformatics Core facility run a catalogue of courses. [Please visit for more details](https://www.cruk.cam.ac.uk/core-facilities/bioinformatics-core/programme). 
33 | 34 | # Materials 35 | 36 | - Course Introduction 37 | + Tutorial [HTML](r-recap.nb.html) 38 | + Tutorial [R markdown](r-recap.Rmd) 39 | + Cheat Sheet [PDF](cheat_sheet.pdf) 40 | - ANOVA 41 | + Slides [PDF](anova.pdf) 42 | + Tutorial [HTML](anova.html) 43 | + Tutorial [R markdown](anova.Rmd) 44 | - Simple Regression 45 | + Slides [PDF](simple_regression.pdf) 46 | + Tutorial [HTML](simple_regression.html) 47 | + Tutorial [R markdown](simple_regression.Rmd) 48 | - Multiple Regression 49 | + Slides [PDF](multiple_regression.pdf) 50 | + Tutorial [HTML](multiple_regression.html) 51 | + Tutorial [R markdown](multiple_regression.Rmd) 52 | - Generalised Linear Models 53 | + Slides [PDF](glm.pdf) 54 | + Tutorial [HTML](glm.html) 55 | + Tutorial [R markdown](glm.Rmd) 56 | - Time Series Models 57 | + Slides [PDF](time_series.pdf) 58 | + Tutorial [HTML](time_series_analysis.html) 59 | + Tutorial [R markdown](time_series_analysis.Rmd) 60 | 61 | 62 | 63 | # Pre-requisites 64 | 65 | **This course assumes basic knowledge of statistics and use of R** , which would be obtained from our Introductory Statistics Course and an "Introduction to R for Solving Biological Problems" run at the Genetics department (or equivalent). 66 | 67 | - [Introduction to Solving Biological Problems with R](http://cambiotraining.github.io/r-intro/) 68 | - [Introduction to Statistical Analysis](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/) 69 | 70 | ## Going further 71 | - [Transforming data](http://rcompanion.org/handbook/I_12.html) 72 | 73 | -------------------------------------------------------------------------------- /install.R: -------------------------------------------------------------------------------- 1 | options(repos = c("CRAN" = "http://cran.ma.imperial.ac.uk")) 2 | 3 | install.packages(c("MASS", "gamlss")) 4 | -------------------------------------------------------------------------------- /logos/CRUK_CI_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/logos/CRUK_CI_logo.png -------------------------------------------------------------------------------- /logos/LMB_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/logos/LMB_logo.png -------------------------------------------------------------------------------- /logos/LMB_logo_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/logos/LMB_logo_small.png -------------------------------------------------------------------------------- /logos/Logos.txt: -------------------------------------------------------------------------------- 1 | Folder for organisational logos 2 | -------------------------------------------------------------------------------- /multiple_regression+.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Multiple Regression with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. 
Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | # Section 1: Multiple Regression 31 | 32 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Now we will expand this by considering `Height` as another predictor. 33 | 34 | Start by plotting the dataset: 35 | ```{r} 36 | plot(trees) 37 | ``` 38 | 39 | This plots all variables against each other, enabling visual information about correlations within the dataset. 40 | 41 | Re-create the original model of `Volume` against `Girth`: 42 | ```{r} 43 | m1 = lm(Volume~Girth,data=trees) 44 | summary(m1) 45 | ``` 46 | 47 | Now include `Height` as an additional variable: 48 | ```{r} 49 | m2 = lm(Volume~Girth+Height,data=trees) 50 | summary(m2) 51 | ``` 52 | 53 | Note that the R^2 has improved, yet the `Height` term is less significant than the other two parameters. 54 | 55 | Try including the interaction term between `Girth` and `Height`: 56 | ```{r} 57 | m3 = lm(Volume~Girth*Height,data=trees) 58 | summary(m3) 59 | ``` 60 | 61 | All terms are highly significant. Note that the `Height` is more significant than in the previous model, despite the introduction of an additional parameter. 62 | 63 | We'll now try a different functional form - rather than looking for an additive model, we can explore a multiplicative model by applying a log-log transformation (leaving out the interaction term for now). 64 | ```{r} 65 | m4 = lm(log(Volume)~log(Girth)+log(Height),data=trees) 66 | summary(m4) 67 | ``` 68 | 69 | All terms are significant. Note that the residual standard error is much lower than for the previous models. However, this value cannot be compared with the previous models due to transforming the response variable. The R^2 value has increased further, despite reducing the number of parameters from four to three. 70 | ```{r} 71 | confint(m4) 72 | ``` 73 | 74 | Looking at the confidence intervals for the parameters reveals that the estimated power of `Girth` is around 2, and `Height` around 1. This makes a lot of sense, given the well-known dimensional relationship between `Volume`, `Girth` and `Height`! 75 | 76 | For completeness, we'll now add the interaction term. 77 | ```{r} 78 | m5 = lm(log(Volume)~log(Girth)*log(Height),data=trees) 79 | summary(m5) 80 | ``` 81 | 82 | The R^2 value has increased (of course, as all we've done is add an additional parameter), but interestingly none of the four terms are significant. This means that none of the individual terms alone are vital for the model - there is duplication of information between the variables. So we will revert back to the previous model. 83 | 84 | Given that it would be reasonable to expect the power of `Girth` to be 2, and Height to be 1, we will now fix those parameters, and instead just estimate the one remaining parameter. 
85 | ```{r} 86 | m6 = lm(log(Volume)-log((Girth^2)*Height)~1,data=trees) 87 | summary(m6) 88 | ``` 89 | 90 | Note that there is no R^2 (as only the intercept was included in the model), and that the Residual Standard Error is incomparable with previous models due to changing the response variable. 91 | 92 | We can alternatively construct a model with the response being y, and the error term additive rather than multiplicative. 93 | ```{r} 94 | m7 = lm(Volume~0+I(Girth^2):Height,data=trees) 95 | summary(m7) 96 | ``` 97 | 98 | Note that the parameter estimates for the last two models are slightly different... this is due to differences in the error model. 99 | 100 | # Section 2: Model Selection 101 | 102 | Of the last two models, the one with the log-Normal error model would seem to have the more Normal residuals. This can be inspected by looking at diagnostic plots, by and using the `shapiro.test()`: 103 | ```{r} 104 | plot(m6) 105 | plot(m7) 106 | shapiro.test(residuals(m6)) 107 | shapiro.test(residuals(m7)) 108 | ``` 109 | 110 | The Akaike Information Criterion (AIC) can help to make decisions regarding which model is the most appropriate. Now calculate the AIC for each of the above models: 111 | ```{r} 112 | summary(m1) 113 | AIC(m1) 114 | summary(m2) 115 | AIC(m2) 116 | summary(m3) 117 | AIC(m3) 118 | summary(m4) 119 | AIC(m4) 120 | summary(m5) 121 | AIC(m5) 122 | summary(m6) 123 | AIC(m6) 124 | summary(m7) 125 | AIC(m7) 126 | ``` 127 | 128 | Whilst the AIC can help differentiate between similar models, it cannot help deciding between models that have different responses. Which model would you select as the most appropriate? 129 | 130 | # Section 3: Stepwise Regression 131 | 132 | The in-built dataset `swiss` contains data pertaining to fertility, along with a variety of socioeconomic indicators. We want to select a sensible model using stepwise regression. First regress `Fertility` agains all available indicators: 133 | ```{r} 134 | m8 = lm(Fertility~.,data=swiss) 135 | summary(m8) 136 | ``` 137 | 138 | Are all terms significant? 139 | 140 | Now use stepwise regression, performing backward elimination in order to automatically remove inappropriate terms: 141 | ```{r} 142 | library(MASS) 143 | summary(stepAIC(m8)) 144 | ``` 145 | 146 | Are all terms significant? Is this model suitable? What are the pro's and con's of this approach? 147 | 148 | # Section 4: Non-Linear Models 149 | 150 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. 151 | 152 | In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Using Multiple Regression we trialled various models, including some that had multiple predictor variables and/or involved log-log transformations to explore power relationships. 153 | 154 | However, due to limitations of the method, we were not able to explore other options such as a parameterised power relationship with an additive error model. We will now attempt to fit this model: 155 | 156 | $y = \beta_0x_1^{\beta_1}x_2^{\beta_2}+\varepsilon$ 157 | 158 | Parameters for non-linear models may be estimated using the `nls` package in R. 
159 | 160 | ```{r} 161 | volume = trees$Volume 162 | height = trees$Height 163 | girth = trees$Girth 164 | m9 = nls(volume~beta0*girth^beta1*height^beta2,start=list(beta0=1,beta1=2,beta2=1)) 165 | summary(m9) 166 | ``` 167 | Note that the parameters `beta0`, `beta1` and `beta2` weren't defined prior to the function call - `nls` knew what to do with them. Also note that we had to provide starting points for the parameters. What happens if you change them? 168 | 169 | Are all terms significant? Is this model appropriate? What else could be tried to achieve a better model? 170 | 171 | # Section 5: Practical Exercises 172 | 173 | ## Puromycin 174 | 175 | The in-built R dataset `Puromycin` contains data regarding the reaction velocity versus 176 | substrate concentration in an enzymatic reaction involving untreated cells or cells 177 | treated with Puromycin. 178 | 179 | - Plot `conc` (concentration) against `rate`. What is the nature of the relationship 180 | between `conc` and `rate`? 181 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 182 | plot(conc~rate,data=Puromycin) 183 | # There is a non-linear positive relationship between conc and rate 184 | ``` 185 | 186 | - Find a transformation that linearises the data and stabilises the variance, 187 | making it possible to use linear regression. Create the corresponding linear 188 | regression model. Are all terms significant? 189 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 190 | plot(log(conc)~rate,data=Puromycin) 191 | m10 = lm(log(conc)~rate,data=Puromycin) 192 | plot(m10) 193 | summary(m10) 194 | # Both terms are significant 195 | ``` 196 | 197 | - Add the `state` term to the model. What type of variable is this? Is the 198 | inclusion of this term appropriate? 199 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 200 | m11 = lm(log(conc)~rate+state,data=Puromycin) 201 | plot(m11) 202 | summary(m11) 203 | # `state` is a boolean factor or indicator variable 204 | # The inclusion of `state` is appropriate, as the term is significant and the diagnostic plots look reasonable 205 | ``` 206 | 207 | - Now add a term representing the interaction between `rate` and `state`. Are all 208 | terms significant? What can you conclude? 209 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 210 | m12 = lm(log(conc)~rate*state,data=Puromycin) 211 | summary(m12) 212 | # The `state` term is not significant when the interaction between `rate` and `state` is included in the model. So it may be better to remove the `state` term from the model. 213 | ``` 214 | 215 | - Given this information, create the regression model you believe to be the most 216 | appropriate for modelling `conc`. Regenerate the plot of `conc` against `rate`. 217 | Draw curves corresponding to the fitted values of the final model onto this 218 | plot (note that two separate curves should be drawn, corresponding to the 219 | two levels of `state`). 
220 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 221 | m13 = lm(log(conc)~rate+rate:state,data=Puromycin) 222 | summary(m13) 223 | 224 | # Solution one: 225 | plot(conc~rate,data=Puromycin) 226 | idx = order(Puromycin$rate) 227 | treated = Puromycin$state[idx] == "treated" 228 | untreated = Puromycin$state[idx] == "untreated" 229 | lines(exp(fitted(m13))[idx][treated]~Puromycin$rate[idx][treated]) 230 | lines(exp(fitted(m13))[idx][untreated]~Puromycin$rate[idx][untreated],col="red") 231 | 232 | # Solution two (better - more general): 233 | plot(conc~rate,data=Puromycin) 234 | xvals = range(Puromycin$rate)[1]:range(Puromycin$rate)[2] 235 | lines(exp(coef(m13)[1] + coef(m13)[2]*xvals) ~ xvals) 236 | lines(exp(coef(m13)[1] + coef(m13)[2]*xvals + coef(m13)[3]*xvals) ~ xvals, col="red") 237 | ``` 238 | 239 | ## Attitude 240 | 241 | The in-built R dataset `attitude` contains data from a survey of clerical employees. 242 | 243 | - Create a linear model regressing `rating` on `complaints`, and store the model 244 | in a variable. 245 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 246 | m14 = lm(rating~complaints,data=attitude) 247 | ``` 248 | 249 | - Use the step function to perform forward selection stepwise regression, in order to automatically add appropriate terms, using a command similar to: 250 | `new_model = step(original_model,.~.+privileges+learning+raises+critical+advance)` 251 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 252 | m15 = step(m14,.~.+privileges+learning+raises+critical+advance) 253 | ``` 254 | 255 | - Which term(s) were added? What is Akaike's Information Criterion (AIC) corresponding to the final model? Are all terms in the resulting model significant? Check diagnostic plots. Do you think this is a suitable model? 256 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 257 | # just the `learning` term was added 258 | # AIC = 118.00 for the final model 259 | summary(m15) 260 | # Only the `complaints` term is significant in the final model - the intercept and coefficient of `learning` are not significant. 261 | plot(m15) 262 | # Despite the residuals not being perfectly Normally distributed, the model does seem reasonable. 263 | ``` 264 | -------------------------------------------------------------------------------- /multiple_regression.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Multiple Regression with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | # Section 1: Multiple Regression 31 | 32 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Now we will expand this by considering `Height` as another predictor. 
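It can be worth a quick look at the structure of the data first (an optional check, not in the original walkthrough):

```{r}
str(trees)   # 31 observations of three numeric variables: Girth, Height, Volume
head(trees)
```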
33 | 34 | Start by plotting the dataset: 35 | ```{r} 36 | plot(trees) 37 | ``` 38 | 39 | This plots all variables against each other, enabling visual information about correlations within the dataset. 40 | 41 | Re-create the original model of `Volume` against `Girth`: 42 | ```{r} 43 | m1 = lm(Volume~Girth,data=trees) 44 | summary(m1) 45 | ``` 46 | 47 | Now include `Height` as an additional variable: 48 | ```{r} 49 | m2 = lm(Volume~Girth+Height,data=trees) 50 | summary(m2) 51 | ``` 52 | 53 | Note that the R^2 has improved, yet the `Height` term is less significant than the other two parameters. 54 | 55 | Try including the interaction term between `Girth` and `Height`: 56 | ```{r} 57 | m3 = lm(Volume~Girth*Height,data=trees) 58 | summary(m3) 59 | ``` 60 | 61 | All terms are highly significant. Note that the `Height` is more significant than in the previous model, despite the introduction of an additional parameter. 62 | 63 | We'll now try a different functional form - rather than looking for an additive model, we can explore a multiplicative model by applying a log-log transformation (leaving out the interaction term for now). 64 | ```{r} 65 | m4 = lm(log(Volume)~log(Girth)+log(Height),data=trees) 66 | summary(m4) 67 | ``` 68 | 69 | All terms are significant. Note that the residual standard error is much lower than for the previous models. However, this value cannot be compared with the previous models due to transforming the response variable. The R^2 value has increased further, despite reducing the number of parameters from four to three. 70 | ```{r} 71 | confint(m4) 72 | ``` 73 | 74 | Looking at the confidence intervals for the parameters reveals that the estimated power of `Girth` is around 2, and `Height` around 1. This makes a lot of sense, given the well-known dimensional relationship between `Volume`, `Girth` and `Height`! 75 | 76 | For completeness, we'll now add the interaction term. 77 | ```{r} 78 | m5 = lm(log(Volume)~log(Girth)*log(Height),data=trees) 79 | summary(m5) 80 | ``` 81 | 82 | The R^2 value has increased (of course, as all we've done is add an additional parameter), but interestingly none of the four terms are significant. This means that none of the individual terms alone are vital for the model - there is duplication of information between the variables. So we will revert back to the previous model. 83 | 84 | Given that it would be reasonable to expect the power of `Girth` to be 2, and Height to be 1, we will now fix those parameters, and instead just estimate the one remaining parameter. 85 | ```{r} 86 | m6 = lm(log(Volume)-log((Girth^2)*Height)~1,data=trees) 87 | summary(m6) 88 | ``` 89 | 90 | Note that there is no R^2 (as only the intercept was included in the model), and that the Residual Standard Error is incomparable with previous models due to changing the response variable. 91 | 92 | We can alternatively construct a model with the response being y, and the error term additive rather than multiplicative. 93 | ```{r} 94 | m7 = lm(Volume~0+I(Girth^2):Height,data=trees) 95 | summary(m7) 96 | ``` 97 | 98 | Note that the parameter estimates for the last two models are slightly different... this is due to differences in the error model. 99 | 100 | # Section 2: Model Selection 101 | 102 | Of the last two models, the one with the log-Normal error model would seem to have the more Normal residuals. 
This can be inspected by looking at diagnostic plots, by and using the `shapiro.test()`: 103 | ```{r} 104 | plot(m6) 105 | plot(m7) 106 | shapiro.test(residuals(m6)) 107 | shapiro.test(residuals(m7)) 108 | ``` 109 | 110 | The Akaike Information Criterion (AIC) can help to make decisions regarding which model is the most appropriate. Now calculate the AIC for each of the above models: 111 | ```{r} 112 | summary(m1) 113 | AIC(m1) 114 | summary(m2) 115 | AIC(m2) 116 | summary(m3) 117 | AIC(m3) 118 | summary(m4) 119 | AIC(m4) 120 | summary(m5) 121 | AIC(m5) 122 | summary(m6) 123 | AIC(m6) 124 | summary(m7) 125 | AIC(m7) 126 | ``` 127 | 128 | Whilst the AIC can help differentiate between similar models, it cannot help deciding between models that have different responses. Which model would you select as the most appropriate? 129 | 130 | # Section 3: Stepwise Regression 131 | 132 | The in-built dataset `swiss` contains data pertaining to fertility, along with a variety of socioeconomic indicators. We want to select a sensible model using stepwise regression. First regress `Fertility` agains all available indicators: 133 | ```{r} 134 | m8 = lm(Fertility~.,data=swiss) 135 | summary(m8) 136 | ``` 137 | 138 | Are all terms significant? 139 | 140 | Now use stepwise regression, performing backward elimination in order to automatically remove inappropriate terms: 141 | ```{r} 142 | library(MASS) 143 | summary(stepAIC(m8)) 144 | ``` 145 | 146 | Are all terms significant? Is this model suitable? What are the pro's and con's of this approach? 147 | 148 | # Section 4: Non-Linear Models 149 | 150 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. 151 | 152 | In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Using Multiple Regression we trialled various models, including some that had multiple predictor variables and/or involved log-log transformations to explore power relationships. 153 | 154 | However, due to limitations of the method, we were not able to explore other options such as a parameterised power relationship with an additive error model. We will now attempt to fit this model: 155 | 156 | $y = \beta_0x_1^{\beta_1}x_2^{\beta_2}+\varepsilon$ 157 | 158 | Parameters for non-linear models may be estimated using the `nls` package in R. 159 | 160 | ```{r} 161 | volume = trees$Volume 162 | height = trees$Height 163 | girth = trees$Girth 164 | m9 = nls(volume~beta0*girth^beta1*height^beta2,start=list(beta0=1,beta1=2,beta2=1)) 165 | summary(m9) 166 | ``` 167 | Note that the parameters `beta0`, `beta1` and `beta2` weren't defined prior to the function call - `nls` knew what to do with them. Also note that we had to provide starting points for the parameters. What happens if you change them? 168 | 169 | Are all terms significant? Is this model appropriate? What else could be tried to achieve a better model? 170 | 171 | # Section 5: Practical Exercises 172 | 173 | ## Puromycin 174 | 175 | The in-built R dataset `Puromycin` contains data regarding the reaction velocity versus 176 | substrate concentration in an enzymatic reaction involving untreated cells or cells 177 | treated with Puromycin. 178 | 179 | - Plot `conc` (concentration) against `rate`. What is the nature of the relationship 180 | between `conc` and `rate`? 181 | - Find a transformation that linearises the data and stabilises the variance, 182 | making it possible to use linear regression. 
Create the corresponding linear 183 | regression model. Are all terms significant? 184 | - Add the `state` term to the model. What type of variable is this? Is the 185 | inclusion of this term appropriate? 186 | - Now add a term representing the interaction between `rate` and `state`. Are all 187 | terms significant? What can you conclude? 188 | - Given this information, create the regression model you believe to be the most 189 | appropriate for modelling `conc`. Regenerate the plot of `conc` against `rate`. 190 | Draw curves corresponding to the fitted values of the final model onto this 191 | plot (note that two separate curves should be drawn, corresponding to the 192 | two levels of `state`). 193 | 194 | ## Attitude 195 | 196 | The in-built R dataset `attitude` contains data from a survey of clerical employees. 197 | 198 | - Create a linear model regressing `rating` on `complaints`, and store the model 199 | in a variable. 200 | - Use the step function to perform forward selection stepwise regression, in order to automatically add appropriate terms, using a command similar to: 201 | `new_model = step(original_model,.~.+privileges+learning+raises+critical+advance)` 202 | - Which term(s) were added? What is Akaike's Information Criterion (AIC) corresponding to the final model? Are all terms in the resulting model significant? Check diagnostic plots. Do you think this is a suitable model? 203 | 204 | -------------------------------------------------------------------------------- /multiple_regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/multiple_regression.pdf -------------------------------------------------------------------------------- /r-recap.Rmd: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | title: "Recap of Statistical Analysis in R" 4 | author: Chandra Chilamakuri, Dominique-Laurent Couturier, Rob Nicholls 5 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 6 | output: 7 | html_notebook: 8 | toc: yes 9 | toc_float: yes 10 | --- 11 | 12 | 13 | 14 | 15 | # Introduction 16 | 17 | The purpose of this section is to review some of the key concepts in basic R usage, and statistical testing 18 | 19 | - Reading data into R 20 | - The data-frame representation of data in R 21 | - Selecting rows and columns from a data frame 22 | - Computing numerical summaries 23 | - Basic plotting 24 | - Getting help on functions in RStudio 25 | 26 | 27 | ## About this tutorial 28 | 29 | - The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left panel when RStudio opens for first time). 30 | - However, for this course we will use a relatively new feature called R-notebooks. 31 | - An R-notebook mixes plain text with R code 32 | + The R code can be run from inside the document and the results are displayed directly underneath 33 | - Each chunk of R code looks something like this. 34 | 35 | ```{r} 36 | 37 | ``` 38 | 39 | - Each line of R can be executed by clicking on the line and pressing CTRL and ENTER 40 | - Or you can press the green triangle on the right-hand side to run everything in the chunk 41 | + Try this now! 
42 | 43 | ```{r} 44 | print("Hello World") 45 | ``` 46 | 47 | - You can add R chunks by pressing CTRL + ALT + I 48 | + or using the Insert menu option 49 | + (can also include code from other languages such as Python or bash) 50 | 51 | The document may also contain other formatting options that are used to render the HTML (or PDF, Word) output. 52 | 53 | Here is some *italic* text, but we can also write in **bold**, or write things 54 | 55 | - in 56 | - a 57 | - list 58 | + which include sub-lists 59 | 60 | 61 | 62 | # Example Analysis 63 | 64 | We will use a dataset from The University of Sheffield Mathematics and Statistics Help group ([MASH](https://www.sheffield.ac.uk/mash/statistics2/anova)). 65 | 66 | > The data set Diet.csv contains information on 78 people who undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight so the independent variable (group) is diet. 67 | 68 | 69 | ## Reading and inspecting the data 70 | 71 | 72 | Like other software (Word, Excel, Photoshop….), R has a default location where it will save files to and import data from. This is known as the working directory in R. You can query what R currently considers its working directory by executing the following R command:- 73 | 74 | ```{r} 75 | getwd() 76 | ``` 77 | 78 | *N.B.* Here, a set of open and closed brackets () is used to run the `getwd` function with no arguments. 79 | *Note:* if you are following this material on a Windows machine as opposed to a Linux or MacOS machine 80 | you will get a path like C:\Users\Fred. If you want to use the complementary R command `setwd()` to set 81 | the working directory you MUST escape the \ i.e. setwd("C:\\Users\\Fred"). 82 | We can also list the files in a specific directory with:- 83 | 84 | ```{r} 85 | list.files("data/") 86 | ``` 87 | 88 | A useful sanity check is the `file.exists` function, which will print TRUE if the file can be found in the working directory. 89 | 90 | ```{r} 91 | file.exists("data/diet.csv") 92 | ``` 93 | 94 | 95 | 96 | - Assuming the file can be found, we can use the `read.csv` function to import the data. Other functions can be used to read tab-delimited files (`read.delim`), or the more generic `read.table`. A data frame object is created. 97 | - The file name `diet.csv` is the only *argument* to the function `read.csv` 98 | + arguments are listed inside the brackets 99 | + for functions requiring more than one argument (input), arguments are separated by commas 100 | + a function may have default values for some arguments, meaning they do not need to be specified 101 | - The characters `<-` are used to tell R to create a variable 102 | + without this, the data are not loaded into memory and you won't be able to work with them 103 | - If you get an error saying `Error in file(file, “rt”) : cannot open the connection...`, you might need to change your working directory or make sure the file name is typed correctly (R is case-sensitive) 104 | - Typing the name of an object will cause R to print the contents to the screen 105 | 106 | ```{r} 107 | diet <- read.csv("data/diet.csv") 108 | diet 109 | ``` 110 | 111 | ### A note on importing your own data 112 | 113 | If you are trying to read your own data, and encounter an error at this stage, you may need to consider if your data are in the correct form for analysis.
Like most programming languages, R will struggle if your spreadsheet has been heavily formatted to include colours, formulas and special formatting. 114 | 115 | These references will guide you through some of the pitfalls and common mistakes to avoid when formatting data 116 | 117 | - [Formatting data tables in Spreadsheets](http://www.datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html) 118 | - [Data Organisation tutorial by Karl Broman](http://kbroman.org/dataorg/) 119 | - [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide/blob/master/README.md) 120 | 121 | 122 | 123 | 124 | `diet` is an example of a data frame. The data frame object in R allows us to work with “tabular” data, like we might be used to dealing with in Excel, where our data can be thought of having rows and columns. The values in each column have to all be of the same type (i.e. all numbers or all text). 125 | 126 | - the `summary` function will provide a overview of the contents of each column in the table 127 | + the type of summary provided depends on the data type in each column 128 | 129 | ```{r} 130 | summary(diet) 131 | ``` 132 | 133 | - particular columns can be accessed using the `$` operator 134 | + ***TIP*** RStudio will allow auto-complete using the *Tab* key 135 | 136 | ```{r} 137 | diet$gender 138 | diet$age 139 | 140 | 141 | ``` 142 | 143 | We can create new columns based on existing ones 144 | 145 | ```{r} 146 | diet$weight.loss <- diet$final.weight - diet$initial.weight 147 | 148 | ``` 149 | 150 | Subsetting rows and columns is done using the `[rows, columns]` syntax; where `rows` and `columns` are *vectors* containing the rows and columns you want 151 | 152 | - you can choose to omit either vector to show all rows and columns. *However, you still need to remember the `,` 153 | 154 | ```{r} 155 | diet[1:5,] 156 | diet[,2:3] 157 | ``` 158 | 159 | Logical tests can be used to select rows. e.g. using `==`, `<`, `>` 160 | 161 | ```{r} 162 | diet$diet.type == "A" 163 | 164 | dietA <- diet[diet$diet.type == "A",] 165 | dietA 166 | ``` 167 | 168 | 169 | 170 | ## Visualisation 171 | 172 | All your favourite types of plot can be created in R 173 | 174 | 175 | - Simple plots are supported in the *base* distribution of R (what you get automatically when you download R). 176 | + `boxplot`, `hist`, `barplot`,... all of which are extensions of the basic `plot` function 177 | - Many different customisations are possible 178 | + colour, overlay points / text, legends, multi-panel figures 179 | - ***You need to think about how best to visualise your data*** 180 | + http://www.bioinformatics.babraham.ac.uk/training.html#figuredesign 181 | - R cannot prevent you from creating a plotting disaster: 182 | + http://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6?op=1&IR=T 183 | 184 | 185 | Plots can be constructed from vectors of numeric data, such as the data we get from a particular column in a data frame. 
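For example (a small illustrative chunk that is not part of the original notebook, reusing the `diet` data frame loaded above), a barplot of the number of people following each diet can be built directly from a single column:

```{r}
# tabulate the diet.type column, then plot the counts as a barplot
barplot(table(diet$diet.type),
        xlab = "Diet Type", ylab = "Number of people")
```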
186 | 187 | - a histogram is commonly-used to examine the distribution of a particular variable 188 | 189 | ```{r} 190 | hist(diet$weight.loss) 191 | ``` 192 | 193 | - a boxplot is often used to compare distributions visually 194 | + if given a data-frame, each column will be shown as a separate box 195 | + otherwise the formula syntax `~` is used to define x and y variables 196 | 197 | ```{r} 198 | boxplot(diet$weight.loss~diet$diet.type) 199 | 200 | ``` 201 | 202 | - scatter plots can be constructed by given two vectors as arguments to `plot` 203 | 204 | ```{r} 205 | plot(diet$age,diet$initial.weight) 206 | ``` 207 | 208 | 209 | *Lots* of customisations are possible to enhance the appaerance of our plots. Not for the faint-hearted, the help pages `?plot` and `?par` give the full details. In short, 210 | 211 | - Axis labels, and titles can be specified as character strings. 212 | 213 | - R recognises many preset names as colours. To get a full list use `colours()`, or check this [online reference](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf). 214 | + can also use `*R*ed, *G*reen, *B*lue values; which you might get from a paint program 215 | - Plotting characters can be specified using a pre-defined number 216 | 217 | ```{r} 218 | boxplot(diet$weight.loss~diet$diet.type, 219 | ylab="Weight Loss", 220 | xlab="Diet Type", 221 | col=c("yellow","blue","red"), 222 | main="Weight Loss According to diet type") 223 | ``` 224 | 225 | You can get help on any of the functions that we will be using in this course by using the '?' or 'help()' commands. The help will appear in the help pane (usually bottom RH corner) . 226 | 227 | ```{r} 228 | ?lm 229 | ``` 230 | 231 | ```{r} 232 | help(lm) 233 | ``` 234 | -------------------------------------------------------------------------------- /simple_regression+.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Simple Regression with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | # Section 1: Correlation Coefficients 31 | 32 | We'll start by generating some synthetic data to investigate correlation coefficients. 
33 | 34 | Generate 50 random numbers in the range [0,50]: 35 | ```{r} 36 | x = runif(50,0,50) 37 | ``` 38 | 39 | Now let's generate some y-values that are linearly correlated with the x-values with gradient=1, applying a random Normal offset (with sd=5): 40 | ```{r} 41 | y = x + rnorm(50,0,5) 42 | ``` 43 | 44 | Plotting y against x, you'll observe a positive linear relationship: 45 | ```{r} 46 | plot(y~x) 47 | ``` 48 | 49 | This strong linear relationship is reflected in the correlation coefficient and in the coefficient of determination (R^2): 50 | ```{r} 51 | pearson_cor_coef = cor(x,y) 52 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 53 | ``` 54 | 55 | If the data exhibit a negative linear correlation then the correlation coefficient will become strong and negative, whilst the R^2 value will remain strong and positive: 56 | ```{r} 57 | y = -x + rnorm(50,0,5) 58 | plot(y~x) 59 | pearson_cor_coef = cor(x,y) 60 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 61 | ``` 62 | 63 | If data are uncorrelated then both the correlation coefficient and R^2 values will be close to zero: 64 | ```{r} 65 | y = rnorm(50,0,5) 66 | plot(y~x) 67 | pearson_cor_coef = cor(x,y) 68 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 69 | ``` 70 | 71 | The significance of a correlation can be tested using `cor.test()`, which also provides a 95% confidence interval on the correlation: 72 | ```{r} 73 | cor.test(x,y) 74 | ``` 75 | 76 | In this case, the value 0 is contained within the confidence interval, indivating that there is insufficient evidence to reject the null hypothesis that the true correlation is equal to zero. 77 | 78 | # Section 2: Simple Regression 79 | 80 | Now let's look at some real data. 81 | 82 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. 83 | 84 | We will now attempt to construct a simple linear model that uses `Girth` to predict `Volume`. 85 | ```{r} 86 | plot(Volume~Girth,data=trees) 87 | m1 = lm(Volume~Girth,data=trees) 88 | abline(m1) 89 | cor.test(trees$Volume,trees$Girth) 90 | ``` 91 | 92 | It is evident that `Volume` and `Girth` are highly correlated. 93 | 94 | The summary for the linear model provides information regarding the quality of the model: 95 | ```{r} 96 | summary(m1) 97 | ``` 98 | 99 | Model residuals can be readily accessed using the `residuals()` function: 100 | ```{r} 101 | hist(residuals(m1),breaks=10,col="light grey") 102 | ``` 103 | 104 | Diagnostic plots for the model can reveal whether or not modelling assumptions are reasonable. In this case, there is visual evidence to suggest that the assumptions are not satisfied - note in particular the trend observed in the plot of residuals vs fitted values: 105 | ```{r} 106 | plot(m1) 107 | ``` 108 | 109 | # Section 3: Assessing the quality of linear models 110 | 111 | Let's see what happens if we try to describe a non-linear relationship using a linear model. Consider the sine function in the range [0,1.5*pi): 112 | ```{r} 113 | z = seq(0,1.5*pi,0.2) 114 | plot(sin(z)~z) 115 | m2 = lm(sin(z)~z) 116 | abline(m2) 117 | ``` 118 | 119 | In this case, it is clear that a linear model is not appropriate for describing the relationship. 
However, we are able to fit a linear model, and the linear model summary does not identify any major concerns: 120 | ```{r} 121 | summary(m2) 122 | ``` 123 | Here we see that the overall p-value is low enough to suggest that the model has significant utility, and both terms (the intercept and the coefficient of `z`) are significantly different from zero. The R^2 value of 0.5422 is high enough to indicate that there is a reasonably strong correlation between `sin(z)` and `z` in this range. 124 | 125 | This information is misleading, as we know that a linear model is inappropriate in this case. Indeed, the linear model summary does not check whether the underlying model assumptions are satisfied. 126 | 127 | By observing strong patterns in the diagnostic plots, we can see that the modelling assumptions are not satisfied in this case. 128 | ```{r} 129 | plot(m2) 130 | ``` 131 | 132 | 133 | # Section 4: Modelling Non-Linear Relationships 134 | 135 | It is sometimes possible to use linear models to describe non-linear relationships (which is perhaps counterintuitive!). This can be achieved by applying transformations to the variable(s) in order to linearise the relationship, whilst ensuring that modelling assumptions are satisfied. 136 | 137 | Another in-built dataset `cars` provides the speeds and associated stopping distances of cars in the 1920s. 138 | 139 | Let's construct a linear model to predict stopping distance using speed: 140 | 141 | ```{r} 142 | plot(dist~speed,data=cars) 143 | m3 = lm(dist~speed,data=cars) 144 | abline(m3) 145 | summary(m3) 146 | ``` 147 | 148 | The model summary indicates that the intercept term does not have significant utility. So that term could/should be removed from the model. 149 | 150 | In addition, the plot of residuals versus fitted values indicates potential issues with variance stability: 151 | ```{r} 152 | plot(m3) 153 | ``` 154 | 155 | In this case, variance stability can be aided by a square-root transformation of the response variable: 156 | ```{r} 157 | plot(sqrt(dist)~speed,data=cars) 158 | m4 = lm(sqrt(dist)~speed,data=cars) 159 | abline(m4) 160 | plot(m4) 161 | summary(m4) 162 | ``` 163 | 164 | The R^2 value is improved over the previous model. 165 | Note again that the intercept term is not significant. 166 | 167 | We'll now try a log-log transformation, that is, applying a log transformation to both the predictor and response variables. This represents a power relationship between the two variables. 168 | ```{r} 169 | plot(log(dist)~log(speed),data=cars) 170 | m5 = lm(log(dist)~log(speed),data=cars) 171 | abline(m5) 172 | plot(m5) 173 | summary(m5) 174 | ``` 175 | 176 | The R^2 value is improved, and the diagnostic plots don't look too unreasonable. However, again the intercept term does not have significant utility. So we'll now remove it from the model: 177 | ```{r} 178 | m6 = lm(log(dist)~0+log(speed),data=cars) 179 | plot(m6) 180 | summary(m6) 181 | ``` 182 | 183 | This model seems reasonable. However, remember that R^2 values corresponding to models without an intercept aren't meaningful (or at least can't be compared against models with an intercept term).
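If we do want to compare the log-log fits with and without an intercept, two options are sketched below (an addition to the original text, reusing `m5` and `m6` from above). AIC is applicable here because both models share the same response, `log(dist)`; the squared correlation between observed and fitted values is another summary that can be computed for both models:

```{r}
# AIC comparison of the log-log models with (m5) and without (m6) an intercept
AIC(m5, m6)
# squared correlation between observed and fitted log(dist) for each model
cor(log(cars$dist), fitted(m5))^2
cor(log(cars$dist), fitted(m6))^2
```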
184 | 185 | We can now transform the model back, and display the regression curve on the plot: 186 | ```{r} 187 | plot(dist~speed,data=cars) 188 | x = order(cars$speed) 189 | lines(exp(fitted(m6))[x]~cars$speed[x]) 190 | ``` 191 | 192 | # Section 5: Relationship between the t-test, ANOVA and linear regression 193 | 194 | In the ANOVA session we looked at the `diet` dataset, and performed the t-test and ANOVA. Here's a recap: 195 | 196 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 197 | # import 198 | diet = read.csv("data/diet.csv",row.names=1) 199 | diet$weight.loss = diet$initial.weight - diet$final.weight 200 | diet$diet.type = factor(diet$diet.type,levels=c("A","B","C")) 201 | diet$gender = factor(diet$gender,levels=c("Female","Male")) 202 | # comparison 203 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="B",],var.equal = TRUE) 204 | summary(aov(weight.loss~diet.type,data=diet[diet$diet.type!="B",])) 205 | ``` 206 | 207 | Note that the p-values for both the t-test and ANOVA are the same. This is because these tests are equivalent (in the 2-sample case). They both test the same hypothesis. 208 | 209 | Also, the F-test statistic is equal to the square of the t-test statistic (-2.8348^2 = 8.036). Again, this is only true for the 2-sample case. 210 | 211 | Now let's use a different strategy. Instead of directly testing whether there is a difference between the two groups, let's attempt to create a linear model describing the relationship between `weight.loss` and `diet.type`. Indeed, it is possible to construct a linear model where the independent variable(s) are categorical - they do not have to be continuous or even ordinal! 212 | 213 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 214 | summary(lm(weight.loss~diet.type,data=diet[diet$diet.type!="B",])) 215 | ``` 216 | 217 | You can see that the p-value corresponding to the `diet.type` term is the same as the overall p-value of the linear model, which is also the same as the p-value from the t-test and ANOVA. Note also that the F-test statistic is the same as given by the ANOVA. 218 | 219 | So, we are also able to use the linear model to test the hypothesis that there is a difference between the two diet groups, as well as provide a more detailed description of the relationship between `weight.loss` and `diet.type`. 220 | 221 | # Section 6: Practical Exercises 222 | 223 | ## Old Faithful 224 | 225 | The inbuilt R dataset `faithful` pertains to the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. 226 | 227 | - Create a simple linear regression model that models the eruption duration `faithful$eruptions` using waiting time `faithful$waiting` as the independent variable, storing the model in a variable. Look at the summary of the model. 228 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 229 | m7 = lm(eruptions~waiting,data=faithful) 230 | summary(m7) 231 | ``` 232 | + What are the values of the estimates of the intercept and coefficient of 'waiting'? 233 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 234 | # intercept = -1.874016 235 | # coef of waiting = 0.075628 236 | ``` 237 | + What is the R^2 value? 238 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 239 | # R^2 = 0.8115 240 | ``` 241 | + Does the model have significant utility? 
242 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 243 | # Yes, the model does have significant utility 244 | ``` 245 | + Are neither, one, or both of the parameters significantly different from zero? 246 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 247 | # Both of the parameters are significantly different from zero 248 | ``` 249 | + Can you conclude that there is a linear relationship between the two variables? 250 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 251 | # In the absence of other information, this summary would indicate a linear relationship between the two variables. However, we cannot conclude that without first checking that the modelling assumptions have been satistified... 252 | ``` 253 | - Plot the eruption duration against waiting time. Is there anything noticeable about the data? 254 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 255 | plot(eruptions~waiting,data=faithful) 256 | # The observations appear to cluster in two groups. 257 | ``` 258 | - Draw the regression line corresponding to your model onto the plot. Based on this graphical representation, does the model seem reasonable? 259 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 260 | plot(eruptions~waiting,data=faithful) 261 | abline(m7) 262 | # At a glance, the model seems to describe the overall dependence of eruptions on waiting time reasonably well. However, this is misleading... 263 | ``` 264 | - Generate the four diagnostic plots corresponding to your model. Contemplate the appropriateness of the model for describing the relationship between eruption duration and waiting time. 265 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 266 | plot(m7) 267 | # There is strong systematic behaviour in the plot of residuals versus fitted values. This indicates that the relationship/dependence is different or more complicated than can be described with the simple linear model. 268 | # Specifically, it should be identified what causes observations to fall into one or other of the two groups. Differences between the two groups should be accounted for when modelling the relationship. It seems that the direct dependence of `eruptions` on `waiting` is not as strong as is indicated by the simple linear model. 269 | ``` 270 | 271 | ## Anscombe datasets 272 | 273 | Consider the inbuilt R dataset `anscombe`. This dataset contains four x-y datasets, contained in the columns: (x1,y1), (x2,y2), (x3,y3) and (x4,y4). 274 | 275 | - For each of the four datasets, calculate and test the correlation between the x and y variables. What do you conclude? 276 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 277 | cor(anscombe$x1,anscombe$y1) 278 | cor.test(anscombe$x1,anscombe$y1) 279 | cor(anscombe$x2,anscombe$y2) 280 | cor.test(anscombe$x2,anscombe$y2) 281 | cor(anscombe$x3,anscombe$y3) 282 | cor.test(anscombe$x3,anscombe$y3) 283 | cor(anscombe$x4,anscombe$y4) 284 | cor.test(anscombe$x4,anscombe$y4) 285 | # All four datasets seem to exhibit positive linear relationships, with the same correlation and the same p-value. 286 | ``` 287 | - For each of the four datasets, create a linear model that regresses y on x. Look at the summaries corresponding to these models. What do you conclude? 
288 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 289 | summary(lm(anscombe$y1~anscombe$x1)) 290 | summary(lm(anscombe$y2~anscombe$x2)) 291 | summary(lm(anscombe$y3~anscombe$x3)) 292 | summary(lm(anscombe$y4~anscombe$x4)) 293 | # The summaries are essentially identical for all four linear models. 294 | ``` 295 | - For each of the four datasets, create a plot of y against x. What do you conclude? 296 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 297 | plot(anscombe$y1~anscombe$x1) 298 | plot(anscombe$y2~anscombe$x2) 299 | plot(anscombe$y3~anscombe$x3) 300 | plot(anscombe$y4~anscombe$x4) 301 | # The four datasets are very different, with very different relationships between the x and y variables. 302 | # This demonstrates how very different datasets can appear to be very similar when looking solely at summary statistics. 303 | # We conclude that it is always important to peform exploratory data analysis, and look at the data before modelling. 304 | ``` 305 | 306 | 307 | ## Pharmacokinetics of Indomethacin 308 | 309 | Consider the inbuilt R dataset `Indometh`, which contains data on the pharmacokinetics of indometacin. 310 | 311 | - Plot `Indometh$time` versus `Indometh$conc` (concentration). What is the nature of the relationship 312 | between `time` and `conc`? 313 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 314 | plot(time~conc,data=Indometh) 315 | # There is a non-linear negative relationship between time and conc 316 | ``` 317 | - Apply monotonic transformations to the data so that a simple linear regression model can be used to model the relationship (ensure both linearity and stabilised variance, within reason). Create a plot of the transformed data, to confirm that the relationship seems linear. 318 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 319 | plot(log(time)~log(conc),data=Indometh) 320 | ``` 321 | - After creating the linear model, inspect the diagnostic plots to ensure that the 322 | assumptions are not violated (too much). Are there any outliers with large influence? What are the parameter estimates? Are both terms significant? 323 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 324 | m8 = lm(log(time)~log(conc),data=Indometh) 325 | plot(m8) 326 | # The diagnostic plots indicate that the residuals aren't perfectly Normally distributed, but the modelling assumptions aren't violated so much as to inhibit construction of a model. 327 | summary(m8) 328 | # Intercept = -0.4203 329 | # Coefficient of log(conc) = -0.9066 330 | # Both terms are significantly different from zero. 331 | ``` 332 | - Add a line to the plot showing the linear relationship between the transformed data. 333 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 334 | plot(log(time)~log(conc),data=Indometh) 335 | abline(m8) 336 | ``` 337 | - Now regenerate the original plot of `time` versus `conc` (i.e. the untransformed 338 | data). Using the `lines` function, add a curve to the plot corresponding to the 339 | fitted values of the model. 340 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 341 | plot(time~conc,data=Indometh) 342 | idx <- order(Indometh$conc) 343 | lines(exp(fitted(m8))[idx]~Indometh$conc[idx]) 344 | ``` 345 | -------------------------------------------------------------------------------- /simple_regression.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Simple Regression with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. 
Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | # Section 1: Correlation Coefficients 31 | 32 | We'll start by generating some synthetic data to investigate correlation coefficients. 33 | 34 | Generate 50 random numbers in the range [0,50]: 35 | ```{r} 36 | x = runif(50,0,50) 37 | ``` 38 | 39 | Now let's generate some y-values that are linearly correlated with the x-values with gradient=1, applying a random Normal offset (with sd=5): 40 | ```{r} 41 | y = x + rnorm(50,0,5) 42 | ``` 43 | 44 | Plotting y against x, you'll observe a positive linear relationship: 45 | ```{r} 46 | plot(y~x) 47 | ``` 48 | 49 | This strong linear relationship is reflected in the correlation coefficient and in the coefficient of determination (R^2): 50 | ```{r} 51 | pearson_cor_coef = cor(x,y) 52 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 53 | ``` 54 | 55 | If the data exhibit a negative linear correlation then the correlation coefficient will become strong and negative, whilst the R^2 value will remain strong and positive: 56 | ```{r} 57 | y = -x + rnorm(50,0,5) 58 | plot(y~x) 59 | pearson_cor_coef = cor(x,y) 60 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 61 | ``` 62 | 63 | If data are uncorrelated then both the correlation coefficient and R^2 values will be close to zero: 64 | ```{r} 65 | y = rnorm(50,0,5) 66 | plot(y~x) 67 | pearson_cor_coef = cor(x,y) 68 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 69 | ``` 70 | 71 | The significance of a correlation can be tested using `cor.test()`, which also provides a 95% confidence interval on the correlation: 72 | ```{r} 73 | cor.test(x,y) 74 | ``` 75 | 76 | In this case, the value 0 is contained within the confidence interval, indivating that there is insufficient evidence to reject the null hypothesis that the true correlation is equal to zero. 77 | 78 | # Section 2: Simple Regression 79 | 80 | Now let's look at some real data. 81 | 82 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. 83 | 84 | We will now attempt to construct a simple linear model that uses `Girth` to predict `Volume`. 85 | ```{r} 86 | plot(Volume~Girth,data=trees) 87 | m1 = lm(Volume~Girth,data=trees) 88 | abline(m1) 89 | cor.test(trees$Volume,trees$Girth) 90 | ``` 91 | 92 | It is evident that `Volume` and `Girth` are highly correlated. 93 | 94 | The summary for the linear model provides information regarding the quality of the model: 95 | ```{r} 96 | summary(m1) 97 | ``` 98 | 99 | Model residuals can be readily accessed using the `residuals()` function: 100 | ```{r} 101 | hist(residuals(m1),breaks=10,col="light grey") 102 | ``` 103 | 104 | Diagnostic plots for the model can reveal whether or not modelling assumptions are reasonable. 
In this case, there is visual evidence to suggest that the assumptions are not satisfied - note in particular the trend observed in the plot of residuals vs fitted values: 105 | ```{r} 106 | plot(m1) 107 | ``` 108 | 109 | # Section 3: Assessing the quality of linear models 110 | 111 | Let's see what happens if we try to describe a non-linear relationship using a linear model. Consider the sine function in the range [0,1.5*pi): 112 | ```{r} 113 | z = seq(0,1.5*pi,0.2) 114 | plot(sin(z)~z) 115 | m0 = lm(sin(z)~z) 116 | abline(m0) 117 | ``` 118 | 119 | In this case, it is clear that a linear model is not appropriate for describing the relationship. However, we are able to fit a linear model, and the linear model summary does not identify any major concerns: 120 | ```{r} 121 | summary(m0) 122 | ``` 123 | Here we see that the overall p-value is low enough to suggest that the model has significant utility, and both terms (the intercept and the coefficient of `z`) are significantly different from zero. The R^2 value of 0.5422 is high enough to indicate that there is a reasonably strong correlation between `sin(z)` and `z` in this range. 124 | 125 | This information is misleading, as we know that a linear model is inappropriate in this case. Indeed, the linear model summary does not check whether the underlying model assumptions are satisfied. 126 | 127 | By observing strong patterns in the diagnostic plots, we can see that the modelling assumptions are not satisified in this case. 128 | ```{r} 129 | plot(m0) 130 | ``` 131 | 132 | 133 | # Section 4: Modelling Non-Linear Relationships 134 | 135 | It is sometimes possible to use linear models to describe non-linear relationships (which is perhaps counterintuitive!). This can be achieved by applying transformations to the variable(s) in order to linearise the relationship, whilst ensuring that modelling assumptions are satisfied. 136 | 137 | Another in-built dataset `cars` provides the speeds and associated stopping distances of cars in the 1920s. 138 | 139 | Let's construct a linear model to predict stopping distance using speed: 140 | 141 | ```{r} 142 | plot(dist~speed,data=cars) 143 | m2 = lm(dist~speed,data=cars) 144 | abline(m2) 145 | summary(m2) 146 | ``` 147 | 148 | The model summary indicates that the intercept term does not have significant utility. So that term could/should be removed from the model. 149 | 150 | In addition, the plot of residuals versus fitted values indicates potential issues with variance stability: 151 | ```{r} 152 | plot(m2) 153 | ``` 154 | 155 | In this case, variance stability can be aided by a square-root transformation of the response variable: 156 | ```{r} 157 | plot(sqrt(dist)~speed,data=cars) 158 | m3 = lm(sqrt(dist)~speed,data=cars) 159 | abline(m3) 160 | plot(m3) 161 | summary(m3) 162 | ``` 163 | 164 | The R^2 value is improved over the previous model. 165 | Note that again that the intercept term is not significant. 166 | 167 | We'll now try a log-log transformation, that is applying a log transformation to the predictor and response variables. This represents a power relationship between the two variables. 168 | ```{r} 169 | plot(log(dist)~log(speed),data=cars) 170 | m4 = lm(log(dist)~log(speed),data=cars) 171 | abline(m4) 172 | plot(m4) 173 | summary(m4) 174 | ``` 175 | 176 | The R^2 value is improved, and the diagnostic plots don't look too unreasonable. However, again the intercept term does not have significant utility. 
So we'll now remove it from the model: 177 | ```{r} 178 | m5 = lm(log(dist)~0+log(speed),data=cars) 179 | plot(m5) 180 | summary(m5) 181 | ``` 182 | 183 | This model seems reasonable. However, remember that R^2 values corresponding to models without an intercept aren't meaningful (or at least can't be compared against models with an intercept term). 184 | 185 | We can now transform the model back, and display the regression curve on the plot: 186 | ```{r} 187 | plot(dist~speed,data=cars) 188 | x = order(cars$speed) 189 | lines(exp(fitted(m5))[x]~cars$speed[x]) 190 | ``` 191 | 192 | # Section 5: Relationship between the t-test, ANOVA and linear regression 193 | 194 | In the ANOVA session we looked at the `diet` dataset, and performed the t-test and ANOVA. Here's a recap: 195 | 196 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 197 | # import 198 | diet = read.csv("data/diet.csv",row.names=1) 199 | diet$weight.loss = diet$initial.weight - diet$final.weight 200 | diet$diet.type = factor(diet$diet.type,levels=c("A","B","C")) 201 | diet$gender = factor(diet$gender,levels=c("Female","Male")) 202 | # comparison 203 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="B",],var.equal = TRUE) 204 | summary(aov(weight.loss~diet.type,data=diet[diet$diet.type!="B",])) 205 | ``` 206 | 207 | Note that the p-values for both the t-test and ANOVA are the same. This is because these tests are equivalent (in the 2-sample case). They both test the same hypothesis. 208 | 209 | Also, the F-test statistic is equal to the square of the t-test statistic (-2.8348^2 = 8.036). Again, this is only true for the 2-sample case. 210 | 211 | Now let's use a different strategy. Instead of directly testing whether there is a difference between the two groups, let's attempt to create a linear model describing the relationship between `weight.loss` and `diet.type`. Indeed, it is possible to construct a linear model where the independent variable(s) are categorical - they do not have to be continuous or even ordinal! 212 | 213 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 214 | summary(lm(weight.loss~diet.type,data=diet[diet$diet.type!="B",])) 215 | ``` 216 | 217 | You can see that the p-value corresponding to the `diet.type` term is the same as the overall p-value of the linear model, which is also the same as the p-value from the t-test and ANOVA. Note also that the F-test statistic is the same as given by the ANOVA. 218 | 219 | So, we are also able to use the linear model to test the hypothesis that there is a difference between the two diet groups, as well as provide a more detailed description of the relationship between `weight.loss` and `diet.type`. 220 | 221 | # Section 6: Practical Exercises 222 | 223 | ## Old Faithful 224 | 225 | The inbuilt R dataset `faithful` pertains to the waiting time between eruptions and 226 | the duration of the eruption for the Old Faithful geyser in Yellowstone National 227 | Park, Wyoming, USA. 228 | 229 | - Create a simple linear regression model that models the eruption duration `faithful$eruptions` using waiting time `faithful$waiting` as the independent variable, storing the model in a variable. Look at the summary of the model. 230 | + What are the values of the estimates of the intercept and coefficient of 'waiting'? 231 | + What is the R^2 value? 232 | + Does the model have significant utility? 233 | + Are neither, one, or both of the parameters significantly different from zero? 
234 | + Can you conclude that there is a linear relationship between the two variables? 235 | - Plot the eruption duration against waiting time. Is there anything noticeable 236 | about the data? 237 | - Draw the regression line corresponding to your model onto the plot. Based on this graphical representation, does the model seem reasonable? 238 | - Generate the four diagnostic plots corresponding to your model. Contemplate the appropriateness of the model for describing the relationship between eruption duration and waiting time. 239 | 240 | ## Anscombe datasets 241 | 242 | Consider the inbuilt R dataset `anscombe`. This dataset contains four x-y datasets, 243 | contained in the columns: (x1,y1), (x2,y2), (x3,y3) and (x4,y4). 244 | 245 | - For each of the four datasets, calculate and test the correlation between the x and y 246 | variables. What do you conclude? 247 | - For each of the four datasets, create a linear model that regresses y on x. Look 248 | at the summaries corresponding to these models. What do you conclude? 249 | - For each of the four datasets, create a plot of y against x. What do you 250 | conclude? 251 | 252 | ## Pharmacokinetics of Indomethacin 253 | 254 | Consider the inbuilt R dataset `Indometh`, which contains data on the pharmacokinetics of indometacin. 255 | 256 | - Plot `Indometh$time` versus `Indometh$conc` (concentration). What is the nature of the relationship 257 | between `time` and `conc`? 258 | - Apply monotonic transformations to the data so that a simple linear regression model can be used to model the relationship (ensure both linearity and stabilised variance, within reason). Create a plot of the transformed data, to confirm that the relationship seems linear. 259 | - After creating the linear model, inspect the diagnostic plots to ensure that the 260 | assumptions are not violated (too much). Are there any outliers with large influence? What are the parameter estimates? Are both terms significant? 261 | - Add a line to the plot showing the linear relationship between the transformed data. 262 | - Now regenerate the original plot of `time` versus `conc` (i.e. the untransformed 263 | data). Using the `lines` function, add a curve to the plot corresponding to the 264 | fitted values of the model. 265 | 266 | -------------------------------------------------------------------------------- /simple_regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/simple_regression.pdf -------------------------------------------------------------------------------- /time_series.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/time_series.pdf -------------------------------------------------------------------------------- /time_series_analysis.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Time Series Analysis with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. 
Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE,eval=FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | We will consider a dataset corresponding to the Monthly Southern Oscillation Index, measured as the difference in sea-surface air pressure between Darwin and Tahiti. 31 | 32 | ```{r} 33 | x=read.table("data/OscillationIndex.txt",header=TRUE) 34 | x$Index 35 | plot(x$Index,type="l",ylab="Oscillation Index") 36 | ``` 37 | 38 | Now let's look at the autocorrelation function corresponding to this dataset: 39 | ```{r} 40 | acf(x$Index,lag.max=70,main="") 41 | ``` 42 | 43 | There is clear long-range oscillatory behaviour in the autocorrelation function, indicating that the process is not stationary. 44 | 45 | We should consider an integrated (ARIMA) model, so let's calculate and plot the first differences, as well as the associated autocorrelation function: 46 | ```{r} 47 | plot(diff(x$Index),type="l",ylab="Oscillation Index (d=1)") 48 | acf(diff(x$Index),lag.max=70,main="") 49 | ``` 50 | 51 | That's more promising - there is one large negative peak at lag=1, after which the autocorrelation function decays rapidly and stays small. This indicates that this process is covariance stationary. This also indicates that the Moving Average (MA) part of the model may be of order 1. So an ARIMA(0,1,1) model might be a possibility. 52 | 53 | Now let's look at the partial autocorrelation function: 54 | ```{r} 55 | pacf(diff(x$Index),lag.max=70,main="") 56 | ``` 57 | 58 | This also looks promising. There are four negative peaks before the PACF decays below the significance threshold. That indicates that the AutoRegressive (AR) part of the model may have order up to 4. 59 | 60 | Now we'll try to create an ARIMA(0,1,1) model: 61 | ```{r} 62 | arima(x$Index,order=c(0,1,1)) 63 | ``` 64 | 65 | Note that the standard error of the coefficient indicates significance of the term. 66 | 67 | Now try creating other ARIMA models, and compare. 68 | 69 | There are a variety of time series datasets in the in-built R "datasets" package. Type `data()` to get a full list. For example, the datasets called `lh`, `ldeaths` and `presidents` are particularly appropriate for this type of analysis. Other datasets also contain time series data, including: `nhtemp`, `lynx`, `Nile`, `co2` and `WWWusage`. Explore such datasets - look at autocorrelation and partial autocorrelation functions, identify whether the datasets are suitable for time series analysis, and try fitting ARIMA models. 70 | -------------------------------------------------------------------------------- /timetable.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/timetable.pdf --------------------------------------------------------------------------------