├── Course_Data.zip ├── README.md ├── _config.yml ├── alternative ├── LM_Cheatsheet.html ├── LM_Cheatsheet.md ├── LM_Cheatsheet.pdf ├── Welcome_elearning_lin_mod.pdf └── timetable.md ├── anova+.Rmd ├── anova+.html ├── anova.Rmd ├── anova.html ├── anova.pdf ├── cheat_sheet.pdf ├── conclusion.pdf ├── data ├── Assay.txt ├── Bronchitis.csv ├── OscillationIndex.txt ├── amess.csv ├── clinicalTrials.txt ├── crab.csv ├── diet.csv ├── genotypes.txt ├── globalBreastCancerRisk.csv ├── lactoferrin.csv ├── myocardialinfarction.csv ├── pollution.csv ├── protein-expression.csv ├── students.csv └── treatments.txt ├── glm+.Rmd ├── glm+.html ├── glm.Rmd ├── glm.html ├── glm.pdf ├── gml.html ├── images ├── examplePlots.png └── plot-char.png ├── index.md ├── install.R ├── logos ├── CRUK_CI_logo.png ├── LMB_logo.png ├── LMB_logo_small.png └── Logos.txt ├── multiple_regression+.Rmd ├── multiple_regression+.html ├── multiple_regression.Rmd ├── multiple_regression.html ├── multiple_regression.pdf ├── r-recap.Rmd ├── r-recap.nb.html ├── simple_regression+.Rmd ├── simple_regression+.html ├── simple_regression.Rmd ├── simple_regression.html ├── simple_regression.pdf ├── time_series.pdf ├── time_series_analysis.Rmd ├── time_series_analysis.html └── timetable.pdf /Course_Data.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/Course_Data.zip -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to Linear Modelling with R 2 | 3 | ## Description 4 | 5 | The course will cover ANOVA, linear regression and some extensions. It will be a mixture of lectures and hands-on time using RStudio to analyse data. 6 | 7 | 8 | # Aims: During this course you will learn about: 9 | 10 | - ANOVA 11 | - Simple and multiple regression 12 | - Generalised Linear Models 13 | - Introduction to more advanced topics, like non-linear models and time series. 14 | 15 | # Objectives: After this course you should be able to 16 | 17 | - Realise the connection between t-tests, ANOVA and linear regression 18 | - Fit a linear regression 19 | - Check if the assumptions of linear regression are met by the data and what to do if they are not 20 | - Know when linear regression is not appropriate and have an idea of which alternative method might be appropriate 21 | - Know when you need to seek help with analysis as the data structure is too complex for the methods taught 22 | 23 | # Pre-requisites 24 | 25 | This course assumes basic knowledge of statistics and use of R, which would be obtained from our Introductory Statistics Course and an "Introduction to R for Solving Biological Problems" run at the Genetics department (or equivalent). 
26 | 
27 | - [Introduction to Solving Biological Problems with R](http://cambiotraining.github.io/r-intro/)
28 | - [Introduction to Statistical Analysis](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/)
29 | 
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-slate
2 | google_analytics: UA-63148050-14
3 | 
--------------------------------------------------------------------------------
/alternative/LM_Cheatsheet.md:
--------------------------------------------------------------------------------
1 | ---
2 | output:
3 |   html_document: default
4 |   pdf_document: default
5 | ---
6 | ![](logos/CRUK_CI_logo.png)![](logos/LMB_logo_small.png)
7 | ## Linear Modelling with R Course Cheatsheet
8 | 
9 | | **ANOVA** | Notes |
10 | | --- | --- |
11 | | aov() | linear model with categorical predictors |
12 | | oneway.test() | (heteroscedastic) linear model with a categorical predictor |
13 | | kruskal.test() | rank-based linear model with a categorical predictor |
14 | | t.test() | (homoscedastic or heteroscedastic) linear model with a binary predictor |
15 | | tapply() | apply a function to subsets of a vector defined by a grouping factor |
16 | | qqnorm() | normal quantile-quantile plot |
17 | | shapiro.test() | test of normality |
18 | | bartlett.test() | test of equality of variance between groups |
19 | 
20 | | **Simple Regression** | Notes |
21 | | --- | --- |
22 | | cor() | correlation between 2 variables |
23 | | cor.test() | test for (linear or rank) association between 2 variables |
24 | | residuals() | extract residuals from a model fit |
25 | | lm() | linear model fit |
26 | 
27 | | **Multiple Regression** | Notes |
28 | | --- | --- |
29 | | AIC() | Akaike's information criterion for a fitted model |
30 | | stepAIC() | AIC-based stepwise model selection |
31 | | nls() | non-linear least squares |
32 | 
33 | | **Generalised Linear Models** | Notes |
34 | | --- | --- |
35 | | install.packages() | install add-on R packages |
36 | | glm() | generalised linear model fit |
37 | | gamlss() | generalised linear and additive model fit |
38 | | anova() | comparison of nested models |
39 | | chisq.test() | Pearson's chi-square test |
40 | | prop.test() | test of equality of proportions |
41 | 
42 | | **Time Series and Non-Linear Models** | Notes |
43 | | --- | --- |
44 | | acf() | auto-correlation function |
45 | | pacf() | partial auto-correlation function |
46 | | arima() | ARIMA modelling of time series |
47 | 
48 | 
49 | 
50 | 
51 | 
52 | 
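As a quick illustration of how a few of the ANOVA entries above fit together, here is a minimal sketch (not part of the original cheatsheet) using a made-up data frame `dat` with a numeric response `y` and a three-level factor `g`:

```r
## hypothetical example data: numeric response y, grouping factor g with 3 levels
dat <- data.frame(y = rnorm(30), g = gl(3, 10, labels = c("A", "B", "C")))

tapply(dat$y, dat$g, mean)         # group means
fit <- aov(y ~ g, data = dat)      # one-way ANOVA (linear model with a categorical predictor)
summary(fit)                       # F test for differences between the group means
qqnorm(residuals(fit))             # normal quantile-quantile plot of the residuals
shapiro.test(residuals(fit))       # test of normality of the residuals
bartlett.test(y ~ g, data = dat)   # test of equality of variance between groups
```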
--------------------------------------------------------------------------------
/alternative/timetable.md:
--------------------------------------------------------------------------------
1 | ### [http://tinyurl.com/linear-models-r](http://tinyurl.com/linear-models-r)
2 | ## Introduction to Linear Models with R - Course Schedule
3 | 
4 | | Time | Topic |
5 | | ------------- | ------------- |
6 | | 09.45 - 10.15 | Welcome & Introduction to RStudio and Markdown (Mark) |
7 | | 10.15 - 11.30 | ANOVA (Dominique) {10 min coffee break at 11am*} |
8 | | 11.30 - 13.00 | Simple Regression (Rob) |
9 | | 13.00 - 14.00 | Lunch Break |
10 | | 14.00 - 15.15 | Multiple Regression (Rob) |
11 | | 15.15 - 15.30 | { 15 min Tea break* } |
12 | | 15.30 - 16.45 | GLMs (Dominique) |
13 | | 16.45 - 17.15 | Time-series (Rob) |
14 | | 17.15 - 17.25 | Conclusion |
15 | 
16 | *Coffee, Tea, Water & Cookies provided
17 | 
18 | ![CRUK Cambridge Institute](logos/CRUK_CI_logo.png)
19 | ![MRC Laboratory of Molecular Biology](logos/LMB_logo_small.png)
20 | 
21 | 
--------------------------------------------------------------------------------
/anova+.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "ANOVA with R: analysis of the *diet* dataset"
3 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri"
4 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
5 | output:
6 |   html_document:
7 |     theme: united
8 |     highlight: tango
9 |     code_folding: show
10 |     toc: true
11 |     toc_depth: 2
12 |     toc_float: true
13 |     fig_width: 8
14 |     fig_height: 6
15 | ---
16 | 
17 | 
18 | 
19 | 
20 | 
21 | 
22 | ```{r message = FALSE, warning = FALSE, echo = FALSE}
23 | # change working directory: should be the directory containing the Markdown files:
24 | # setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/20210514/Practicals/")
25 | 
26 | ```
27 | 
28 | 
29 | A full version of the dataset *diet* may be found online on the University of Sheffield website.
30 | 
31 | A slightly modified version is stored in the data file data/diet.csv. The data set contains information on 76 people who undertook one of three diets (referred to as diet _A_, _B_ and _C_). There is background information such as age, gender, and height. The aim of the study was to see which diet was best for losing weight.
32 | 
33 | 
34 | # Section 1: Data import and descriptive analysis
35 | 
36 | Let's start by
37 | 
38 | * importing the data set *diet* with the function `read.csv()`
39 | * defining a new column *weight.loss*, corresponding to the difference between the initial and final weights (columns `initial.weight` and `final.weight` of the dataset)
40 | * displaying _weight loss_ per _diet type_ (column `diet.type`) by means of a boxplot.
41 | 
42 | 
43 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
44 | diet = read.csv("data/diet.csv",row.names=1)
45 | diet$weight.loss = diet$initial.weight - diet$final.weight
46 | diet$diet.type = factor(diet$diet.type,levels=c("A","B","C"))
47 | diet$gender = factor(diet$gender,levels=c("Female","Male"))
48 | boxplot(weight.loss~diet.type,data=diet,col="light gray",
49 |         ylab = "Weight loss (kg)", xlab = "Diet type")
50 | abline(h=0,col="blue")
51 | ```
52 | 
53 | # Section 2: ANOVA
54 | 
55 | Let's
56 | 
57 | * perform Fisher's, Welch's and Kruskal-Wallis one-way ANOVA, respectively by means of the functions `aov()`, `oneway.test()` and `kruskal.test()`,
58 | * display and analyse the results: use the function `summary()` to display the results of an R object of class `aov`, and the function `print()` otherwise.
59 | 
60 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
61 | diet.fisher = aov(weight.loss~diet.type,data=diet)
62 | diet.welch = oneway.test(weight.loss~diet.type,data=diet)
63 | diet.kruskal = kruskal.test(weight.loss~diet.type,data=diet)
64 | 
65 | summary(diet.fisher)
66 | print(diet.welch)
67 | print(diet.kruskal)
68 | ```
69 | 
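The three functions return objects of different classes, so their p-values live in different places. As a small added sketch (not part of the original practical, reusing the objects created in the chunk above), they can be collected for a side-by-side comparison:

```{r}
# collect the p-values of the three tests fitted above
c(fisher  = summary(diet.fisher)[[1]][["Pr(>F)"]][1],
  welch   = diet.welch$p.value,
  kruskal = diet.kruskal$p.value)
```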
70 | Note that, when the interest lies in the difference between two means, Fisher's ANOVA (function `aov()`) and Student's t-test (function `t.test()` with argument `var.equal` set to `TRUE`) lead to the same results.
71 | Let's check this by comparing the mean weight losses of *Diet A* and *Diet C*.
72 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
73 | summary(aov(weight.loss~diet.type,data=diet[diet$diet.type!="B",]))
74 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="B",],var.equal = TRUE)
75 | ```
76 | 
77 | 
78 | # Section 3: Model check
79 | 
80 | Let's first
81 | 
82 | * define the Fisher and Welch residuals by subtracting the mean of each group from the weight loss of the corresponding participants,
83 | * define the Kruskal-Wallis residuals by subtracting the median of each group from the weight loss of the corresponding participants.
84 | 
85 | The mean or median of each group may be obtained by means of the function `tapply()`, which applies a function (like `mean` or `median`) to a vector separately for each level of a grouping factor.
86 | 
87 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
88 | # mean and median weight loss per group:
89 | mean_group = tapply(diet$weight.loss,diet$diet.type,mean)
90 | median_group = tapply(diet$weight.loss,diet$diet.type,median)
91 | mean_group
92 | median_group
93 | # residuals:
94 | diet$resid.mean = (diet$weight.loss - mean_group[as.numeric(diet$diet.type)])
95 | diet$resid.median = (diet$weight.loss - median_group[as.numeric(diet$diet.type)])
96 | diet[1:10,]
97 | ```
98 | 
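As an added cross-check (a sketch, not part of the original material): the mean-based residuals computed by hand above should agree, up to numerical precision, with the residuals extracted from the one-way fit `diet.fisher` via `residuals()`:

```{r}
# residuals() on the one-way aov fit reproduces the hand-computed mean residuals
all.equal(as.numeric(residuals(diet.fisher)), as.numeric(diet$resid.mean))
```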
99 | Then, let's
100 | 
101 | * display a boxplot of the residuals per group to assess whether (i) the variances per group are similar and (ii) normality of the residuals per group seems credible,
102 | * display a QQ-plot of the residuals of the mean model to assess whether normality of the residuals seems credible.
103 | 
104 | 
105 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
106 | par(mfrow=c(1,2),mar=c(4.5,4.5,2,0))
107 | #
108 | boxplot(resid.mean~diet.type,data=diet,main="Residual boxplot per group",col="light gray",xlab="Diet type",ylab="Residuals")
109 | abline(h=0,col="blue")
110 | #
111 | col_group = rainbow(nlevels(diet$diet.type))
112 | qqnorm(diet$resid.mean,col=col_group[as.numeric(diet$diet.type)])
113 | qqline(diet$resid.mean)
114 | legend("top",legend=levels(diet$diet.type),col=col_group,pch=21,ncol=3,box.lwd=NA)
115 | ```
116 | 
117 | Finally, let's
118 | 
119 | * perform Shapiro's test to assess whether there is enough evidence that the residuals are not normally distributed (by means of the function `shapiro.test()`),
120 | * perform Bartlett's test to assess whether there is enough evidence that the variances of the residuals differ between groups (by means of the function `bartlett.test()`).
121 | 
122 | 
123 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
124 | shapiro.test(diet$resid.mean)
125 | bartlett.test(diet$resid.mean~as.numeric(diet$diet.type))
126 | ```
127 | 
128 | 
129 | # Section 4: Multiple comparisons
130 | 
131 | Let's
132 | 
133 | * perform a Tukey HSD test to identify which group pair(s) have different means (by means of the function `TukeyHSD()`),
134 | * compare the size of the Tukey HSD confidence interval for the difference in mean weight loss between *Diet A* and *Diet B* with the one obtained by means of a Student's t-test (function `t.test()` with argument `var.equal` set to `TRUE`).
135 | 
136 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
137 | plot(TukeyHSD(diet.fisher))
138 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="C",],var.equal = TRUE)
139 | ```
140 | 
141 | # Section 5: Two-way ANOVA
142 | 
143 | Let's
144 | 
145 | * perform a two-way ANOVA to assess whether the mean weight loss differs across the levels of the factors _Diet_ and/or _Gender_,
146 | * compare the output of the function `aov()` with that of the function `lm()`.
147 | 
148 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
149 | diet.fisher = aov(weight.loss~diet.type*gender,data=diet)
150 | summary(diet.fisher)
151 | 
152 | anova(lm(weight.loss~diet.type*gender,data=diet))
153 | ```
154 | 
155 | 
156 | # Section 6: Practicals
157 | 
158 | Analyse the following two datasets using a suitable analysis:
159 | 
160 | ## (i) *amess.csv*
161 | The data for this exercise are to be found in *amess.csv*. The data are the red cell folate levels in three groups of cardiac bypass patients given different levels of nitrous oxide (N2O) and oxygen (O2) ventilation. (There is a reference to the source of these data in Altman, Practical Statistics for Medical Research, p. 208.)
162 | The treatments are
163 | 
164 | * 50% N2O and 50% O2 continuously for 24 hours
165 | * 50% N2O and 50% O2 during the operation
166 | * No N2O but 35-50% O2 continuously for 24 hours
167 | 
168 | 
169 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
170 | amess = read.csv("data/amess.csv")
171 | amess$treatmnt = as.factor(amess$treatmnt)
172 | boxplot(folate~treatmnt,data=amess,col="light gray")
173 | 
174 | # residuals:
175 | mean_treatmnt = tapply(amess$folate,amess$treatmnt,mean)
176 | amess$resid.mean = (amess$folate - mean_treatmnt[as.numeric(amess$treatmnt)])
177 | 
178 | #
179 | bartlett.test(amess$resid.mean~as.numeric(amess$treatmnt))
180 | 
181 | par(mfrow=c(1,2),mar=c(4.5,4.5,2,0))
182 | #
183 | boxplot(resid.mean~treatmnt,data=amess,main="Residual boxplot per group",col="light gray",xlab="Treatment type",ylab="Residuals")
184 | abline(h=0,col="blue")
185 | #
186 | col_group = rainbow(nlevels(amess$treatmnt))
187 | qqnorm(amess$resid.mean,col=col_group[as.numeric(amess$treatmnt)])
188 | qqline(amess$resid.mean)
189 | legend("top",legend=levels(amess$treatmnt),col=col_group,pch=21,ncol=3,box.lwd=NA)
190 | 
191 | 
192 | # Welch's and Fisher's one-way ANOVA:
193 | amess.welch = oneway.test(folate~treatmnt,data=amess)
194 | print(amess.welch)
195 | 
196 | amess.aov = aov(folate~treatmnt,data=amess)
197 | summary(amess.aov)
198 | 
199 | ```
200 | 
201 | 
202 | 
203 | 
204 | 
205 | 
206 | ## (ii) *globalBreastCancerRisk.csv*
207 | 
208 | The file *globalBreastCancerRisk.csv* gives the number of new cases of breast cancer (per population of 10,000) in various countries around the world, along with various health and lifestyle risk factors.
209 | 210 | Let’s suppose we are initially interested in whether the number of breast cancer cases is significantly different in different regions of the world. 211 | 212 | Visualise the distribution of breast cancer incidence in each continent. Check how many observations belong to each group (continent). Are there any groups that you would consider removing/grouping before performing the analysis ? 213 | 214 | 215 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 216 | breastcancer = read.csv("data/globalBreastCancerRisk.csv",row.names=1) 217 | breastcancer$continent = factor(breastcancer$continent) 218 | 219 | boxplot(NewCasesOfBreastCancerIn2002~continent,data=breastcancer,col="light gray") 220 | table(breastcancer$continent) 221 | breastcancer$continent2 = as.character(breastcancer$continent) 222 | breastcancer$continent2[breastcancer$continent2=="Oceania"] = "Asia" 223 | breastcancer$continent2 = factor(breastcancer$continent2) 224 | table(breastcancer$continent2) 225 | 226 | par(mfrow=c(1,2)) 227 | boxplot(NewCasesOfBreastCancerIn2002~continent2,data=breastcancer,col="light gray", 228 | main="original scale") 229 | boxplot(log(NewCasesOfBreastCancerIn2002)~continent2,data=breastcancer, 230 | main="log scale",col="light gray") 231 | 232 | 233 | # resid: 234 | mean_continent2 = tapply(breastcancer$NewCasesOfBreastCancerIn2002,breastcancer$continent2,mean,na.rm=TRUE) 235 | breastcancer$resid.mean = (breastcancer$NewCasesOfBreastCancerIn2002 - mean_continent2[as.numeric(breastcancer$continent2)]) 236 | 237 | bartlett.test(breastcancer$resid.mean~as.numeric(breastcancer$continent2)) 238 | shapiro.test(breastcancer$resid.mean) 239 | # clear red flags ! 240 | 241 | par(mfrow=c(1,2),mar=c(4.5,4.5,2,0)) 242 | # 243 | boxplot(resid.mean~continent2,data=breastcancer,main="Residual boxplot per group",col="light gray",xlab="Treatment type",ylab="Residuals") 244 | abline(h=0,col="blue") 245 | # 246 | col_group = rainbow(nlevels(breastcancer$continent2)) 247 | qqnorm(breastcancer$resid.mean,col=col_group[as.numeric(breastcancer$continent2)]) 248 | qqline(breastcancer$resid.mean) 249 | legend("top",legend=levels(breastcancer$continent2),col=col_group,pch=21,ncol=3,box.lwd=NA) 250 | 251 | 252 | # kruskal-wallis: 253 | breastcancer.kruskal = kruskal.test(NewCasesOfBreastCancerIn2002~continent2,data=breastcancer) 254 | breastcancer.kruskal 255 | ``` 256 | 257 | 258 | 259 | 260 | 261 | 262 | -------------------------------------------------------------------------------- /anova.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | --- 4 | title: "ANOVA with R: analysis of the *diet* dataset" 5 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri" 6 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 7 | output: 8 | html_document: 9 | theme: united 10 | highlight: tango 11 | code_folding: show 12 | toc: true 13 | toc_depth: 2 14 | toc_float: true 15 | fig_width: 8 16 | fig_height: 6 17 | --- 18 | 19 | 20 | 21 | 22 | 23 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 24 | # change working directory: should be the directory containg the Markdown files: 25 | # setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/20210514/Practicals/") 26 | 27 | ``` 28 | 29 | 30 | A full version of the dataset *diet* may be found online on the U. of Sheffield website . 31 | 32 | A slightly modified version is available in the data file is stored under data/diet.csv. 
The data set contains information on 76 people who undertook one of three diets (referred to as diet _A_, _B_ and _C_). There is background information such as age, gender, and height. The aim of the study was to see which diet was best for losing weight.
33 | 
34 | 
35 | # Section 1: Data import and descriptive analysis
36 | 
37 | Let's start by
38 | 
39 | * importing the data set *diet* with the function `read.csv()`
40 | * defining a new column *weight.loss*, corresponding to the difference between the initial and final weights (columns `initial.weight` and `final.weight` of the dataset)
41 | * displaying _weight loss_ per _diet type_ (column `diet.type`) by means of a boxplot.
42 | 
43 | 
44 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
45 | diet = read.csv("data/diet.csv",row.names=1)
46 | diet$weight.loss = diet$initial.weight - diet$final.weight
47 | diet$diet.type = factor(diet$diet.type,levels=c("A","B","C"))
48 | diet$gender = factor(diet$gender,levels=c("Female","Male"))
49 | boxplot(weight.loss~diet.type,data=diet,col="light gray",
50 |         ylab = "Weight loss (kg)", xlab = "Diet type")
51 | abline(h=0,col="blue")
52 | ```
53 | 
54 | # Section 2: ANOVA
55 | 
56 | Let's
57 | 
58 | * perform Fisher's, Welch's and Kruskal-Wallis one-way ANOVA, respectively by means of the functions `aov()`, `oneway.test()` and `kruskal.test()`,
59 | * display and analyse the results: use the function `summary()` to display the results of an R object of class `aov`, and the function `print()` otherwise.
60 | 
61 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
62 | diet.fisher = aov(weight.loss~diet.type,data=diet)
63 | diet.welch = oneway.test(weight.loss~diet.type,data=diet)
64 | diet.kruskal = kruskal.test(weight.loss~diet.type,data=diet)
65 | 
66 | summary(diet.fisher)
67 | print(diet.welch)
68 | print(diet.kruskal)
69 | ```
70 | 
71 | Note that, when the interest lies in the difference between two means, Fisher's ANOVA (function `aov()`) and Student's t-test (function `t.test()` with argument `var.equal` set to `TRUE`) lead to the same results.
72 | Let's check this by comparing the mean weight losses of *Diet A* and *Diet C*.
73 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
74 | summary(aov(weight.loss~diet.type,data=diet[diet$diet.type!="B",]))
75 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="B",],var.equal = TRUE)
76 | ```
77 | 
78 | 
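As an extra sanity check (a sketch added here with hypothetical object names, not part of the original material): for a two-group comparison the one-way ANOVA F statistic is simply the square of the pooled-variance t statistic, so the two tests give identical p-values.

```{r}
# F statistic of the two-group ANOVA equals the squared pooled-variance t statistic
AC       <- diet[diet$diet.type != "B", ]
fit.AC   <- aov(weight.loss ~ diet.type, data = AC)
ttest.AC <- t.test(weight.loss ~ diet.type, data = AC, var.equal = TRUE)
c(F = summary(fit.AC)[[1]][["F value"]][1], t.squared = unname(ttest.AC$statistic)^2)
```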
79 | # Section 3: Model check
80 | 
81 | Let's first
82 | 
83 | * define the Fisher and Welch residuals by subtracting the mean of each group from the weight loss of the corresponding participants,
84 | * define the Kruskal-Wallis residuals by subtracting the median of each group from the weight loss of the corresponding participants.
85 | 
86 | The mean or median of each group may be obtained by means of the function `tapply()`, which applies a function (like `mean` or `median`) to a vector separately for each level of a grouping factor.
87 | 
88 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
89 | # mean and median weight loss per group:
90 | mean_group = tapply(diet$weight.loss,diet$diet.type,mean)
91 | median_group = tapply(diet$weight.loss,diet$diet.type,median)
92 | mean_group
93 | median_group
94 | # residuals:
95 | diet$resid.mean = (diet$weight.loss - mean_group[as.numeric(diet$diet.type)])
96 | diet$resid.median = (diet$weight.loss - median_group[as.numeric(diet$diet.type)])
97 | diet[1:10,]
98 | ```
99 | 
100 | Then, let's
101 | 
102 | * display a boxplot of the residuals per group to assess whether (i) the variances per group are similar and (ii) normality of the residuals per group seems credible,
103 | * display a QQ-plot of the residuals of the mean model to assess whether normality of the residuals seems credible.
104 | 
105 | 
106 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
107 | par(mfrow=c(1,2),mar=c(4.5,4.5,2,0))
108 | #
109 | boxplot(resid.mean~diet.type,data=diet,main="Residual boxplot per group",col="light gray",xlab="Diet type",ylab="Residuals")
110 | abline(h=0,col="blue")
111 | #
112 | col_group = rainbow(nlevels(diet$diet.type))
113 | qqnorm(diet$resid.mean,col=col_group[as.numeric(diet$diet.type)])
114 | qqline(diet$resid.mean)
115 | legend("top",legend=levels(diet$diet.type),col=col_group,pch=21,ncol=3,box.lwd=NA)
116 | ```
117 | 
118 | Finally, let's
119 | 
120 | * perform Shapiro's test to assess whether there is enough evidence that the residuals are not normally distributed (by means of the function `shapiro.test()`),
121 | * perform Bartlett's test to assess whether there is enough evidence that the variances of the residuals differ between groups (by means of the function `bartlett.test()`).
122 | 
123 | 
124 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
125 | shapiro.test(diet$resid.mean)
126 | bartlett.test(diet$resid.mean~as.numeric(diet$diet.type))
127 | ```
128 | 
129 | 
130 | # Section 4: Multiple comparisons
131 | 
132 | Let's
133 | 
134 | * perform a Tukey HSD test to identify which group pair(s) have different means (by means of the function `TukeyHSD()`),
135 | * compare the size of the Tukey HSD confidence interval for the difference in mean weight loss between *Diet A* and *Diet B* with the one obtained by means of a Student's t-test (function `t.test()` with argument `var.equal` set to `TRUE`).
136 | 
137 | ```{r message = FALSE, warning = FALSE, echo = TRUE}
138 | plot(TukeyHSD(diet.fisher))
139 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="C",],var.equal = TRUE)
140 | ```
141 | 
142 | # Section 5: Two-way ANOVA
143 | 
144 | Let's
145 | 
146 | * perform a two-way ANOVA to assess whether the mean weight loss differs across the levels of the factors _Diet_ and/or _Gender_ (see the note on the `*` formula shorthand below),
147 | * compare the output of the function `aov()` with that of the function `lm()`.
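One point worth making explicit before the chunk that follows (an added note with hypothetical object names, not part of the original material): in R model formulas `a*b` is shorthand for `a + b + a:b`, i.e. both main effects plus their interaction, so the two specifications fit exactly the same model.

```{r}
# diet.type*gender is shorthand for diet.type + gender + diet.type:gender
fit.star <- aov(weight.loss ~ diet.type*gender, data = diet)
fit.full <- aov(weight.loss ~ diet.type + gender + diet.type:gender, data = diet)
all.equal(coef(fit.star), coef(fit.full))
```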
148 | 149 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 150 | diet.fisher = aov(weight.loss~diet.type*gender,data=diet) 151 | summary(diet.fisher) 152 | 153 | anova(lm(weight.loss~diet.type*gender,data=diet)) 154 | ``` 155 | 156 | 157 | # Section 5: Practicals 158 | 159 | Analyse the two following datasets with the suitable analysis: 160 | 161 | ## (i) *amess.csv* 162 | The data for this exercise are to be found in *amess.csv*. The data are the red cell folate levels in three groups of cardiac bypass patients given different levels of nitrous oxide (N2O) and oxygen (O2) ventilation. (There is a reference to the source of this data in Altman, Practical Statistics for Medical Research, p. 208.) 163 | The treatments are 164 | 165 | * 50% N2O and 50% O2 continuously for 24 hours 166 | * 50% N2O and 50% O2 during the operation 167 | * No N2O but 35-50% O2 continuously for 24 hours 168 | 169 | ## (ii) *globalBreastCancerRisk.csv* 170 | 171 | The file *globalBreastCancerRisk.csv* gives the number of new cases of Breast Cancer (per population of 10,000) in various countries around the world, along with various health and lifestyle risk factors. 172 | 173 | Let’s suppose we are initially interested in whether the number of breast cancer cases is significantly different in different regions of the world. 174 | 175 | Visualise the distribution of breast cancer incidence in each continent. Check how many observations belong to each group (continent). Are there any groups that you would consider removing/grouping before performing the analysis ? 176 | 177 | 178 | 179 | 180 | 181 | -------------------------------------------------------------------------------- /anova.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/anova.pdf -------------------------------------------------------------------------------- /cheat_sheet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/cheat_sheet.pdf -------------------------------------------------------------------------------- /conclusion.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/conclusion.pdf -------------------------------------------------------------------------------- /data/Assay.txt: -------------------------------------------------------------------------------- 1 | micrograms Optical Density 2 | 1.0 0.040 3 | 2.0 0.059 4 | 3.0 0.083 5 | 4.0 0.102 6 | 5.0 0.123 7 | 6.0 0.139 8 | 7.0 0.160 9 | Unknown 1 0.067 10 | Unknown 2 0.073 11 | Unknown 3 0.098 -------------------------------------------------------------------------------- /data/Bronchitis.csv: -------------------------------------------------------------------------------- 1 | bron,cigs,poll 2 | 0,5.15,67.1 3 | 1,0,66.9 4 | 0,2.5,66.7 5 | 0,1.75,65.8 6 | 0,6.75,64.4 7 | 0,0,64.4 8 | 1,0,65.1 9 | 1,9.5,66.2 10 | 0,0,65.9 11 | 0,0.75,67.1 12 | 0,5.25,67.9 13 | 1,8,68.1 14 | 1,5.15,67 15 | 1,30,66.3 16 | 0,0,65.7 17 | 0,0,65.2 18 | 0,5.25,64.2 19 | 0,10.05,64.6 20 | 0,0,63.5 21 | 1,3.4,63 22 | 0,0,62.7 23 | 0,0.55,62.7 24 | 1,9.5,62.1 25 | 1,12.5,63.7 26 | 0,0,63.1 27 | 0,3.4,63 28 | 0,2.2,62.7 29 | 
0,6.7,63.1 30 | 0,1.1,62.4 31 | 0,1.8,64.4 32 | 0,0,64.2 33 | 1,3.6,64.2 34 | 0,1.6,63 35 | 0,6.2,62.2 36 | 0,14.75,62.3 37 | 0,0.35,63.7 38 | 1,13.75,63.8 39 | 0,0,63.1 40 | 1,7.5,62.7 41 | 0,1,62.9 42 | 0,0,62.5 43 | 1,14.8,61.7 44 | 1,3.5,61.6 45 | 0,0,61.6 46 | 0,0,61.4 47 | 0,0.25,61.4 48 | 0,1.55,62 49 | 1,0,61.8 50 | 0,0,60.9 51 | 0,5.9,60.8 52 | 0,16.45,60.6 53 | 0,2.65,62.9 54 | 1,12.5,62.6 55 | 0,0,62.1 56 | 0,14.55,61.7 57 | 1,11,61 58 | 1,6.75,62.7 59 | 0,0,62.7 60 | 1,0,61.7 61 | 0,1.75,60.9 62 | 0,2.4,60.6 63 | 0,10.05,60.4 64 | 1,12.75,61.7 65 | 0,0,61.9 66 | 0,5,61.3 67 | 0,0.6,60.7 68 | 0,0,60.8 69 | 0,0.85,60.5 70 | 0,0.9,59.7 71 | 0,0,59.5 72 | 1,8.75,59.6 73 | 0,0.8,59.1 74 | 1,6.6,59.4 75 | 0,1,58.5 76 | 0,0,60 77 | 1,8.15,59.8 78 | 0,0,59.7 79 | 1,5,59.4 80 | 0,2.55,59.2 81 | 0,1.2,58.6 82 | 0,0,60.8 83 | 1,11.25,60.4 84 | 0,0,60.2 85 | 0,2,60 86 | 0,1.9,59.4 87 | 0,0.45,59.8 88 | 1,0,59.7 89 | 0,0,59 90 | 1,6.9,59 91 | 0,2.35,58.6 92 | 0,3.95,59.7 93 | 0,0.6,59.6 94 | 1,15,59.4 95 | 0,0,59.4 96 | 0,0.95,59.4 97 | 0,0,59.3 98 | 0,1.4,54.2 99 | 0,0.5,54 100 | 0,0.6,53.8 101 | 0,0,53.7 102 | 0,2.45,53.7 103 | 0,1.75,53.1 104 | 0,0,54.4 105 | 0,3.1,54.2 106 | 0,10.05,53.9 107 | 0,0.55,53.2 108 | 0,0.85,53.2 109 | 0,1.1,54.9 110 | 0,0,54.9 111 | 0,0,54.5 112 | 0,1.45,54.2 113 | 0,2.05,54.2 114 | 1,10.5,54 115 | 0,0.5,55.8 116 | 1,9.2,55.5 117 | 0,0.55,55.6 118 | 0,0,55.5 119 | 0,0.96,54.9 120 | 0,1,54.6 121 | 0,0,56.9 122 | 0,5.25,56.4 123 | 1,0,55.9 124 | 0,9,55.8 125 | 0,1.6,55.6 126 | 1,10.9,57.6 127 | 0,0,57.7 128 | 0,0,57.6 129 | 0,2.25,57.8 130 | 0,2.65,57.8 131 | 0,0.55,58.4 132 | 0,0,58.2 133 | 1,4.5,58 134 | 0,15,58.1 135 | 0,0,57.9 136 | 0,0,57.3 137 | 0,4.2,58.3 138 | 0,0.55,58.1 139 | 1,10,57.9 140 | 0,0,57.6 141 | 0,7.1,57.3 142 | 0,3.2,57.1 143 | 1,0,58.9 144 | 1,6.8,58.6 145 | 0,0,58.7 146 | 0,0,57.5 147 | 0,2.35,57.2 148 | 0,24.9,58 149 | 0,2.65,57.9 150 | 1,3.7,57.2 151 | 0,17.1,57.3 152 | 0,0,57.5 153 | 0,0.95,57.2 154 | 0,10.05,53.1 155 | 0,1.15,53 156 | 1,18.25,53 157 | 0,10,52.9 158 | 0,0.75,52.6 159 | 0,0,53.1 160 | 0,4.2,53 161 | 0,0.8,52.9 162 | 0,0.55,52.7 163 | 0,0.95,52.6 164 | 0,0,52.1 165 | 0,3.1,54.1 166 | 0,0.8,53.7 167 | 0,1.55,53.1 168 | 0,0.4,53.3 169 | 0,6.2,53 170 | 0,0.6,53 171 | 0,0.4,53.9 172 | 1,7.5,53.7 173 | 0,7.15,53.4 174 | 0,0.25,53.2 175 | 0,3.6,53.4 176 | 0,0.95,53.2 177 | 0,2.8,54.9 178 | 1,20.25,54.9 179 | 0,0.95,54.6 180 | 0,4.25,54.1 181 | 0,4.15,54.2 182 | 0,10,57.4 183 | 0,3.4,57.3 184 | 0,0,57.3 185 | 0,3.6,56.7 186 | 0,0.9,56.5 187 | 0,0,56.8 188 | 0,0,56.6 189 | 1,6.4,56.5 190 | 0,0.95,56.3 191 | 0,1.06,56.3 192 | 0,13.3,56.2 193 | 0,1.1,56.6 194 | 0,17.2,55.9 195 | 0,1.65,56 196 | 1,5,55.8 197 | 0,2.1,55.7 198 | 0,0.6,57 199 | 1,8.25,56.7 200 | 0,0.9,56.4 201 | 0,0,56.5 202 | 1,12.3,55.2 203 | 0,1.15,56.9 204 | 0,2.2,56.7 205 | 0,3.6,56 206 | 1,10,55.5 207 | 0,0.6,55.3 208 | 0,9.5,56.5 209 | 0,0.7,56.3 210 | 1,9,56.1 211 | 0,0,55.9 212 | 0,0.5,55.5 213 | 0,0.9,55.4 214 | -------------------------------------------------------------------------------- /data/OscillationIndex.txt: -------------------------------------------------------------------------------- 1 | Index 2 | -8.3 3 | -4.1 4 | -4.6 5 | 1.8 6 | -11.8 7 | 4.2 8 | -2.7 9 | 2.8 10 | 6.2 11 | -3.9 12 | 3.3 13 | -0.3 14 | 2.3 15 | 4.1 16 | 6.8 17 | 6.9 18 | 6 19 | -0.7 20 | -0.7 21 | 5 22 | -6.3 23 | 8.6 24 | 2.7 25 | -21 26 | -5.9 27 | 4.7 28 | 12.5 29 | -3.8 30 | 6 31 | -5.7 32 | 9.8 33 | 2.1 34 | -5.6 35 | -2.5 36 | -0.4 37 | 2.1 38 | 6.4 39 | 7.9 40 | 3.6 
41 | -5.3 42 | -2.7 43 | -0.2 44 | 0.7 45 | 19.6 46 | 4.7 47 | -1.8 48 | 3.9 49 | -8.2 50 | 2.9 51 | 0.3 52 | -13.5 53 | -0.7 54 | 8.8 55 | -6.2 56 | 4.5 57 | 1.3 58 | 0.3 59 | 2.4 60 | -5.2 61 | 3.3 62 | 1.1 63 | -2.2 64 | -2.1 65 | 5.4 66 | 6.9 67 | 2.7 68 | -4.1 69 | 2.8 70 | 12.8 71 | 14.9 72 | 17.2 73 | 12.4 74 | 7.6 75 | 13.6 76 | 1.7 77 | 12.5 78 | 16.5 79 | 7.2 80 | 9.4 81 | 7.9 82 | -0.4 83 | -1.8 84 | 7.5 85 | -0.3 86 | -8.8 87 | -14.8 88 | -7.8 89 | -9.9 90 | -0.8 91 | -5.2 92 | -10.4 93 | -8.9 94 | -13 95 | -17.2 96 | -14.3 97 | -17.4 98 | -18.8 99 | -18.6 100 | -6.5 101 | -30.7 102 | -10.3 103 | -17 104 | -10.4 105 | -10.4 106 | -5.6 107 | -13 108 | -19.1 109 | -18 110 | -7.7 111 | -20.5 112 | -9.1 113 | -9.9 114 | -13.7 115 | -4.7 116 | -6 117 | -5.2 118 | 5.5 119 | 6.5 120 | -1 121 | 3.9 122 | 8.7 123 | 9.2 124 | -4 125 | 12.5 126 | 8.8 127 | 10.1 128 | 2.6 129 | 11.6 130 | 3.3 131 | -7.4 132 | 2.7 133 | 7.5 134 | 5.8 135 | 9.8 136 | 3.6 137 | -9.9 138 | -8.9 139 | 3.2 140 | 4.1 141 | -5.2 142 | -0.4 143 | -3.9 144 | -8.2 145 | 3.3 146 | 2.9 147 | -8.5 148 | -6.5 149 | 2.9 150 | 4.5 151 | 5.7 152 | 10.8 153 | -6.7 154 | 0.3 155 | 6.5 156 | 3.3 157 | 11.2 158 | 8.7 159 | 2.9 160 | -3.4 161 | 5.4 162 | -3.1 163 | 3.7 164 | -2.7 165 | -8.9 166 | -10 167 | -8.8 168 | -9.5 169 | -4 170 | -15.3 171 | -12.3 172 | -1.5 173 | -6.8 174 | -5.5 175 | -5.2 176 | 9.4 177 | -4.5 178 | -12.2 179 | 1.7 180 | 8.7 181 | 6.9 182 | 11.7 183 | -1.6 184 | 8.7 185 | 3.9 186 | -3.6 187 | -3.7 188 | -4.6 189 | 2.1 190 | 4 191 | -4.6 192 | 0.8 193 | -4 194 | -7.1 195 | 6.6 196 | 4.2 197 | -6.8 198 | -7.9 199 | 1.2 200 | 4.1 201 | 0.6 202 | -4.8 203 | -10.9 204 | -1.6 205 | -4 206 | 2.3 207 | 6 208 | -5.9 209 | 6.4 210 | 4.5 211 | 17 212 | 14.6 213 | 13.8 214 | 7.7 215 | 22.6 216 | 19.6 217 | 11.8 218 | 7 219 | 18 220 | 11.8 221 | 21.7 222 | 12.7 223 | 5.7 224 | -5.5 225 | -7.4 226 | -11.5 227 | -1.8 228 | -12.5 229 | -5.2 230 | -11.2 231 | -12.3 232 | -8.5 233 | -8.3 234 | -8.9 235 | -8.1 236 | 0.2 237 | -6.7 238 | 7.7 239 | 5.8 240 | 4.5 241 | -2.2 242 | -1.8 243 | 3.5 244 | 0.4 245 | -12.9 246 | 1.6 247 | -7.1 248 | -6 249 | -0.8 250 | -25.5 251 | -2.5 252 | -1 253 | -16.1 254 | -13 255 | -0.3 256 | -2.7 257 | -5.8 258 | 5 259 | -5.2 260 | -2.2 261 | 5 262 | 4 263 | -2.5 264 | 3.3 265 | 9.4 266 | 2.3 267 | 2.2 268 | 2.3 269 | 11.5 270 | -5.5 271 | 14.6 272 | 1.2 273 | -5.2 274 | 11.4 275 | 12.8 276 | 16.6 277 | 13.6 278 | 14.6 279 | 16.7 280 | 15 281 | 7.9 282 | 10.8 283 | 12.1 284 | 7.4 285 | 8.7 286 | 16.5 287 | 10 288 | 11.1 289 | 10.6 290 | 1.1 291 | 19.9 292 | 2.3 293 | 8.5 294 | 4.5 295 | -3.2 296 | -2.7 297 | -0.1 298 | -11.5 299 | -1.8 300 | 1.4 301 | -8.2 302 | -9.4 303 | -0.3 304 | -11 305 | -4.3 306 | -17.5 307 | -7.1 308 | -2.2 309 | 1.3 310 | -9.3 311 | -0.4 312 | 3.3 313 | 7.5 314 | -3 315 | -0.3 316 | -4.6 317 | -7.3 318 | -8.9 319 | -15 320 | 7 321 | 4.3 322 | 4 323 | -5.3 324 | -4 325 | -4 326 | 0.5 327 | 4.7 328 | 11.2 329 | 6.9 330 | 0.2 331 | -1.7 332 | 4.5 333 | 7.2 334 | 4.7 335 | -2.5 336 | 4.5 337 | 6.3 338 | 7.6 339 | 0.3 340 | 6.8 341 | 5.9 342 | -3.1 343 | 5.7 344 | -20.5 345 | 7.9 346 | 1.8 347 | -2.5 348 | -0.4 349 | -0.3 350 | 1.1 351 | -4.7 352 | 6.8 353 | 12.5 354 | 16.5 355 | -5.2 356 | -3.1 357 | -0.8 358 | 12.1 359 | 5.1 360 | -0.4 361 | 4.5 362 | 5.2 363 | 10.4 364 | 4.2 365 | 0.3 366 | 8.4 367 | 2.7 368 | 5.5 369 | 7.2 370 | 2.5 371 | -10.2 372 | -2.2 373 | -2.8 374 | -5.9 375 | -14.8 376 | -9.1 377 | -12.9 378 | -4.1 379 | -2.2 380 | 5.5 381 | 1.3 382 | 6.9 383 
| 5.8 384 | 5.1 385 | 14.2 386 | 14 387 | 14.2 388 | 2.3 389 | -4.3 390 | -4.6 391 | 1.2 392 | 2.1 393 | -10.4 394 | -0.4 395 | -10.9 396 | -21 397 | -10.1 398 | -13.5 399 | -11 400 | -16.7 401 | 0.3 402 | -12.7 403 | -4.7 404 | -12.8 405 | -6 406 | -7.8 407 | 0.3 408 | -0.4 409 | 4.5 410 | -1.8 411 | -2.2 412 | 0.4 413 | -4.8 414 | 14.1 415 | 12.6 416 | 6.5 417 | -3.8 418 | -2.6 419 | 4.5 420 | 0.8 421 | 5.7 422 | 5.8 423 | -0.3 424 | -4.6 425 | -6.8 426 | 3.6 427 | 9.1 428 | -3.6 429 | -3 430 | 14.3 431 | 10 432 | 6.3 433 | 0.3 434 | -2.4 435 | -1.6 436 | -3.4 437 | 0.3 438 | -14.2 439 | -7.6 440 | -0.7 441 | -8.2 442 | -5.6 443 | -1.1 444 | -6.4 445 | -4 446 | -10 447 | -11.6 448 | -0.2 449 | 2.3 450 | -10.8 451 | -12.1 452 | 0.7 453 | -4.5 454 | 2.5 455 | 8.6 456 | -5.2 457 | 3.9 458 | 12.8 459 | 11 460 | 18.8 461 | 16.1 462 | 2.1 463 | 15.5 464 | 16.1 465 | 19.6 466 | 9.2 467 | 1.7 468 | 1.4 469 | 14.2 470 | 15.8 471 | 18.6 472 | 6.8 473 | 0.8 474 | 3.1 475 | 7.2 476 | 1.2 477 | -5.2 478 | -24 479 | -10.9 480 | -17.3 481 | -8.2 482 | -14.1 483 | -11 484 | -3.4 485 | -13.4 486 | -3.6 487 | -15 488 | -0.3 489 | -2.3 490 | 3.3 491 | 10 492 | 5.7 493 | 11.8 494 | 13.4 495 | 10.4 496 | 31.5 497 | 15.6 498 | 20.3 499 | 16 500 | 17 501 | 9.4 502 | 10.6 503 | 1.7 504 | 11.1 505 | 6.3 506 | 12.2 507 | 9.2 508 | -1.5 509 | 0.3 510 | -6 511 | 4.7 512 | 9.4 513 | 12.3 514 | 6.2 515 | 12.8 516 | 19.6 517 | 19.7 518 | 22.2 519 | 18.6 520 | 13.1 521 | 17.6 522 | 11.2 523 | 12.6 524 | 10.8 525 | 0.6 526 | 2.5 527 | 0.3 528 | -11.9 529 | -11.3 530 | -12.4 531 | 3.5 532 | 9.3 533 | -20 534 | -4.1 535 | 8.6 536 | -9.4 537 | -8.2 538 | -9.3 539 | -15.8 540 | -13.7 541 | -11.3 542 | -8.8 543 | -12.9 544 | -14.2 545 | -11.4 546 | -3.6 547 | -26.9 548 | -6 549 | -7.4 550 | 15.8 551 | 4.5 552 | 5.1 553 | 2.1 554 | 1.1 555 | -5.3 556 | -2.1 557 | -2.2 558 | -4.6 559 | 6.2 560 | -3.6 561 | -5.2 562 | 4 563 | 4.5 564 | 13.6 565 | -4.6 566 | 1.7 567 | -2.2 568 | -4.6 569 | -8.3 570 | 2.6 571 | 0.3 572 | -8.4 573 | -11.8 574 | -2.6 575 | -3.9 576 | -1.6 577 | 1.5 578 | -4.7 579 | -0.9 580 | -3.4 581 | -2.2 582 | 2.1 583 | -4.2 584 | -15.6 585 | -5.2 586 | 8.4 587 | 12.1 588 | 8.1 589 | 5.1 590 | 6.4 591 | -5.3 592 | 2.3 593 | 3.4 594 | 8.8 595 | -0.2 596 | 0.7 597 | -2.3 598 | -7.1 599 | -17.2 600 | -17.9 601 | -22.2 602 | -20 603 | -20.5 604 | -30 605 | -22.6 606 | -31.4 607 | -35.7 608 | -25.7 609 | -15.5 610 | 5.5 611 | -3.2 612 | -7 613 | 0.9 614 | 9.9 615 | 4.7 616 | -0.8 617 | -1.2 618 | 0.7 619 | 5.2 620 | -6.5 621 | 1.3 622 | 0.3 623 | -8.1 624 | 0.8 625 | 2.1 626 | 2.3 627 | -4.7 628 | 3.6 629 | -2.7 630 | -4.6 631 | 6.2 632 | -2.7 633 | 12.3 634 | 3.3 635 | -8.8 636 | -2.2 637 | 8.2 638 | 0.5 639 | -5.3 640 | -1.5 641 | 0.8 642 | 7.4 643 | -12.1 644 | -0.3 645 | 0.6 646 | -5.6 647 | 8.6 648 | 2 649 | -7 650 | -4.7 651 | 6.6 652 | -13.5 653 | -15 654 | -7 655 | -14 656 | -15.6 657 | -22.1 658 | -19.6 659 | -17.9 660 | -17.3 661 | -13.1 662 | -10.6 663 | -5.3 664 | -1.5 665 | -5.8 666 | -1.7 667 | -6.2 668 | 1.2 669 | -3 670 | 9.9 671 | -3.9 672 | 10.5 673 | 14.2 674 | 18.7 675 | 15.5 676 | 22 677 | 9.5 678 | 12.7 679 | 8.6 680 | 5.5 681 | 16.7 682 | 14.3 683 | 5.8 684 | 8.7 685 | -5.8 686 | 5.8 687 | 7.9 688 | -2.1 689 | -5.8 690 | -1.9 691 | -18.4 692 | -8.2 693 | -0.7 694 | 13.6 695 | 0 696 | 5.2 697 | -4.4 698 | -7.3 699 | -1.2 700 | -5 701 | -3.2 702 | 4.2 703 | -0.2 704 | -10.1 705 | -11.5 706 | -17.9 707 | -5.5 708 | -1.5 709 | -6.8 710 | -16.2 711 | -13.5 712 | -6.9 713 | -18.3 714 | -26 715 | 
-10.3 716 | -22.2 717 | -16.5 718 | 1.3 719 | -11.9 720 | -6.5 721 | 1.7 722 | 1.1 723 | -10.4 724 | -7 725 | -7 726 | -10 727 | -8 728 | -10 729 | -19 730 | -10 731 | -------------------------------------------------------------------------------- /data/amess.csv: -------------------------------------------------------------------------------- 1 | folate,treatmnt 2 | 243,1 3 | 251,1 4 | 275,1 5 | 291,1 6 | 347,1 7 | 354,1 8 | 380,1 9 | 392,1 10 | 206,2 11 | 210,2 12 | 226,2 13 | 249,2 14 | 255,2 15 | 273,2 16 | 285,2 17 | 295,2 18 | 309,2 19 | 241,3 20 | 258,3 21 | 270,3 22 | 293,3 23 | 328,3 24 | -------------------------------------------------------------------------------- /data/clinicalTrials.txt: -------------------------------------------------------------------------------- 1 | Cell.Count Drug.concentration 2 | 01-01 4.30 2882. 3 | 01-02 5.64 4155. 4 | 01-03 5.15 5286. 5 | 02-04 5.50 3765. 6 | 02-05 6.20 2978. 7 | 02-07 3.00 1638. 8 | 02-08 2.90 1631. 9 | 01-09 4.13 2684. 10 | 01-10 4.90 4475. 11 | 02-11 8.40 3540. 12 | 01-12 3.48 1755. 13 | 01-13 3.38 1595. 14 | 02-01 6.10 3259. 15 | 01-02 1.86 808. 16 | 01-04 8.72 3571. 17 | 01-05 4.32 1703. 18 | 02-06 6.80 3984. 19 | 01-07 4.08 2168. 20 | -------------------------------------------------------------------------------- /data/crab.csv: -------------------------------------------------------------------------------- 1 | "Obs","C","S","W","Wt","Sa" 2 | 1,2,3,28.3,3.05,8 3 | 2,3,3,26,2.6,4 4 | 3,3,3,25.6,2.15,0 5 | 4,4,2,21,1.85,0 6 | 5,2,3,29,3,1 7 | 6,1,2,25,2.3,3 8 | 7,4,3,26.2,1.3,0 9 | 8,2,3,24.9,2.1,0 10 | 9,2,1,25.7,2,8 11 | 10,2,3,27.5,3.15,6 12 | 11,1,1,26.1,2.8,5 13 | 12,3,3,28.9,2.8,4 14 | 13,2,1,30.3,3.6,3 15 | 14,2,3,22.9,1.6,4 16 | 15,3,3,26.2,2.3,3 17 | 16,3,3,24.5,2.05,5 18 | 17,2,3,30,3.05,8 19 | 18,2,3,26.2,2.4,3 20 | 19,2,3,25.4,2.25,6 21 | 20,2,3,25.4,2.25,4 22 | 21,4,3,27.5,2.9,0 23 | 22,4,3,27,2.25,3 24 | 23,2,2,24,1.7,0 25 | 24,2,1,28.7,3.2,0 26 | 25,3,3,26.5,1.97,1 27 | 26,2,3,24.5,1.6,1 28 | 27,3,3,27.3,2.9,1 29 | 28,2,3,26.5,2.3,4 30 | 29,2,3,25,2.1,2 31 | 30,3,3,22,1.4,0 32 | 31,1,1,30.2,3.28,2 33 | 32,2,2,25.4,2.3,0 34 | 33,2,1,24.9,2.3,6 35 | 34,4,3,25.8,2.25,10 36 | 35,3,3,27.2,2.4,5 37 | 36,2,3,30.5,3.32,3 38 | 37,4,3,25,2.1,8 39 | 38,2,3,30,3,9 40 | 39,2,1,22.9,1.6,0 41 | 40,2,3,23.9,1.85,2 42 | 41,2,3,26,2.28,3 43 | 42,2,3,25.8,2.2,0 44 | 43,3,3,29,3.28,4 45 | 44,1,1,26.5,2.35,0 46 | 45,3,3,22.5,1.55,0 47 | 46,2,3,23.8,2.1,0 48 | 47,3,3,24.3,2.15,0 49 | 48,2,1,26,2.3,14 50 | 49,4,3,24.7,2.2,0 51 | 50,2,1,22.5,1.6,1 52 | 51,2,3,28.7,3.15,3 53 | 52,1,1,29.3,3.2,4 54 | 53,2,1,26.7,2.7,5 55 | 54,4,3,23.4,1.9,0 56 | 55,1,1,27.7,2.5,6 57 | 56,2,3,28.2,2.6,6 58 | 57,4,3,24.7,2.1,5 59 | 58,2,1,25.7,2,5 60 | 59,2,1,27.8,2.75,0 61 | 60,3,1,27,2.45,3 62 | 61,2,3,29,3.2,10 63 | 62,3,3,25.6,2.8,7 64 | 63,3,3,24.2,1.9,0 65 | 64,3,3,25.7,1.2,0 66 | 65,3,3,23.1,1.65,0 67 | 66,2,3,28.5,3.05,0 68 | 67,2,1,29.7,3.85,5 69 | 68,3,3,23.1,1.55,0 70 | 69,3,3,24.5,2.2,1 71 | 70,2,3,27.5,2.55,1 72 | 71,2,3,26.3,2.4,1 73 | 72,2,3,27.8,3.25,3 74 | 73,2,3,31.9,3.33,2 75 | 74,2,3,25,2.4,5 76 | 75,3,3,26.2,2.22,0 77 | 76,3,3,28.4,3.2,3 78 | 77,1,2,24.5,1.95,6 79 | 78,2,3,27.9,3.05,7 80 | 79,2,2,25,2.25,6 81 | 80,3,3,29,2.92,3 82 | 81,2,1,31.7,3.73,4 83 | 82,2,3,27.6,2.85,4 84 | 83,4,3,24.5,1.9,0 85 | 84,3,3,23.8,1.8,0 86 | 85,2,3,28.2,3.05,8 87 | 86,3,3,24.1,1.8,0 88 | 87,1,1,28,2.62,0 89 | 88,1,1,26,2.3,9 90 | 89,3,2,24.7,1.9,0 91 | 90,2,3,25.8,2.65,0 92 | 91,1,1,27.1,2.95,8 93 | 92,2,3,27.4,2.7,5 94 | 93,3,3,26.7,2.6,2 95 | 
94,2,1,26.8,2.7,5 96 | 95,1,3,25.8,2.6,0 97 | 96,4,3,23.7,1.85,0 98 | 97,2,3,27.9,2.8,6 99 | 98,2,1,30,3.3,5 100 | 99,2,3,25,2.1,4 101 | 100,2,3,27.7,2.9,5 102 | 101,2,3,28.3,3,15 103 | 102,4,3,25.5,2.25,0 104 | 103,2,3,26,2.15,5 105 | 104,2,3,26.2,2.4,0 106 | 105,3,3,23,1.65,1 107 | 106,2,2,22.9,1.6,0 108 | 107,2,3,25.1,2.1,5 109 | 108,3,1,25.9,2.55,4 110 | 109,4,1,25.5,2.75,0 111 | 110,2,1,26.8,2.55,0 112 | 111,2,1,29,2.8,1 113 | 112,3,3,28.5,3,1 114 | 113,2,2,24.7,2.55,4 115 | 114,2,3,29,3.1,1 116 | 115,2,3,27,2.5,6 117 | 116,4,3,23.7,1.8,0 118 | 117,3,3,27,2.5,6 119 | 118,2,3,24.2,1.65,2 120 | 119,4,3,22.5,1.47,4 121 | 120,2,3,25.1,1.8,0 122 | 121,2,3,24.9,2.2,0 123 | 122,2,3,27.5,2.63,6 124 | 123,2,1,24.3,2,0 125 | 124,2,3,29.5,3.02,4 126 | 125,2,3,26.2,2.3,0 127 | 126,2,3,24.7,1.95,4 128 | 127,3,2,29.8,3.5,4 129 | 128,4,3,25.7,2.15,0 130 | 129,3,3,26.2,2.17,2 131 | 130,4,3,27,2.63,0 132 | 131,3,3,24.8,2.1,0 133 | 132,2,1,23.7,1.95,0 134 | 133,2,3,28.2,3.05,11 135 | 134,2,3,25.2,2,1 136 | 135,2,2,23.2,1.95,4 137 | 136,4,3,25.8,2,3 138 | 137,4,3,27.5,2.6,0 139 | 138,2,2,25.7,2,0 140 | 139,2,3,26.8,2.65,0 141 | 140,3,3,27.5,3.1,3 142 | 141,3,1,28.5,3.25,9 143 | 142,2,3,28.5,3,3 144 | 143,1,1,27.4,2.7,6 145 | 144,2,3,27.2,2.7,3 146 | 145,3,3,27.1,2.55,0 147 | 146,2,3,28,2.8,1 148 | 147,2,1,26.5,1.3,0 149 | 148,3,3,23,1.8,0 150 | 149,3,2,26,2.2,3 151 | 150,3,2,24.5,2.25,0 152 | 151,2,3,25.8,2.3,0 153 | 152,4,3,23.5,1.9,0 154 | 153,4,3,26.7,2.45,0 155 | 154,3,3,25.5,2.25,0 156 | 155,2,3,28.2,2.87,1 157 | 156,2,1,25.2,2,1 158 | 157,2,3,25.3,1.9,2 159 | 158,3,3,25.7,2.1,0 160 | 159,4,3,29.3,3.23,12 161 | 160,3,3,23.8,1.8,6 162 | 161,2,3,27.4,2.9,3 163 | 162,2,3,26.2,2.02,2 164 | 163,2,1,28,2.9,4 165 | 164,2,1,28.4,3.1,5 166 | 165,2,1,33.5,5.2,7 167 | 166,2,3,25.8,2.4,0 168 | 167,3,3,24,1.9,10 169 | 168,2,1,23.1,2,0 170 | 169,2,3,28.3,3.2,0 171 | 170,2,3,26.5,2.35,4 172 | 171,2,3,26.5,2.75,7 173 | 172,3,3,26.1,2.75,3 174 | 173,2,2,24.5,2,0 175 | -------------------------------------------------------------------------------- /data/diet.csv: -------------------------------------------------------------------------------- 1 | "id","gender","age","height","diet.type","initial.weight","final.weight" 2 | 1,"Female",22,159,"A",58,54.2 3 | 2,"Female",46,192,"A",60,54 4 | 3,"Female",55,170,"A",64,63.3 5 | 4,"Female",33,171,"A",64,61.1 6 | 5,"Female",50,170,"A",65,62.2 7 | 6,"Female",50,201,"A",66,64 8 | 7,"Female",37,174,"A",67,65 9 | 8,"Female",28,176,"A",69,60.5 10 | 9,"Female",28,165,"A",70,68.1 11 | 10,"Female",45,165,"A",70,66.9 12 | 11,"Female",60,173,"A",72,70.5 13 | 12,"Female",48,156,"A",72,69 14 | 13,"Female",41,163,"A",72,68.4 15 | 14,"Female",37,167,"A",82,81.1 16 | 27,"Female",44,174,"B",58,60.1 17 | 28,"Female",37,172,"B",58,56 18 | 29,"Female",41,165,"B",59,57.3 19 | 30,"Female",43,171,"B",61,56.7 20 | 31,"Female",20,169,"B",62,55 21 | 32,"Female",51,174,"B",63,62.4 22 | 33,"Female",31,163,"B",63,60.3 23 | 34,"Female",54,173,"B",63,59.4 24 | 35,"Female",50,166,"B",65,62 25 | 36,"Female",48,163,"B",66,64 26 | 37,"Female",16,165,"B",68,63.8 27 | 38,"Female",37,167,"B",68,63.3 28 | 39,"Female",30,161,"B",76,72.7 29 | 40,"Female",29,169,"B",77,77.5 30 | 52,"Female",51,165,"C",60,53 31 | 53,"Female",35,169,"C",62,56.4 32 | 54,"Female",21,159,"C",64,60.6 33 | 55,"Female",22,169,"C",65,58.2 34 | 56,"Female",36,160,"C",66,58.2 35 | 57,"Female",20,169,"C",67,61.6 36 | 58,"Female",35,163,"C",67,60.2 37 | 59,"Female",45,155,"C",69,61.8 38 | 60,"Female",58,141,"C",70,63 39 | 
61,"Female",37,170,"C",70,62.7 40 | 62,"Female",31,170,"C",72,71.1 41 | 63,"Female",35,171,"C",72,64.4 42 | 64,"Female",56,171,"C",73,68.9 43 | 65,"Female",48,153,"C",75,68.7 44 | 66,"Female",41,157,"C",76,71 45 | 15,"Male",39,168,"A",71,71.6 46 | 16,"Male",31,158,"A",72,70.9 47 | 17,"Male",40,173,"A",74,69.5 48 | 18,"Male",50,160,"A",78,73.9 49 | 19,"Male",43,162,"A",80,71 50 | 20,"Male",25,165,"A",80,77.6 51 | 21,"Male",52,177,"A",83,79.1 52 | 22,"Male",42,166,"A",85,81.5 53 | 23,"Male",39,166,"A",87,81.9 54 | 24,"Male",40,190,"A",88,84.5 55 | 41,"Male",51,191,"B",71,66.8 56 | 42,"Male",38,199,"B",75,72.6 57 | 43,"Male",54,196,"B",75,69.2 58 | 44,"Male",33,190,"B",76,72.5 59 | 45,"Male",45,160,"B",78,72.7 60 | 46,"Male",37,194,"B",78,76.3 61 | 47,"Male",44,163,"B",79,73.6 62 | 48,"Male",40,171,"B",79,72.9 63 | 49,"Male",37,198,"B",79,71.1 64 | 50,"Male",39,180,"B",80,81.4 65 | 51,"Male",31,182,"B",80,75.7 66 | 67,"Male",36,155,"C",71,68.5 67 | 68,"Male",47,179,"C",73,72.1 68 | 69,"Male",29,166,"C",76,72.5 69 | 70,"Male",37,173,"C",78,77.5 70 | 71,"Male",31,177,"C",78,75.2 71 | 72,"Male",26,179,"C",78,69.4 72 | 73,"Male",40,179,"C",79,74.5 73 | 74,"Male",35,183,"C",83,80.2 74 | 75,"Male",49,177,"C",84,79.9 75 | 76,"Male",28,164,"C",85,79.7 76 | 77,"Male",40,167,"C",87,77.8 77 | 78,"Male",51,175,"C",88,81.9 78 | -------------------------------------------------------------------------------- /data/genotypes.txt: -------------------------------------------------------------------------------- 1 | "AA" "AB" "BB" 2 | "1" 2.51304714898469 6.32886230947307 NA 3 | "2" 6.16876708252332 5.60757586162325 7.63948757433344 4 | "3" 3.18458867678788 8.26959775839993 6.79579919463547 5 | "4" 7.88995995654271 4.27138992776303 7.18864022126384 6 | "5" 5.14639512210706 6.28291661807179 7.48205753600522 7 | "6" NA 7.27477208906786 7.93472451951159 8 | "7" NA 7.18481566745089 9.20833919961227 9 | -------------------------------------------------------------------------------- /data/globalBreastCancerRisk.csv: -------------------------------------------------------------------------------- 1 | country,continent,year,lifeExp,pop,gdpPercap,NewCasesOfBreastCancerIn2002,AlcoholComsumption,BloodPressure,BodyMassIndex,Cholestorol,Smoking 2 | Afghanistan,Asia,2002,42.129,25268405,726.7340548,26.8,0.02,124.2085,20.65274,4.29517,NA 3 | Albania,Europe,2002,75.651,3508512,4604.211737,57.4,6.68,129.0609,25.27082,4.918646,4 4 | Algeria,Africa,2002,70.994,31287142,5288.040382,23.5,0.96,130.4024,25.69948,4.848951,0.3 5 | Angola,Africa,2002,41.003,10866106,2773.287312,23.1,5.4,129.9282,22.26093,4.499115,NA 6 | Argentina,Americas,2002,74.34,38331121,8797.640716,73.9,10,119.6538,26.7046,5.143871,25.4 7 | Australia,Oceania,2002,80.37,19546792,30687.75473,83.2,10.02,120.5113,26.25957,5.326858,21.8 8 | Austria,Europe,2002,78.98,8148312,32417.60769,70.5,13.24,125.8685,24.83051,5.381785,40.1 9 | Bahrain,Asia,2002,74.795,656397,23403.55927,40.2,3.66,132.4395,27.96036,5.204952,2.9 10 | Bangladesh,Asia,2002,62.013,135656790,1136.39043,16.6,0.17,124.5601,19.72414,4.400593,3.8 11 | Belgium,Europe,2002,78.32,10311970,30485.88375,92,10.77,124.8811,25.12181,5.454941,24.1 12 | Benin,Africa,2002,54.406,7026113,1372.877931,28.1,2.15,129.6162,23.14637,4.29748,NA 13 | Bolivia,Americas,2002,63.883,8445134,3413.26269,24.7,5.12,122.9154,26.1348,4.745852,29.2 14 | Bosnia and Herzegovina,Europe,2002,74.09,4165416,6018.975239,58.9,9.63,133.1459,25.96649,4.759707,35.1 15 | 
Botswana,Africa,2002,46.634,1630347,11003.60508,33.4,7.96,132.4681,25.52204,4.72277,NA 16 | Brazil,Americas,2002,71.006,179914212,8131.212843,46,9.16,125.8598,25.36366,4.899372,NA 17 | Bulgaria,Europe,2002,72.14,7661799,7696.777725,46.2,12.44,129.3342,25.24871,5.095784,27.8 18 | Burkina Faso,Africa,2002,50.65,12251209,1037.645221,30.6,6.98,129.2223,21.12089,4.127278,11.2 19 | Burundi,Africa,2002,47.36,7021078,446.4035126,19.5,9.47,131.8824,20.81771,4.170466,NA 20 | Cambodia,Asia,2002,56.752,12926707,896.2260153,21.5,4.77,117.6568,21.08197,4.443956,6.5 21 | Cameroon,Africa,2002,49.856,15929988,1934.011449,29.7,7.57,125.9758,23.99221,4.359665,2.2 22 | Canada,Americas,2002,79.77,31902268,33328.96507,84.3,9.77,120.5054,26.29888,5.263165,18.9 23 | Chad,Africa,2002,50.525,8835739,1156.18186,16.5,4.38,126.8861,21.18393,4.14814,2.6 24 | Chile,Americas,2002,77.86,15497046,10778.78385,43.9,8.55,126.4672,27.13405,5.088821,33.6 25 | China,Asia,2002,72.028,1280400000,3119.280896,18.7,5.91,123.3793,22.67688,4.451926,3.7 26 | Colombia,Americas,2002,71.682,41008227,5755.259962,30.3,6.17,123.3725,25.88257,5.016418,NA 27 | Comoros,Africa,2002,62.974,614382,1075.811558,19.5,0.36,130.3333,22.08826,4.379765,13.5 28 | Costa Rica,Americas,2002,78.123,3834934,7723.447195,30.9,5.55,121.6703,26.29537,4.884666,7.3 29 | Croatia,Europe,2002,74.876,4481020,11628.38895,62.1,15.11,131.4386,24.78217,5.093568,29.1 30 | Cuba,Americas,2002,77.158,11226999,6340.646683,31.2,5.51,125.4713,25.73888,4.708419,28.3 31 | Denmark,Europe,2002,77.18,5374693,32166.50006,88.7,13.37,122.0758,24.74319,5.479227,30.6 32 | Djibouti,Africa,2002,53.373,447416,1908.260867,19.5,1.87,128.7091,23.74615,4.707457,NA 33 | Ecuador,Americas,2002,74.173,12921234,5773.044512,23.5,9.38,123.7925,26.55303,4.837632,5.8 34 | Egypt,Africa,2002,69.806,73312559,4754.604414,24.2,0.37,125.4958,29.20828,4.839036,1.3 35 | El Salvador,Americas,2002,70.734,6353681,5351.568666,13.6,3.61,119.8362,26.8835,4.708999,NA 36 | Equatorial Guinea,Africa,2002,49.348,495627,7703.4959,16.5,6.08,132.0235,23.19161,4.695409,NA 37 | Eritrea,Africa,2002,55.24,4414865,765.3500015,19.5,1.54,123.9032,20.60579,4.252667,1.2 38 | Ethiopia,Africa,2002,50.725,67946797,530.0535319,24.7,4.02,123.9984,20.1839,4.203238,0.9 39 | Finland,Europe,2002,78.37,5193039,28204.59057,84.7,12.52,128.6964,25.41902,5.436867,24.4 40 | France,Europe,2002,79.59,59925035,28926.03234,91.9,13.66,122.8281,24.72699,5.43699,26.7 41 | Gabon,Africa,2002,56.761,1299304,12521.71392,18.2,9.32,132.3095,24.91982,4.942365,NA 42 | Gambia,Africa,2002,58.041,1457766,660.5855997,6.4,3.39,129.5809,23.63134,4.303299,2.9 43 | Germany,Europe,2002,78.67,82350671,30035.80198,79.8,12.81,128.0019,25.57631,5.593814,25.8 44 | Ghana,Africa,2002,58.453,20550751,1111.984578,28.1,2.97,128.9192,23.47061,4.238891,0.8 45 | Greece,Europe,2002,78.256,10603863,22514.2548,51.6,10.75,124.3581,24.73157,4.970351,39.8 46 | Guatemala,Americas,2002,68.978,11178650,4858.347495,25.9,4.03,120.3851,25.94954,4.56032,4.1 47 | Guinea,Africa,2002,53.676,8807818,945.5835837,15.3,0.76,131.2604,21.94696,4.222444,NA 48 | Guinea-Bissau,Africa,2002,45.504,1332459,575.7047176,28.1,3.68,129.6245,22.44935,4.170127,NA 49 | Haiti,Americas,2002,58.137,7607651,1270.364932,4.4,6.61,124.3388,22.67272,4.373766,NA 50 | Honduras,Americas,2002,68.565,6677328,3099.72866,25.9,4.48,122.2106,25.97615,4.593344,3.4 51 | Hungary,Europe,2002,72.59,10083313,14843.93556,63,16.27,129.7824,25.59566,5.189877,33.9 52 | 
Iceland,Europe,2002,80.5,288030,31163.20196,90,6.31,120.6813,25.50528,5.658249,26.6 53 | India,Asia,2002,62.879,1034172547,1746.769454,19.1,2.59,124.2951,21.03814,4.547826,3.8 54 | Indonesia,Asia,2002,68.588,211060000,2873.91287,26.1,0.59,126.4575,22.40431,4.622888,4.5 55 | Iran,Asia,2002,69.451,66907826,9240.761975,17.1,1.02,125.5227,26.13108,5.216626,5.5 56 | Iraq,Asia,2002,57.046,24001816,4390.717312,31.7,0.4,126.3037,27.93361,4.916117,2.5 57 | Ireland,Europe,2002,77.783,3879155,34077.04939,74.9,14.41,126.107,26.19135,5.426795,26 58 | Israel,Asia,2002,79.696,6029529,21905.59514,90.8,2.89,123.6205,26.72752,5.303616,17.9 59 | Italy,Europe,2002,80.24,57926999,27968.09817,74.4,10.68,125.5278,24.76686,5.27189,19.2 60 | Jamaica,Americas,2002,72.047,2664659,6994.774861,43.5,5,124.9047,26.64816,4.742052,9.2 61 | Japan,Asia,2002,82,127065841,28604.5919,32.7,8.03,123.9938,21.92321,5.197184,14.3 62 | Jordan,Asia,2002,71.263,5307470,3844.917194,33,0.71,123.1612,29.10914,5.21362,9.8 63 | Kenya,Africa,2002,50.992,31386842,1287.514732,25.2,4.14,127.8673,22.62218,4.405716,2.2 64 | Kuwait,Asia,2002,76.904,2111561,35110.10566,31.8,0.1,125.9623,30.34041,5.345846,NA 65 | Lebanon,Asia,2002,71.028,3677780,9313.93883,52.5,2.23,129.0933,27.12636,4.99548,7 66 | Lesotho,Africa,2002,44.593,2046772,1275.184575,13.1,5.55,129.5081,26.32388,4.260839,NA 67 | Liberia,Africa,2002,43.753,2814651,531.4823679,18.8,5.06,130.2577,22.69692,4.176063,NA 68 | Libya,Africa,2002,72.737,5368585,9534.677467,23.4,0.11,132.6075,28.51032,4.838883,NA 69 | Madagascar,Africa,2002,57.286,16473477,894.6370822,19.5,1.33,130.8052,20.65751,4.356253,NA 70 | Malawi,Africa,2002,45.009,11824495,665.4231186,10.5,1.74,130.9474,22.25219,4.331791,6.2 71 | Malaysia,Asia,2002,73.044,22662365,10206.97794,30.8,0.82,124.6766,24.86649,5.06707,2.8 72 | Mali,Africa,2002,51.818,10580176,951.4097518,18.2,1.04,127.2195,21.94506,4.195883,2.8 73 | Mauritania,Africa,2002,62.247,2828858,1579.019543,28.1,0.11,129.162,25.43653,4.372733,3.7 74 | Mauritius,Africa,2002,71.954,1200206,9021.815894,31.6,3.72,130.6751,25.57969,4.934729,1.1 75 | Mexico,Americas,2002,74.902,102479927,10742.44053,26.4,8.42,122.6993,27.97481,5.003826,12.4 76 | Mongolia,Asia,2002,65.033,2674234,2140.739323,6.6,3.24,128.8302,25.20751,4.831112,6.5 77 | Morocco,Africa,2002,69.615,31167783,3258.495584,22.5,1.46,126.6631,25.65263,4.749074,0.3 78 | Mozambique,Africa,2002,44.026,18473780,633.6179466,3.9,2.38,133.2081,22.44782,4.385706,3.4 79 | Namibia,Africa,2002,51.479,1972153,4072.324751,24.7,9.62,132.2566,24.39392,4.592049,10.9 80 | Nepal,Asia,2002,61.34,25873917,1057.206311,21.8,2.41,124.8058,20.16195,4.342687,26.4 81 | Netherlands,Europe,2002,78.53,16122830,33724.75778,86.7,10.05,123.8227,25.11372,5.389425,30.3 82 | New Zealand,Oceania,2002,79.11,3908037,23189.80135,91.9,9.62,121.1642,26.72162,5.380087,27.5 83 | Nicaragua,Americas,2002,70.836,5146848,2474.548819,23.9,5.37,122.9739,26.78785,4.589401,NA 84 | Niger,Africa,2002,54.496,11140655,601.0745012,23.3,0.34,132.2249,21.47314,4.089284,NA 85 | Nigeria,Africa,2002,46.608,119901274,1615.286395,31.2,12.28,134.2376,23.24367,4.110205,1.2 86 | Norway,Europe,2002,79.05,4535591,44683.97525,74.8,7.81,127.886,25.32467,5.410416,30.4 87 | Oman,Asia,2002,74.193,2713462,19774.83687,13.2,0.94,128.2915,26.41259,5.110034,1.3 88 | Pakistan,Asia,2002,63.61,153403524,2092.712441,50.1,0.06,126.2059,23.07639,4.591925,6.6 89 | Panama,Americas,2002,74.712,2990875,7356.031934,29,6.85,123.0864,26.76206,4.912917,NA 90 | 
Paraguay,Americas,2002,70.755,5884491,3783.674243,34.4,7.88,123.7607,25.36778,4.846562,14.8 91 | Peru,Americas,2002,69.906,26769436,5909.020073,35.1,6.9,120.7746,25.84887,4.82459,NA 92 | Philippines,Asia,2002,70.303,82995088,2650.921068,46.6,6.38,122.4797,23.00286,4.85752,9.8 93 | Poland,Europe,2002,74.67,38625976,12002.23908,50.3,13.25,130.0524,25.7373,5.206347,27.2 94 | Portugal,Europe,2002,77.29,10433867,19970.90787,55.5,14.55,129.172,25.9666,5.236896,31 95 | Romania,Europe,2002,71.322,22404337,7885.360081,44.3,15.3,129.3499,24.94229,4.968129,24.5 96 | Rwanda,Africa,2002,43.413,7852401,785.6537648,8.8,9.8,132.8907,21.56031,4.285954,NA 97 | Saudi Arabia,Asia,2002,71.626,24501530,19014.54118,24.7,0.25,127.3608,28.75728,4.991571,3.6 98 | Senegal,Africa,2002,61.6,10870037,1519.635262,18.4,0.6,129.346,23.53851,4.330656,1.5 99 | Sierra Leone,Africa,2002,41.012,5359092,699.489713,28.1,9.72,133.424,22.99407,4.110871,NA 100 | Singapore,Asia,2002,78.77,4197776,36023.1054,48.7,1.55,123.2673,23.23321,4.835357,NA 101 | Slovenia,Europe,2002,76.66,2011497,20660.01936,58.9,15.19,130.9412,26.39933,5.274947,21.1 102 | Somalia,Africa,2002,45.936,7753310,882.0818218,19.5,0.5,129.8451,22.18624,4.362204,NA 103 | South Africa,Africa,2002,53.365,44433622,7710.946444,35,9.46,130.0713,28.50839,4.563033,9.1 104 | Spain,Europe,2002,79.78,40152517,24835.47166,50.9,11.62,123.514,26.07576,5.19105,30.9 105 | Sri Lanka,Asia,2002,70.815,19576783,3015.378833,23.6,0.79,124.3827,22.71949,4.627639,2.6 106 | Sudan,Africa,2002,56.369,37090298,1993.398314,22.5,2.56,128.1803,22.50693,4.520633,NA 107 | Swaziland,Africa,2002,43.869,1130269,4128.116943,12.3,5.7,130.7529,27.73696,4.571466,3.2 108 | Sweden,Europe,2002,80.04,8954175,29341.63093,87.8,10.1,125.0731,24.99897,5.18503,24.5 109 | Switzerland,Europe,2002,80.62,7361757,34480.95771,81.7,11.06,121.6552,24.04547,5.349241,22.2 110 | Syria,Asia,2002,73.053,17155814,4090.925331,44.8,1.43,127.6258,28.23422,4.885778,NA 111 | Tanzania,Africa,2002,49.651,34593779,899.0742111,21.1,6.75,128.5271,22.5268,4.250212,4.3 112 | Thailand,Asia,2002,68.564,62806748,5913.187529,16.6,7.08,120.5435,23.9693,5.039856,3.4 113 | Togo,Africa,2002,57.561,4977378,886.2205765,28.1,1.99,129.8265,22.07173,4.169999,NA 114 | Trinidad and Tobago,Americas,2002,68.976,1101832,11460.60023,51.1,6.28,124.3713,27.50818,4.756425,7.6 115 | Tunisia,Africa,2002,73.042,9770575,5722.895655,19.6,1.29,129.5677,27.26516,4.830195,1.9 116 | Turkey,Europe,2002,70.845,67308928,6508.085718,22,2.87,126.0769,27.89841,4.855077,19.2 117 | Uganda,Africa,2002,47.813,24739869,927.7210018,18.3,11.93,132.7281,21.9589,4.288926,3.2 118 | United Kingdom,Europe,2002,78.471,59912431,29478.99919,87.2,13.37,127.7889,26.431,5.45187,34.7 119 | United States,Americas,2002,77.31,287675526,39097.09955,101.1,9.44,119.7817,27.75614,5.286608,21.5 120 | Uruguay,Americas,2002,75.307,3363085,7727.002004,83.1,8.14,124.619,25.88269,5.001928,28 121 | Venezuela,Americas,2002,72.766,24287670,8605.047831,34.3,8.23,124.9664,27.32819,4.802348,27 122 | Vietnam,Asia,2002,73.017,80908147,1764.456677,16.2,3.77,121.4875,20.47612,4.565701,2.5 123 | Zambia,Africa,2002,39.193,10595811,1071.613938,13,3.85,129.7063,22.45268,4.455595,5 124 | Zimbabwe,Africa,2002,39.989,11926563,672.0386227,19,5.08,130.9397,24.65855,4.40654,4.4 125 | -------------------------------------------------------------------------------- /data/lactoferrin.csv: -------------------------------------------------------------------------------- 1 | conc,growth 2 | 1,13.3222079379203 3 | 2,10.8305328089083 
4 | 3,11.1765715064331 5 | 4,8.98395336946397 6 | 5,8.1206954901671 7 | 6,8.04378381984629 8 | 7,5.24744940390419 9 | 8,7.2136631215058 10 | 9,2.27301595493662 11 | 10,2.13181537401631 12 | -------------------------------------------------------------------------------- /data/pollution.csv: -------------------------------------------------------------------------------- 1 | pollution,temp,industry,population,wind,rain,rainy.days 2 | 24,61.5,368,497,9.1,48.34,115 3 | 30,55.6,291,593,8.3,43.11,123 4 | 56,55.9,775,622,9.5,35.89,105 5 | 28,51,137,176,8.7,15.17,89 6 | 14,68.4,136,529,8.8,54.47,116 7 | 46,47.6,44,116,8.8,33.36,135 8 | 9,66.2,641,844,10.9,35.94,78 9 | 35,49.9,1064,1513,10.1,30.96,129 10 | 26,57.8,197,299,7.6,42.59,115 11 | 61,50.4,347,520,9.4,36.22,147 12 | 29,57.3,434,757,9.3,38.98,111 13 | 28,52.3,361,746,9.7,38.74,121 14 | 14,51.5,181,347,10.9,30.18,98 15 | 18,59.4,275,448,7.9,46,119 16 | 17,51.9,454,515,9,12.95,86 17 | 23,54,462,453,7.1,39.04,132 18 | 47,55,625,905,9.6,41.31,111 19 | 13,61,91,132,8.2,48.52,100 20 | 31,55.2,35,71,6.6,40.75,148 21 | 12,56.7,453,716,8.7,20.66,67 22 | 10,70.3,213,582,6,7.05,36 23 | 110,50.6,3344,3369,10.4,34.44,122 24 | 56,49.1,412,158,9,43.37,127 25 | 10,68.9,721,1233,10.8,48.19,103 26 | 69,54.6,1692,1950,9.6,39.93,115 27 | 8,56.6,125,277,12.7,30.58,82 28 | 36,54,80,80,9,40.25,114 29 | 16,45.7,569,717,11.8,29.07,123 30 | 29,51.1,379,531,9.4,38.79,164 31 | 29,43.5,669,744,10.6,25.94,137 32 | 65,49.7,1007,751,10.9,34.99,155 33 | 9,68.3,204,361,8.4,56.77,113 34 | 10,75.5,207,335,9,59.8,128 35 | 26,51.5,266,540,8.6,37.01,134 36 | 31,59.3,96,308,10.6,44.68,116 37 | 10,61.6,337,624,9.2,49.1,105 38 | 11,47.1,391,463,12.4,36.11,166 39 | 14,54.5,381,507,10,37,99 40 | 17,49,104,201,11.2,30.85,103 41 | 11,56.8,46,244,8.9,7.77,58 42 | 94,50,343,179,10.6,42.75,125 43 | -------------------------------------------------------------------------------- /data/protein-expression.csv: -------------------------------------------------------------------------------- 1 | A,B,C,D,E 2 | 0.4,0.26,0.24,1.04,0.74 3 | 1.5,0.47,0.25,2.78,0.99 4 | 0.98,0.42,1.01,0.82,1.26 5 | 0.33,0.64,0.77,1.65,1.5 6 | 0.75,0.32,0.47,0.49,0.3 7 | 1.48,0.65,0.47,0.97,0.34 8 | 1.18,0.43,0.46,1.39,0.77 9 | 0.33,0.67,0.65,3.24,1.94 10 | 1.42,0.43,0.41,1.12,2.62 11 | 2.09,0.7,0.81,2.82,1.42 12 | 1.37,0.79,1.2,1.27,0.73 13 | 1.23,0.89,1.08,1.6,2.09 14 | ,,0.34,1.98,1.52 15 | ,,1.98,9.32,1.67 16 | ,,1.39,2.31,3.4 17 | ,,1.12,4.19,2.16 18 | ,,3.14,1.73,2.31 19 | ,,2.78,5.16,1.32 20 | -------------------------------------------------------------------------------- /data/students.csv: -------------------------------------------------------------------------------- 1 | "day","cases" 2 | 1,6 3 | 2,8 4 | 3,12 5 | 3,9 6 | 4,3 7 | 4,3 8 | 4,11 9 | 6,5 10 | 7,7 11 | 8,3 12 | 8,8 13 | 8,4 14 | 8,6 15 | 12,8 16 | 14,3 17 | 15,6 18 | 17,3 19 | 17,2 20 | 17,2 21 | 18,6 22 | 19,3 23 | 19,7 24 | 20,7 25 | 23,2 26 | 23,2 27 | 23,8 28 | 24,3 29 | 24,6 30 | 25,5 31 | 26,7 32 | 27,6 33 | 28,4 34 | 29,4 35 | 34,3 36 | 36,3 37 | 36,5 38 | 42,3 39 | 42,3 40 | 43,3 41 | 43,5 42 | 44,3 43 | 44,5 44 | 44,6 45 | 44,3 46 | 45,3 47 | 46,3 48 | 48,3 49 | 48,2 50 | 49,3 51 | 49,1 52 | 53,3 53 | 53,3 54 | 53,5 55 | 54,4 56 | 55,4 57 | 56,3 58 | 56,5 59 | 58,4 60 | 60,3 61 | 63,5 62 | 65,3 63 | 67,4 64 | 67,2 65 | 68,3 66 | 71,3 67 | 71,1 68 | 72,3 69 | 72,2 70 | 72,5 71 | 73,4 72 | 74,3 73 | 74,0 74 | 74,3 75 | 75,3 76 | 75,4 77 | 80,0 78 | 81,3 79 | 81,3 80 | 81,4 81 | 81,0 82 | 88,2 83 | 88,2 84 | 90,1 85 | 93,1 86 | 93,2 87 | 
94,0 88 | 95,2 89 | 95,1 90 | 95,1 91 | 96,0 92 | 96,0 93 | 97,1 94 | 98,1 95 | 100,2 96 | 101,2 97 | 102,1 98 | 103,1 99 | 104,1 100 | 105,1 101 | 106,0 102 | 107,0 103 | 108,0 104 | 109,1 105 | 110,1 106 | 111,0 107 | 112,0 108 | 113,0 109 | 114,0 110 | 115,0 111 | -------------------------------------------------------------------------------- /data/treatments.txt: -------------------------------------------------------------------------------- 1 | Control Treatment 1 Treatment 2 Treatment 3 2 | GS 54. 43. 78. 111. 3 | JM 23. 34. 65. 99. 4 | HM 45. 65. 99. 78. 5 | DR 54. 77. 79. 90. 6 | PS 45. 46. 87. 95. -------------------------------------------------------------------------------- /glm+.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "GLM with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | 25 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 26 | # change working directory: should be the directory containg the Markdown files: 27 | # setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/20200310/Practicals/") 28 | 29 | # install gamlss package if needed 30 | # install.packages("gamlss") 31 | ``` 32 | 33 | # Section 1: Logistic regression 34 | 35 | We will analyse the data collected by Jones (Unpublished BSc dissertation, University of Southampton, 1975). The aim of the study was to define if the probability of having Bronchitis is influenced by smoking and/or pollution. 36 | 37 | The data are stored under data/Bronchitis.csv and contains information on 212 participants. 38 | 39 | 40 | ### Section 1.1: importation and descriptive analysis 41 | 42 | Lets starts by 43 | 44 | * importing the data set *Bronchitis* with the function `read.csv()` 45 | * displaying _bron_ (a dichotomous variable which equals 1 for participants having bronchitis and 0 otherwise) as a function of _cigs_, the number of cigarettes smoked daily. 46 | 47 | 48 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 49 | Bronchitis = read.csv("data/Bronchitis.csv",header=TRUE) 50 | plot(Bronchitis$cigs,Bronchitis$bron,col="blue4", 51 | ylab = "Absence/Presense of Bronchitis", xlab = "Daily number of cigarettes") 52 | abline(h=c(0,1),col="light blue") 53 | ``` 54 | 55 | # Section 1.2: Model fit 56 | 57 | Lets 58 | 59 | * fit a logistic model by means the function `glm()` and by means of the function `gamlss()` of the library `gamlss`. 60 | * display and analyse the results of the `glm` function : Use the function `summary()` to display the results of an R object of class `glm`. 
61 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 62 | fit.glm = glm(bron~cigs,data=Bronchitis,family=binomial) 63 | 64 | library(gamlss) 65 | fit.gamlss = gamlss(bron~cigs,data=Bronchitis,family=BI) 66 | 67 | summary(fit.glm) 68 | ``` 69 | 70 | Let's now define the estimated probability of having bronchitis for any number of daily smoked cigarette and display the corresponding logistic curve on a plot: 71 | 72 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 73 | plot(Bronchitis$cigs,Bronchitis$bron,col="blue4", 74 | ylab = "Absence/Presense of Bronchitis", xlab = "Daily number of cigarettes") 75 | abline(h=c(0,1),col="light blue") 76 | 77 | axe.x = seq(0,40,length=1000) 78 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])/(1+exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])) 79 | lines(axe.x,f.x,col="pink2",lwd=2) 80 | ``` 81 | 82 | ## Section 1.3: Model selection 83 | 84 | As for linear models, model selection may be done by means of the function `anova()` used on the glm object of interest. 85 | 86 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 87 | anova(fit.glm,test="LRT") 88 | ``` 89 | 90 | ## Section 1.3: Model check 91 | 92 | Lets assess is the model fit seems satisfactory by means 93 | 94 | * of the analysis of deviance residuals (function `plot()` on an object of class `glm`, 95 | * of the analysis of randomised normalised quantile residuals (function `plot()` on an object of class `gamlss`, 96 | 97 | 98 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 99 | # deviance 100 | par(mfrow=c(2,2),mar=c(3,5,3,0)) 101 | plot(fit.glm) 102 | # randomised normalised quantile residuals 103 | plot(gamlss(bron~cigs,data=Bronchitis,family=BI)) 104 | ``` 105 | 106 | ## Section 1.4: Fun 107 | 108 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 109 | # long format: 110 | long = data.frame(mi = rep(c("MI","No MI"),c(104+189,11037+11034)), 111 | treatment = rep(c("Aspirin","Placebo","Aspirin","Placebo"),c(104,189,11037,11034))) 112 | # short format: 2 by 2 table 113 | table2by2 = table(long$treatment,long$mi) 114 | 115 | # 116 | chisq.test(table2by2) 117 | prop.test(table2by2[,"MI"],apply(table2by2,1,sum)) 118 | summary(glm(I(mi=="MI")~treatment,data=long,family="binomial")) 119 | ``` 120 | 121 | 122 | # Section 2: Poisson regression 123 | 124 | The dataset *students.csv* shows the number of high school students diagnosed with an infectious disease for each day from the initial disease outbreak. 125 | 126 | # Section 2.2: Importation 127 | 128 | Lets 129 | 130 | * import the dataset by means of the function `read.csv()` 131 | * display the daily number of students diagnosed with the disease (variable `cases`) as a function of the days since the outbreak (variable `day`). 132 | 133 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 134 | students = read.csv("data/students.csv",header=TRUE) 135 | plot(students$day,students$cases,col="blue4", 136 | ylab = "Number of diagnosed students", xlab = "Days since initial outbreak") 137 | abline(h=c(0),col="light blue") 138 | ``` 139 | 140 | # Section 2.2: Model fit 141 | 142 | Lets 143 | 144 | * fit a poisson model by means the function `glm()` and by means of the function `gamlss()` of the library `gamlss`. 145 | * display and analyse the results of the `glm` function : Use the function `summary()` to display the results of an R object of class `glm`. 
146 | 147 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 148 | fit.glm = glm(cases~day,data=students,family=poisson) 149 | 150 | library(gamlss) 151 | fit.gamlss = gamlss(cases~day,data=students,,family=PO) 152 | 153 | summary(fit.glm) 154 | ``` 155 | 156 | Let's now define the estimated probability of having bronchitis for any number of daily smoked cigarette and display the corresponding logistic curve on a plot: 157 | 158 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 159 | plot(students$day,students$cases,col="blue4", 160 | ylab = "Number of diagnosed students", xlab = "Days since initial outbreak") 161 | abline(h=c(0),col="light blue") 162 | 163 | axe.x = seq(0,120,length=1000) 164 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2]) 165 | lines(axe.x,f.x,col="pink2",lwd=2) 166 | ``` 167 | 168 | ## Section 2.3: Model selection 169 | 170 | As for linear models, model selection may be done by means of the function `anova()` used on the glm object of interest. 171 | 172 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 173 | anova(fit.glm,test="LRT") 174 | ``` 175 | 176 | ## Section 2.3: Model check 177 | 178 | Lets assess is the model fit seems satisfactory by means 179 | 180 | * of the analysis of deviance residuals (function `plot()` on an object of class `glm`, 181 | * of the analysis of randomised normalised quantile residuals (function `plot()` on an object of class `gamlss`, 182 | 183 | 184 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 185 | # deviance 186 | par(mfrow=c(2,2),mar=c(3,5,3,0)) 187 | plot(fit.glm) 188 | # randomised normalised quantile residuals 189 | plot(fit.gamlss) 190 | ``` 191 | 192 | 193 | 194 | # Section 6: Practicals 195 | 196 | 197 | ### (i) *Bronchitis.csv* 198 | Analyse further the Bronchitis data of Jones (1975) by 199 | 200 | * first investigating if the probability of having bronchitis also depends on _pollution_ (variable `poll`), 201 | * second investigating if there is an interaction between the variables `cigs` and `poll`. 202 | 203 | 204 | Lets plot the data first. 205 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 206 | Bronchitis = read.csv("data/Bronchitis.csv",header=TRUE) 207 | # plot 208 | plot(Bronchitis$poll,Bronchitis$bron,col="blue4", 209 | ylab = "Absence/Presense of Bronchitis", xlab = "Pollution level") 210 | abline(h=c(0,1),col="light blue") 211 | ``` 212 | No obvious relationship between pollution and bronchitis is visible by means of this plot. 213 | 214 | Lets fit a model assuming that the probability of getting bronchitis is a function of the pollution level. 215 | 216 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 217 | # fit1: 218 | fit.glm = glm(bron~poll,data=Bronchitis,family=binomial) 219 | summary(fit.glm) 220 | ``` 221 | 222 | The intercept of the previous fit allows to define the probability of getting bronchitis when the level of pollution equals 0. 
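As a quick aside (not part of the original practical), the intercept reported by `summary()` is on the log-odds scale; the inverse-logit function `plogis()` maps it back to a probability, here the fitted probability of bronchitis at a pollution level of 0:

```{r message = FALSE, warning = FALSE, echo = TRUE}
# fitted probability of bronchitis when poll = 0, implied by fit.glm above;
# plogis(x) is exp(x)/(1+exp(x)), the inverse of the logit link
plogis(coef(fit.glm)["(Intercept)"])
```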
223 | As a zero level of pollution is (i) out of range (ii) not a realistic value, we will 224 | 225 | * create the variable `poll_centered` defined as the pollution level minus the mean so that the intercept corresponds to the probability of getting bronchitis for an average pollution level in Cardiff, 226 | * refit the model 227 | 228 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 229 | # fit2: 230 | Bronchitis$poll_centered = Bronchitis$poll-mean(Bronchitis$poll) 231 | fit.glm = glm(bron~poll_centered,data=Bronchitis,family=binomial) 232 | library(gamlss) 233 | fit.gamlss = gamlss(bron~poll_centered,data=Bronchitis,family=BI) 234 | summary(fit.glm) 235 | ``` 236 | 237 | Lets 238 | 239 | * perform a model check but plotting the randomised quantile residuals of gamlss a few times 240 | * plot the fitted probabilities 241 | 242 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 243 | # model check 244 | plot(gamlss(bron~poll_centered,data=Bronchitis,family=BI)) 245 | plot(gamlss(bron~poll_centered,data=Bronchitis,family=BI)) 246 | plot(gamlss(bron~poll_centered,data=Bronchitis,family=BI)) 247 | 248 | # plot fit 249 | plot(Bronchitis$poll_centered,Bronchitis$bron,col="blue4", 250 | ylab = "Absence/Presense of Bronchitis", xlab = "Pollution level") 251 | abline(h=c(0,1),col="light blue") 252 | axe.x = seq(-10,10,length=1000) 253 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])/(1+exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])) 254 | lines(axe.x,f.x,col="pink2",lwd=2) 255 | ``` 256 | Model check suggests a good fit. Lets finally check if the interaction is significant: 257 | 258 | 259 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 260 | # interaction ? 261 | fit.glm = glm(bron~poll_centered*cigs,data=Bronchitis,family=binomial) 262 | summary(fit.glm) 263 | anova(fit.glm,test="LRT") 264 | ``` 265 | Interaction is not significant. 266 | 267 | 268 | 269 | ### (ii) *myocardialinfarction.csv* 270 | 271 | The file *myocardialinfarction.csv* indicates if a participant had a myocardial infarction attack (variable `infarction`) as well the participant's treatment (variable `treatment`). 272 | 273 | Does _Aspirin_ decrease the probability to have a myocardial infarction attack ? 
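Before any modelling, a simple cross-tabulation already gives a first impression (a minimal sketch; it only assumes the columns `infarction` and `treatment` described above):

```{r message = FALSE, warning = FALSE, echo = TRUE}
# raw counts and row-wise proportions of attacks per treatment group
mi = read.csv("data/myocardialinfarction.csv")
table(mi$treatment, mi$infarction)
prop.table(table(mi$treatment, mi$infarction), margin = 1)
```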
274 | 275 | Lets (i) import the dataset, (ii) change the levels of the factor `treatment` so that `Placebo` corresponds to the reference group, (iii) and finally plot the (sample) probabilities to get an attack by treatment group 276 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 277 | # import 278 | myocardialinfarction = read.csv("data/myocardialinfarction.csv") 279 | # by default, Aspirin is the reference group as the alphabetic order is used 280 | myocardialinfarction$treatment = factor(myocardialinfarction$treatment, 281 | levels=c("Placebo","Aspirin")) 282 | # plot 283 | par(mfrow=c(1,1),mar=c(3,4,3,1)) 284 | pi.group = tapply(myocardialinfarction$infarction=="attack",myocardialinfarction$treatment,mean) 285 | table.group = tapply(myocardialinfarction$infarction=="attack",myocardialinfarction$treatment,table) 286 | temp = barplot(pi.group,plot=FALSE) 287 | barplot(pi.group,ylab="Probability",xlab="", 288 | main = "Probability of myocardial infarction\n per treatment group",names=rep("",2), 289 | cex.axis=.6,axes=FALSE,cex.main=1.4) 290 | axis(2,las=2,cex.axis=.8) 291 | axis(1,temp[,1],names(pi.group),cex.axis=1.25,tick=FALSE) 292 | for(gw in 1:length(pi.group)){ 293 | text(temp[gw],pi.group[gw]/2,.p(table.group[[gw]]["TRUE"]," / ",sum(table.group[[gw]])), 294 | col="red",cex=1.5) 295 | } 296 | ``` 297 | The barplot seems to suggest that the treatment (aspirin) reduces the risk of myocardial infarction. Lets fit a logistic model to assess if this difference is significant. 298 | Note that, in this case (a dichotomous outcome and a dichotomous predictor), a test of equality of proportions or an independence test could also do the job. 299 | With a logistic model, other predictors could easily be added to the model and the beta parameter corresponding to the treatment can be interpreted by means of odd ratios (or relative risk ratios when prevalences are *small*, as we will note at the end of this practical). 300 | 301 | 302 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 303 | # model fit 304 | fit.glm = glm(I(infarction=="attack")~treatment,data=myocardialinfarction,family=binomial) 305 | summary(fit.glm) 306 | # test of equality of (independent) proportions 307 | prop.test(unlist(lapply(table.group,function(x)x[2])),unlist(lapply(table.group,sum))) 308 | # test of independence 309 | chisq.test(matrix(unlist(table.group),ncol=2)) 310 | ``` 311 | The three methods lead to the same conclusion: there is a significant difference between the probabilities of having a myocardial infarction of the two treatment groups. 312 | Note that the two last methods get exactly the same results (as they use the same test X-squared statsitic). 313 | 314 | Lets define the fitted probabilities to get an attack and compare them to the sample probabilities: they should match: 315 | 316 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 317 | # pi_placebo: 318 | pi_placebo = exp(-4.04971)/(1+exp(-4.04971)) 319 | pi_placebo 320 | pi.group[1] 321 | # pi_aspirin 322 | pi_aspirin = exp( -4.04971-0.60544)/(1+exp( -4.04971-0.60544)) 323 | pi_aspirin 324 | pi.group[2] 325 | ``` 326 | 327 | Finally, note that when prevalence are small, the exponential of the logistic regression corresponding to the treatment **may** also be interpreted as relative risk ratios. 
Indeed: 328 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 329 | # interpreation of exp(beta1) when prevalences are small 330 | pi_aspirin/pi_placebo 331 | exp(-0.60544) 332 | c(exp(-0.60544-qnorm(.975)*0.12284),exp(-0.60544+qnorm(.975)*0.12284)) 333 | ``` 334 | Thus, aspirin strongly reduces the risk of myocardial infarction. 335 | 336 | 337 | 338 | 339 | ### (ii) *crabs.csv* 340 | 341 | This data set is derived from Agresti (2007, Table 3.2, pp.76-77). It gives 6 variables for each of 173 female horseshoe crabs: 342 | 343 | * Explanatory variables that are thought to affect this included the female crab’s color (C), spine condition (S), weightweight (Wt) 344 | * C: the crab's colour, 345 | * S: the crab's spine condition, 346 | * Wt: the crab's weight, 347 | * W: the crab's carapace width, 348 | * Sa: the response outcome, i.e., the number of satellites. 349 | 350 | Check if the width of female's back can explain the number of satellites attached by fitting a Poisson regression model with width. 351 | 352 | 353 | Lets import the datasset, fit a poisson loglinear model, plot the fit and perfom a model check : 354 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 355 | crabs = read.csv("data/crab.csv",header=TRUE) 356 | 357 | # plot: 358 | plot(crabs$W,crabs$Sa,col="blue4", 359 | ylab = "Number of satellites", xlab = "width of female's back") 360 | abline(h=c(0),col="light blue") 361 | 362 | 363 | # fit 364 | fit.glm = glm(Sa~W,data=crabs,family=poisson) 365 | library(gamlss) 366 | 367 | 368 | # plot fit: 369 | plot(crabs$W,crabs$Sa,col="blue4", 370 | ylab = "Number of satellites", xlab = "width of female's back") 371 | abline(h=c(0),col="light blue") 372 | 373 | axe.x = seq(15,40,length=1000) 374 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2]) 375 | lines(axe.x,f.x,col="pink2",lwd=2) 376 | # not a great fit... 377 | 378 | # model check 379 | plot(gamlss(Sa~W,data=crabs,family=PO)) 380 | plot(gamlss(Sa~W,data=crabs,family=PO)) 381 | plot(gamlss(Sa~W,data=crabs,family=PO)) 382 | 383 | # confirm lack of fit -> bin the estimates 384 | 385 | # 2 alternative models 386 | plot(gamlss(Sa~W,data=crabs,family=ZIP)) 387 | plot(gamlss(Sa~W,data=crabs,family=NBI)) 388 | # check ?ZIP and ?NBI for detail 389 | ``` 390 | Reasonably, there is a lack of fit> the estimates are not to be trusted. 391 | 392 | 393 | 394 | -------------------------------------------------------------------------------- /glm.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "GLM with R" 3 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri " 4 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 5 | output: 6 | html_document: 7 | theme: united 8 | highlight: tango 9 | code_folding: show 10 | toc: true 11 | toc_depth: 2 12 | toc_float: true 13 | fig_width: 8 14 | fig_height: 6 15 | --- 16 | 17 | 18 | 19 | 20 | 21 | 22 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 23 | # change working directory: should be the directory containg the Markdown files: 24 | # setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/20210514/Practicals/") 25 | 26 | # install gamlss package if needed 27 | # install.packages("gamlss") 28 | ``` 29 | 30 | # Section 1: Logistic regression 31 | 32 | We will analyse the data collected by Jones (Unpublished BSc dissertation, University of Southampton, 1975). The aim of the study was to define if the probability of having Bronchitis is influenced by smoking and/or pollution. 
33 | 34 | The data are stored under data/Bronchitis.csv and contains information on 212 participants. 35 | 36 | 37 | ### Section 1.1: importation and descriptive analysis 38 | 39 | Lets starts by 40 | 41 | * importing the data set *Bronchitis* with the function `read.csv()` 42 | * displaying _bron_ (a dichotomous variable which equals 1 for participants having bronchitis and 0 otherwise) as a function of _cigs_, the number of cigarettes smoked daily. 43 | 44 | 45 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 46 | Bronchitis = read.csv("data/Bronchitis.csv",header=TRUE) 47 | plot(Bronchitis$cigs,Bronchitis$bron,col="blue4", 48 | ylab = "Absence/Presense of Bronchitis", xlab = "Daily number of cigarettes") 49 | abline(h=c(0,1),col="light blue") 50 | ``` 51 | 52 | # Section 1.2: Model fit 53 | 54 | Lets 55 | 56 | * fit a logistic model by means the function `glm()` and by means of the function `gamlss()` of the library `gamlss`. 57 | * display and analyse the results of the `glm` function : Use the function `summary()` to display the results of an R object of class `glm`. 58 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 59 | fit.glm = glm(bron~cigs,data=Bronchitis,family=binomial) 60 | 61 | library(gamlss) 62 | fit.gamlss = gamlss(bron~cigs,data=Bronchitis,family=BI) 63 | 64 | summary(fit.glm) 65 | ``` 66 | 67 | Let's now define the estimated probability of having bronchitis for any number of daily smoked cigarette and display the corresponding logistic curve on a plot: 68 | 69 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 70 | plot(Bronchitis$cigs,Bronchitis$bron,col="blue4", 71 | ylab = "Absence/Presense of Bronchitis", xlab = "Daily number of cigarettes") 72 | abline(h=c(0,1),col="light blue") 73 | 74 | axe.x = seq(0,40,length=1000) 75 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])/(1+exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2])) 76 | lines(axe.x,f.x,col="pink2",lwd=2) 77 | ``` 78 | 79 | ## Section 1.3: Model selection 80 | 81 | As for linear models, model selection may be done by means of the function `anova()` used on the glm object of interest. 82 | 83 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 84 | anova(fit.glm,test="LRT") 85 | ``` 86 | 87 | ## Section 1.3: Model check 88 | 89 | Lets assess is the model fit seems satisfactory by means 90 | 91 | * of the analysis of deviance residuals (function `plot()` on an object of class `glm`, 92 | * of the analysis of randomised normalised quantile residuals (function `plot()` on an object of class `gamlss`, 93 | 94 | 95 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 96 | # deviance 97 | par(mfrow=c(2,2),mar=c(3,5,3,0)) 98 | plot(fit.glm) 99 | # randomised normalised quantile residuals 100 | plot(gamlss(bron~cigs,data=Bronchitis,family=BI)) 101 | ``` 102 | 103 | ## Section 1.4: Fun 104 | 105 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 106 | # long format: 107 | long = data.frame(mi = rep(c("MI","No MI"),c(104+189,11037+11034)), 108 | treatment = rep(c("Aspirin","Placebo","Aspirin","Placebo"),c(104,189,11037,11034))) 109 | # short format: 2 by 2 table 110 | table2by2 = table(long$treatment,long$mi) 111 | 112 | # 113 | chisq.test(table2by2) 114 | prop.test(table2by2[,"MI"],apply(table2by2,1,sum)) 115 | summary(glm(I(mi=="MI")~treatment,data=long,family="binomial")) 116 | ``` 117 | 118 | 119 | # Section 2: Poisson regression 120 | 121 | The dataset *students.csv* shows the number of high school students diagnosed with an infectious disease for each day from the initial disease outbreak. 
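Because the response is a count, a Poisson GLM with a log link, where the expected count is `exp(b0 + b1*day)`, is a natural candidate. As an optional sketch (not part of the original material), the Poisson assumption that the variance roughly tracks the mean can be eyeballed by binning the days:

```{r message = FALSE, warning = FALSE, echo = TRUE}
# compare mean and variance of the daily counts within coarse bins of 'day';
# for Poisson-like data the two should be of similar magnitude
students = read.csv("data/students.csv", header = TRUE)
bins = cut(students$day, breaks = 4)
tapply(students$cases, bins, mean)
tapply(students$cases, bins, var)
```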
122 | 123 | # Section 2.2: Importation 124 | 125 | Lets 126 | 127 | * import the dataset by means of the function `read.csv()` 128 | * display the daily number of students diagnosed with the disease (variable `cases`) as a function of the days since the outbreak (variable `day`). 129 | 130 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 131 | students = read.csv("data/students.csv",header=TRUE) 132 | plot(students$day,students$cases,col="blue4", 133 | ylab = "Number of diagnosed students", xlab = "Days since initial outbreak") 134 | abline(h=c(0),col="light blue") 135 | ``` 136 | 137 | # Section 2.2: Model fit 138 | 139 | Lets 140 | 141 | * fit a poisson model by means the function `glm()` and by means of the function `gamlss()` of the library `gamlss`. 142 | * display and analyse the results of the `glm` function : Use the function `summary()` to display the results of an R object of class `glm`. 143 | 144 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 145 | fit.glm = glm(cases~day,data=students,family=poisson) 146 | 147 | library(gamlss) 148 | fit.gamlss = gamlss(cases~day,data=students,,family=PO) 149 | 150 | summary(fit.glm) 151 | ``` 152 | 153 | Let's now define the estimated probability of having bronchitis for any number of daily smoked cigarette and display the corresponding logistic curve on a plot: 154 | 155 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 156 | plot(students$day,students$cases,col="blue4", 157 | ylab = "Number of diagnosed students", xlab = "Days since initial outbreak") 158 | abline(h=c(0),col="light blue") 159 | 160 | axe.x = seq(0,120,length=1000) 161 | f.x = exp(fit.glm$coef[1]+axe.x*fit.glm$coef[2]) 162 | lines(axe.x,f.x,col="pink2",lwd=2) 163 | ``` 164 | 165 | ## Section 2.3: Model selection 166 | 167 | As for linear models, model selection may be done by means of the function `anova()` used on the glm object of interest. 168 | 169 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 170 | anova(fit.glm,test="LRT") 171 | ``` 172 | 173 | ## Section 2.3: Model check 174 | 175 | Lets assess is the model fit seems satisfactory by means 176 | 177 | * of the analysis of deviance residuals (function `plot()` on an object of class `glm`, 178 | * of the analysis of randomised normalised quantile residuals (function `plot()` on an object of class `gamlss`, 179 | 180 | 181 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 182 | # deviance 183 | par(mfrow=c(2,2),mar=c(3,5,3,0)) 184 | plot(fit.glm) 185 | # randomised normalised quantile residuals 186 | plot(fit.gamlss) 187 | ``` 188 | 189 | 190 | 191 | # Section 6: Practicals 192 | 193 | 194 | ### (i) *Bronchitis.csv* 195 | Analyse further the Bronchitis data of Jones (1975) by 196 | 197 | * first investigating if the probability of having bronchitis also depends on _pollution_ (variable `poll`), 198 | * second investigating if there is an interaction between the variables `cigs` and `poll`. 199 | 200 | 201 | ### (ii) *myocardialinfarction.csv* 202 | 203 | The file *myocardialinfarction.csv* indicates if a participant had a myocardial infarction attack (variable `infarction`) as well the participant's treatment (variable `treatment`). 204 | 205 | Does _Aspirin_ decrease the probability to have a myocardial infarction attack ? 206 | 207 | 208 | ### (ii) *crabs.csv* 209 | 210 | This data set is derived from Agresti (2007, Table 3.2, pp.76-77). 
It gives 6 variables for each of 173 female horseshoe crabs: 211 | 212 | * Explanatory variables that are thought to affect this included the female crab’s color (C), spine condition (S), weightweight (Wt) 213 | * C: the crab's colour, 214 | * S: the crab's spine condition, 215 | * Wt: the crab's weight, 216 | * W: the crab's carapace width, 217 | * Sa: the response outcome, i.e., the number of satellites. 218 | 219 | Check if the width of female's back can explain the number of satellites attached by fitting a Poisson regression model with width. 220 | 221 | 222 | -------------------------------------------------------------------------------- /glm.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/glm.pdf -------------------------------------------------------------------------------- /images/examplePlots.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/images/examplePlots.png -------------------------------------------------------------------------------- /images/plot-char.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/images/plot-char.png -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | ## Introduction to Linear Modelling with R 2 | 3 | ## Description 4 | 5 | The course will cover ANOVA, linear regression and some extensions. It will be a mixture of lectures and hands-on time using RStudio to analyse data. 6 | 7 | [Timetable](timetable.pdf) 8 | 9 | # Aims: During this course you will learn about: 10 | 11 | - ANOVA 12 | - Simple and multiple regression 13 | - Generalised Linear Models 14 | - Introduction to more advanced topics, like non-linear models and time series. 15 | 16 | # Objectives: After this course you should be able to 17 | 18 | - Realise the connection between t-tests, ANOVA and linear regression 19 | - Fit a linear regression 20 | - Check if the assumptions of linear regression are met by the data and what to do if they are not 21 | - Know when linear regression is not appropriate and have an idea of which alternative method might be appropriate 22 | - Know when you need to seek help with analysis as the data structure is too complex for the methods taught 23 | 24 | # Course Data 25 | 26 | - Please Download [this zip file](Course_Data.zip) to have all the datasets and R files used in this course 27 | 28 | # Feedback 29 | - After the course, please fill in this [feedback form](https://www.surveymonkey.co.uk/r/LINMODMARCH). Thank you. 30 | 31 | # Other courses 32 | - The CRUK-CI Bioinformatics Core facility run a catalogue of courses. [Please visit for more details](https://www.cruk.cam.ac.uk/core-facilities/bioinformatics-core/programme). 
33 | 34 | # Materials 35 | 36 | - Course Introduction 37 | + Tutorial [HTML](r-recap.nb.html) 38 | + Tutorial [R markdown](r-recap.Rmd) 39 | + Cheat Sheet [PDF](cheat_sheet.pdf) 40 | - ANOVA 41 | + Slides [PDF](anova.pdf) 42 | + Tutorial [HTML](anova.html) 43 | + Tutorial [R markdown](anova.Rmd) 44 | - Simple Regression 45 | + Slides [PDF](simple_regression.pdf) 46 | + Tutorial [HTML](simple_regression.html) 47 | + Tutorial [R markdown](simple_regression.Rmd) 48 | - Multiple Regression 49 | + Slides [PDF](multiple_regression.pdf) 50 | + Tutorial [HTML](multiple_regression.html) 51 | + Tutorial [R markdown](multiple_regression.Rmd) 52 | - Generalised Linear Models 53 | + Slides [PDF](glm.pdf) 54 | + Tutorial [HTML](glm.html) 55 | + Tutorial [R markdown](glm.Rmd) 56 | - Time Series Models 57 | + Slides [PDF](time_series.pdf) 58 | + Tutorial [HTML](time_series_analysis.html) 59 | + Tutorial [R markdown](time_series_analysis.Rmd) 60 | 61 | 62 | 63 | # Pre-requisites 64 | 65 | **This course assumes basic knowledge of statistics and use of R** , which would be obtained from our Introductory Statistics Course and an "Introduction to R for Solving Biological Problems" run at the Genetics department (or equivalent). 66 | 67 | - [Introduction to Solving Biological Problems with R](http://cambiotraining.github.io/r-intro/) 68 | - [Introduction to Statistical Analysis](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/) 69 | 70 | ## Going further 71 | - [Transforming data](http://rcompanion.org/handbook/I_12.html) 72 | 73 | -------------------------------------------------------------------------------- /install.R: -------------------------------------------------------------------------------- 1 | options(repos = c("CRAN" = "http://cran.ma.imperial.ac.uk")) 2 | 3 | install.packages(c("MASS", "gamlss")) 4 | -------------------------------------------------------------------------------- /logos/CRUK_CI_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/logos/CRUK_CI_logo.png -------------------------------------------------------------------------------- /logos/LMB_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/logos/LMB_logo.png -------------------------------------------------------------------------------- /logos/LMB_logo_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/logos/LMB_logo_small.png -------------------------------------------------------------------------------- /logos/Logos.txt: -------------------------------------------------------------------------------- 1 | Folder for organisational logos 2 | -------------------------------------------------------------------------------- /multiple_regression+.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Multiple Regression with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. 
Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | # Section 1: Multiple Regression 31 | 32 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Now we will expand this by considering `Height` as another predictor. 33 | 34 | Start by plotting the dataset: 35 | ```{r} 36 | plot(trees) 37 | ``` 38 | 39 | This plots all variables against each other, enabling visual information about correlations within the dataset. 40 | 41 | Re-create the original model of `Volume` against `Girth`: 42 | ```{r} 43 | m1 = lm(Volume~Girth,data=trees) 44 | summary(m1) 45 | ``` 46 | 47 | Now include `Height` as an additional variable: 48 | ```{r} 49 | m2 = lm(Volume~Girth+Height,data=trees) 50 | summary(m2) 51 | ``` 52 | 53 | Note that the R^2 has improved, yet the `Height` term is less significant than the other two parameters. 54 | 55 | Try including the interaction term between `Girth` and `Height`: 56 | ```{r} 57 | m3 = lm(Volume~Girth*Height,data=trees) 58 | summary(m3) 59 | ``` 60 | 61 | All terms are highly significant. Note that the `Height` is more significant than in the previous model, despite the introduction of an additional parameter. 62 | 63 | We'll now try a different functional form - rather than looking for an additive model, we can explore a multiplicative model by applying a log-log transformation (leaving out the interaction term for now). 64 | ```{r} 65 | m4 = lm(log(Volume)~log(Girth)+log(Height),data=trees) 66 | summary(m4) 67 | ``` 68 | 69 | All terms are significant. Note that the residual standard error is much lower than for the previous models. However, this value cannot be compared with the previous models due to transforming the response variable. The R^2 value has increased further, despite reducing the number of parameters from four to three. 70 | ```{r} 71 | confint(m4) 72 | ``` 73 | 74 | Looking at the confidence intervals for the parameters reveals that the estimated power of `Girth` is around 2, and `Height` around 1. This makes a lot of sense, given the well-known dimensional relationship between `Volume`, `Girth` and `Height`! 75 | 76 | For completeness, we'll now add the interaction term. 77 | ```{r} 78 | m5 = lm(log(Volume)~log(Girth)*log(Height),data=trees) 79 | summary(m5) 80 | ``` 81 | 82 | The R^2 value has increased (of course, as all we've done is add an additional parameter), but interestingly none of the four terms are significant. This means that none of the individual terms alone are vital for the model - there is duplication of information between the variables. So we will revert back to the previous model. 83 | 84 | Given that it would be reasonable to expect the power of `Girth` to be 2, and Height to be 1, we will now fix those parameters, and instead just estimate the one remaining parameter. 
85 | ```{r} 86 | m6 = lm(log(Volume)-log((Girth^2)*Height)~1,data=trees) 87 | summary(m6) 88 | ``` 89 | 90 | Note that there is no R^2 (as only the intercept was included in the model), and that the Residual Standard Error is incomparable with previous models due to changing the response variable. 91 | 92 | We can alternatively construct a model with the response being y, and the error term additive rather than multiplicative. 93 | ```{r} 94 | m7 = lm(Volume~0+I(Girth^2):Height,data=trees) 95 | summary(m7) 96 | ``` 97 | 98 | Note that the parameter estimates for the last two models are slightly different... this is due to differences in the error model. 99 | 100 | # Section 2: Model Selection 101 | 102 | Of the last two models, the one with the log-Normal error model would seem to have the more Normal residuals. This can be inspected by looking at diagnostic plots, by and using the `shapiro.test()`: 103 | ```{r} 104 | plot(m6) 105 | plot(m7) 106 | shapiro.test(residuals(m6)) 107 | shapiro.test(residuals(m7)) 108 | ``` 109 | 110 | The Akaike Information Criterion (AIC) can help to make decisions regarding which model is the most appropriate. Now calculate the AIC for each of the above models: 111 | ```{r} 112 | summary(m1) 113 | AIC(m1) 114 | summary(m2) 115 | AIC(m2) 116 | summary(m3) 117 | AIC(m3) 118 | summary(m4) 119 | AIC(m4) 120 | summary(m5) 121 | AIC(m5) 122 | summary(m6) 123 | AIC(m6) 124 | summary(m7) 125 | AIC(m7) 126 | ``` 127 | 128 | Whilst the AIC can help differentiate between similar models, it cannot help deciding between models that have different responses. Which model would you select as the most appropriate? 129 | 130 | # Section 3: Stepwise Regression 131 | 132 | The in-built dataset `swiss` contains data pertaining to fertility, along with a variety of socioeconomic indicators. We want to select a sensible model using stepwise regression. First regress `Fertility` agains all available indicators: 133 | ```{r} 134 | m8 = lm(Fertility~.,data=swiss) 135 | summary(m8) 136 | ``` 137 | 138 | Are all terms significant? 139 | 140 | Now use stepwise regression, performing backward elimination in order to automatically remove inappropriate terms: 141 | ```{r} 142 | library(MASS) 143 | summary(stepAIC(m8)) 144 | ``` 145 | 146 | Are all terms significant? Is this model suitable? What are the pro's and con's of this approach? 147 | 148 | # Section 4: Non-Linear Models 149 | 150 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. 151 | 152 | In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Using Multiple Regression we trialled various models, including some that had multiple predictor variables and/or involved log-log transformations to explore power relationships. 153 | 154 | However, due to limitations of the method, we were not able to explore other options such as a parameterised power relationship with an additive error model. We will now attempt to fit this model: 155 | 156 | $y = \beta_0x_1^{\beta_1}x_2^{\beta_2}+\varepsilon$ 157 | 158 | Parameters for non-linear models may be estimated using the `nls` package in R. 
159 | 160 | ```{r} 161 | volume = trees$Volume 162 | height = trees$Height 163 | girth = trees$Girth 164 | m9 = nls(volume~beta0*girth^beta1*height^beta2,start=list(beta0=1,beta1=2,beta2=1)) 165 | summary(m9) 166 | ``` 167 | Note that the parameters `beta0`, `beta1` and `beta2` weren't defined prior to the function call - `nls` knew what to do with them. Also note that we had to provide starting points for the parameters. What happens if you change them? 168 | 169 | Are all terms significant? Is this model appropriate? What else could be tried to achieve a better model? 170 | 171 | # Section 5: Practical Exercises 172 | 173 | ## Puromycin 174 | 175 | The in-built R dataset `Puromycin` contains data regarding the reaction velocity versus 176 | substrate concentration in an enzymatic reaction involving untreated cells or cells 177 | treated with Puromycin. 178 | 179 | - Plot `conc` (concentration) against `rate`. What is the nature of the relationship 180 | between `conc` and `rate`? 181 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 182 | plot(conc~rate,data=Puromycin) 183 | # There is a non-linear positive relationship between conc and rate 184 | ``` 185 | 186 | - Find a transformation that linearises the data and stabilises the variance, 187 | making it possible to use linear regression. Create the corresponding linear 188 | regression model. Are all terms significant? 189 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 190 | plot(log(conc)~rate,data=Puromycin) 191 | m10 = lm(log(conc)~rate,data=Puromycin) 192 | plot(m10) 193 | summary(m10) 194 | # Both terms are significant 195 | ``` 196 | 197 | - Add the `state` term to the model. What type of variable is this? Is the 198 | inclusion of this term appropriate? 199 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 200 | m11 = lm(log(conc)~rate+state,data=Puromycin) 201 | plot(m11) 202 | summary(m11) 203 | # `state` is a boolean factor or indicator variable 204 | # The inclusion of `state` is appropriate, as the term is significant and the diagnostic plots look reasonable 205 | ``` 206 | 207 | - Now add a term representing the interaction between `rate` and `state`. Are all 208 | terms significant? What can you conclude? 209 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 210 | m12 = lm(log(conc)~rate*state,data=Puromycin) 211 | summary(m12) 212 | # The `state` term is not significant when the interaction between `rate` and `state` is included in the model. So it may be better to remove the `state` term from the model. 213 | ``` 214 | 215 | - Given this information, create the regression model you believe to be the most 216 | appropriate for modelling `conc`. Regenerate the plot of `conc` against `rate`. 217 | Draw curves corresponding to the fitted values of the final model onto this 218 | plot (note that two separate curves should be drawn, corresponding to the 219 | two levels of `state`). 
220 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 221 | m13 = lm(log(conc)~rate+rate:state,data=Puromycin) 222 | summary(m13) 223 | 224 | # Solution one: 225 | plot(conc~rate,data=Puromycin) 226 | idx = order(Puromycin$rate) 227 | treated = Puromycin$state[idx] == "treated" 228 | untreated = Puromycin$state[idx] == "untreated" 229 | lines(exp(fitted(m13))[idx][treated]~Puromycin$rate[idx][treated]) 230 | lines(exp(fitted(m13))[idx][untreated]~Puromycin$rate[idx][untreated],col="red") 231 | 232 | # Solution two (better - more general): 233 | plot(conc~rate,data=Puromycin) 234 | xvals = range(Puromycin$rate)[1]:range(Puromycin$rate)[2] 235 | lines(exp(coef(m13)[1] + coef(m13)[2]*xvals) ~ xvals) 236 | lines(exp(coef(m13)[1] + coef(m13)[2]*xvals + coef(m13)[3]*xvals) ~ xvals, col="red") 237 | ``` 238 | 239 | ## Attitude 240 | 241 | The in-built R dataset `attitude` contains data from a survey of clerical employees. 242 | 243 | - Create a linear model regressing `rating` on `complaints`, and store the model 244 | in a variable. 245 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 246 | m14 = lm(rating~complaints,data=attitude) 247 | ``` 248 | 249 | - Use the step function to perform forward selection stepwise regression, in order to automatically add appropriate terms, using a command similar to: 250 | `new_model = step(original_model,.~.+privileges+learning+raises+critical+advance)` 251 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 252 | m15 = step(m14,.~.+privileges+learning+raises+critical+advance) 253 | ``` 254 | 255 | - Which term(s) were added? What is Akaike's Information Criterion (AIC) corresponding to the final model? Are all terms in the resulting model significant? Check diagnostic plots. Do you think this is a suitable model? 256 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 257 | # just the `learning` term was added 258 | # AIC = 118.00 for the final model 259 | summary(m15) 260 | # Only the `complaints` term is significant in the final model - the intercept and coefficient of `learning` are not significant. 261 | plot(m15) 262 | # Despite the residuals not being perfectly Normally distributed, the model does seem reasonable. 263 | ``` 264 | -------------------------------------------------------------------------------- /multiple_regression.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Multiple Regression with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | # Section 1: Multiple Regression 31 | 32 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Now we will expand this by considering `Height` as another predictor. 
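It can be worth a quick look at the structure of the data first (an optional check, not in the original walkthrough):

```{r}
str(trees)   # 31 observations of three numeric variables: Girth, Height, Volume
head(trees)
```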
33 | 34 | Start by plotting the dataset: 35 | ```{r} 36 | plot(trees) 37 | ``` 38 | 39 | This plots all variables against each other, enabling visual information about correlations within the dataset. 40 | 41 | Re-create the original model of `Volume` against `Girth`: 42 | ```{r} 43 | m1 = lm(Volume~Girth,data=trees) 44 | summary(m1) 45 | ``` 46 | 47 | Now include `Height` as an additional variable: 48 | ```{r} 49 | m2 = lm(Volume~Girth+Height,data=trees) 50 | summary(m2) 51 | ``` 52 | 53 | Note that the R^2 has improved, yet the `Height` term is less significant than the other two parameters. 54 | 55 | Try including the interaction term between `Girth` and `Height`: 56 | ```{r} 57 | m3 = lm(Volume~Girth*Height,data=trees) 58 | summary(m3) 59 | ``` 60 | 61 | All terms are highly significant. Note that the `Height` is more significant than in the previous model, despite the introduction of an additional parameter. 62 | 63 | We'll now try a different functional form - rather than looking for an additive model, we can explore a multiplicative model by applying a log-log transformation (leaving out the interaction term for now). 64 | ```{r} 65 | m4 = lm(log(Volume)~log(Girth)+log(Height),data=trees) 66 | summary(m4) 67 | ``` 68 | 69 | All terms are significant. Note that the residual standard error is much lower than for the previous models. However, this value cannot be compared with the previous models due to transforming the response variable. The R^2 value has increased further, despite reducing the number of parameters from four to three. 70 | ```{r} 71 | confint(m4) 72 | ``` 73 | 74 | Looking at the confidence intervals for the parameters reveals that the estimated power of `Girth` is around 2, and `Height` around 1. This makes a lot of sense, given the well-known dimensional relationship between `Volume`, `Girth` and `Height`! 75 | 76 | For completeness, we'll now add the interaction term. 77 | ```{r} 78 | m5 = lm(log(Volume)~log(Girth)*log(Height),data=trees) 79 | summary(m5) 80 | ``` 81 | 82 | The R^2 value has increased (of course, as all we've done is add an additional parameter), but interestingly none of the four terms are significant. This means that none of the individual terms alone are vital for the model - there is duplication of information between the variables. So we will revert back to the previous model. 83 | 84 | Given that it would be reasonable to expect the power of `Girth` to be 2, and Height to be 1, we will now fix those parameters, and instead just estimate the one remaining parameter. 85 | ```{r} 86 | m6 = lm(log(Volume)-log((Girth^2)*Height)~1,data=trees) 87 | summary(m6) 88 | ``` 89 | 90 | Note that there is no R^2 (as only the intercept was included in the model), and that the Residual Standard Error is incomparable with previous models due to changing the response variable. 91 | 92 | We can alternatively construct a model with the response being y, and the error term additive rather than multiplicative. 93 | ```{r} 94 | m7 = lm(Volume~0+I(Girth^2):Height,data=trees) 95 | summary(m7) 96 | ``` 97 | 98 | Note that the parameter estimates for the last two models are slightly different... this is due to differences in the error model. 99 | 100 | # Section 2: Model Selection 101 | 102 | Of the last two models, the one with the log-Normal error model would seem to have the more Normal residuals. 
This can be inspected by looking at diagnostic plots, by and using the `shapiro.test()`: 103 | ```{r} 104 | plot(m6) 105 | plot(m7) 106 | shapiro.test(residuals(m6)) 107 | shapiro.test(residuals(m7)) 108 | ``` 109 | 110 | The Akaike Information Criterion (AIC) can help to make decisions regarding which model is the most appropriate. Now calculate the AIC for each of the above models: 111 | ```{r} 112 | summary(m1) 113 | AIC(m1) 114 | summary(m2) 115 | AIC(m2) 116 | summary(m3) 117 | AIC(m3) 118 | summary(m4) 119 | AIC(m4) 120 | summary(m5) 121 | AIC(m5) 122 | summary(m6) 123 | AIC(m6) 124 | summary(m7) 125 | AIC(m7) 126 | ``` 127 | 128 | Whilst the AIC can help differentiate between similar models, it cannot help deciding between models that have different responses. Which model would you select as the most appropriate? 129 | 130 | # Section 3: Stepwise Regression 131 | 132 | The in-built dataset `swiss` contains data pertaining to fertility, along with a variety of socioeconomic indicators. We want to select a sensible model using stepwise regression. First regress `Fertility` agains all available indicators: 133 | ```{r} 134 | m8 = lm(Fertility~.,data=swiss) 135 | summary(m8) 136 | ``` 137 | 138 | Are all terms significant? 139 | 140 | Now use stepwise regression, performing backward elimination in order to automatically remove inappropriate terms: 141 | ```{r} 142 | library(MASS) 143 | summary(stepAIC(m8)) 144 | ``` 145 | 146 | Are all terms significant? Is this model suitable? What are the pro's and con's of this approach? 147 | 148 | # Section 4: Non-Linear Models 149 | 150 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. 151 | 152 | In the Simple Regression session, we constructed a simple linear model for `Volume` using `Girth` as the independent variable. Using Multiple Regression we trialled various models, including some that had multiple predictor variables and/or involved log-log transformations to explore power relationships. 153 | 154 | However, due to limitations of the method, we were not able to explore other options such as a parameterised power relationship with an additive error model. We will now attempt to fit this model: 155 | 156 | $y = \beta_0x_1^{\beta_1}x_2^{\beta_2}+\varepsilon$ 157 | 158 | Parameters for non-linear models may be estimated using the `nls` package in R. 159 | 160 | ```{r} 161 | volume = trees$Volume 162 | height = trees$Height 163 | girth = trees$Girth 164 | m9 = nls(volume~beta0*girth^beta1*height^beta2,start=list(beta0=1,beta1=2,beta2=1)) 165 | summary(m9) 166 | ``` 167 | Note that the parameters `beta0`, `beta1` and `beta2` weren't defined prior to the function call - `nls` knew what to do with them. Also note that we had to provide starting points for the parameters. What happens if you change them? 168 | 169 | Are all terms significant? Is this model appropriate? What else could be tried to achieve a better model? 170 | 171 | # Section 5: Practical Exercises 172 | 173 | ## Puromycin 174 | 175 | The in-built R dataset `Puromycin` contains data regarding the reaction velocity versus 176 | substrate concentration in an enzymatic reaction involving untreated cells or cells 177 | treated with Puromycin. 178 | 179 | - Plot `conc` (concentration) against `rate`. What is the nature of the relationship 180 | between `conc` and `rate`? 181 | - Find a transformation that linearises the data and stabilises the variance, 182 | making it possible to use linear regression. 
Create the corresponding linear 183 | regression model. Are all terms significant? 184 | - Add the `state` term to the model. What type of variable is this? Is the 185 | inclusion of this term appropriate? 186 | - Now add a term representing the interaction between `rate` and `state`. Are all 187 | terms significant? What can you conclude? 188 | - Given this information, create the regression model you believe to be the most 189 | appropriate for modelling `conc`. Regenerate the plot of `conc` against `rate`. 190 | Draw curves corresponding to the fitted values of the final model onto this 191 | plot (note that two separate curves should be drawn, corresponding to the 192 | two levels of `state`). 193 | 194 | ## Attitude 195 | 196 | The in-built R dataset `attitude` contains data from a survey of clerical employees. 197 | 198 | - Create a linear model regressing `rating` on `complaints`, and store the model 199 | in a variable. 200 | - Use the step function to perform forward selection stepwise regression, in order to automatically add appropriate terms, using a command similar to: 201 | `new_model = step(original_model,.~.+privileges+learning+raises+critical+advance)` 202 | - Which term(s) were added? What is Akaike's Information Criterion (AIC) corresponding to the final model? Are all terms in the resulting model significant? Check diagnostic plots. Do you think this is a suitable model? 203 | 204 | -------------------------------------------------------------------------------- /multiple_regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/multiple_regression.pdf -------------------------------------------------------------------------------- /r-recap.Rmd: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | title: "Recap of Statistical Analysis in R" 4 | author: Chandra Chilamakuri, Dominique-Laurent Couturier, Rob Nicholls 5 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 6 | output: 7 | html_notebook: 8 | toc: yes 9 | toc_float: yes 10 | --- 11 | 12 | 13 | 14 | 15 | # Introduction 16 | 17 | The purpose of this section is to review some of the key concepts in basic R usage, and statistical testing 18 | 19 | - Reading data into R 20 | - The data-frame representation of data in R 21 | - Selecting rows and columns from a data frame 22 | - Computing numerical summaries 23 | - Basic plotting 24 | - Getting help on functions in RStudio 25 | 26 | 27 | ## About this tutorial 28 | 29 | - The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left panel when RStudio opens for first time). 30 | - However, for this course we will use a relatively new feature called R-notebooks. 31 | - An R-notebook mixes plain text with R code 32 | + The R code can be run from inside the document and the results are displayed directly underneath 33 | - Each chunk of R code looks something like this. 34 | 35 | ```{r} 36 | 37 | ``` 38 | 39 | - Each line of R can be executed by clicking on the line and pressing CTRL and ENTER 40 | - Or you can press the green triangle on the right-hand side to run everything in the chunk 41 | + Try this now! 
42 | 43 | ```{r} 44 | print("Hello World") 45 | ``` 46 | 47 | - You can add R chunks by pressing CTRL + ALT + I 48 | + or using the Insert menu option 49 | + (can also include code from other languages such as Python or bash) 50 | 51 | The document may also contain other formatting options that are used to render the HTML (or PDF, Word) output. 52 | 53 | Here is some *italic* text, but we can also write in **bold**, or write things 54 | 55 | - in 56 | - a 57 | - list 58 | + which include sub-lists 59 | 60 | 61 | 62 | # Example Analysis 63 | 64 | We will use a dataset from The University of Sheffield Mathematics and Statistics Help group ([MASH](https://www.sheffield.ac.uk/mash/statistics2/anova)). 65 | 66 | > The data set Diet.csv contains information on 78 people who undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight so the independent variable (group) is diet. 67 | 68 | 69 | ## Reading and inspecting the data 70 | 71 | 72 | Like other software (Word, Excel, Photoshop….), R has a default location where it will save files to and import data from. This is known as the working directory in R. You can query what R currently considers its working directory by executing the following R command:- 73 | 74 | ```{r} 75 | getwd() 76 | ``` 77 | 78 | *N.B.* Here, a set of open and closed brackets () is used to run the `getwd` function with no arguments. 79 | *Note:* if you are following this material on a Windows machine as opposed to a Linux or MacOS machine 80 | you will get a path like C:\Users\Fred. If you want to use the complementary R command `setwd()` to set 81 | the working directory you MUST escape the \ i.e. setwd("C:\\Users\\Fred"). 82 | We can also list the files in a specific directory with:- 83 | 84 | ```{r} 85 | list.files("data/") 86 | ``` 87 | 88 | A useful sanity check is the `file.exists` function, which will print TRUE if the file can be found in the working directory. 89 | 90 | ```{r} 91 | file.exists("data/diet.csv") 92 | ``` 93 | 94 | 95 | 96 | - Assuming the file can be found, we can use the `read.csv` function to import the data. Other functions can be used to read tab-delimited files (`read.delim`), or the more generic `read.table`. A data frame object is created. 97 | - The file name `diet.csv` is the only *argument* to the function `read.csv` 98 | + arguments are listed inside the brackets 99 | + for functions requiring more than one argument (input), arguments are separated by commas 100 | + a function may have default values for some arguments, meaning they do not need to be specified 101 | - The characters `<-` are used to tell R to create a variable 102 | + without this, the data are not loaded into memory and you won't be able to work with them 103 | - If you get an error saying `Error in file(file, “rt”) : cannot open the connection...`, you might need to change your working directory or make sure the file name is typed correctly (R is case-sensitive) 104 | - Typing the name of an object will cause R to print the contents to the screen 105 | 106 | ```{r} 107 | diet <- read.csv("data/diet.csv") 108 | diet 109 | ``` 110 | 111 | ### A note on importing your own data 112 | 113 | If you are trying to read your own data, and encounter an error at this stage, you may need to consider if your data are in the correct form for analysis.
Like most programming languages, R will struggle if your spreadsheet has been heavily formatted to include colours, formulas and special formatting. 114 | 115 | These references will guide you through some of the pitfalls and common mistakes to avoid when formatting data 116 | 117 | - [Formatting data tables in Spreadsheets](http://www.datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html) 118 | - [Data Organisation tutorial by Karl Broman](http://kbroman.org/dataorg/) 119 | - [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide/blob/master/README.md) 120 | 121 | 122 | 123 | 124 | `diet` is an example of a data frame. The data frame object in R allows us to work with “tabular” data, like we might be used to dealing with in Excel, where our data can be thought of having rows and columns. The values in each column have to all be of the same type (i.e. all numbers or all text). 125 | 126 | - the `summary` function will provide a overview of the contents of each column in the table 127 | + the type of summary provided depends on the data type in each column 128 | 129 | ```{r} 130 | summary(diet) 131 | ``` 132 | 133 | - particular columns can be accessed using the `$` operator 134 | + ***TIP*** RStudio will allow auto-complete using the *Tab* key 135 | 136 | ```{r} 137 | diet$gender 138 | diet$age 139 | 140 | 141 | ``` 142 | 143 | We can create new columns based on existing ones 144 | 145 | ```{r} 146 | diet$weight.loss <- diet$final.weight - diet$initial.weight 147 | 148 | ``` 149 | 150 | Subsetting rows and columns is done using the `[rows, columns]` syntax; where `rows` and `columns` are *vectors* containing the rows and columns you want 151 | 152 | - you can choose to omit either vector to show all rows and columns. *However, you still need to remember the `,` 153 | 154 | ```{r} 155 | diet[1:5,] 156 | diet[,2:3] 157 | ``` 158 | 159 | Logical tests can be used to select rows. e.g. using `==`, `<`, `>` 160 | 161 | ```{r} 162 | diet$diet.type == "A" 163 | 164 | dietA <- diet[diet$diet.type == "A",] 165 | dietA 166 | ``` 167 | 168 | 169 | 170 | ## Visualisation 171 | 172 | All your favourite types of plot can be created in R 173 | 174 | 175 | - Simple plots are supported in the *base* distribution of R (what you get automatically when you download R). 176 | + `boxplot`, `hist`, `barplot`,... all of which are extensions of the basic `plot` function 177 | - Many different customisations are possible 178 | + colour, overlay points / text, legends, multi-panel figures 179 | - ***You need to think about how best to visualise your data*** 180 | + http://www.bioinformatics.babraham.ac.uk/training.html#figuredesign 181 | - R cannot prevent you from creating a plotting disaster: 182 | + http://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6?op=1&IR=T 183 | 184 | 185 | Plots can be constructed from vectors of numeric data, such as the data we get from a particular column in a data frame. 
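For example (a small illustrative chunk that is not part of the original notebook, reusing the `diet` data frame loaded above), a barplot of the number of people following each diet can be built directly from a single column:

```{r}
# tabulate the diet.type column, then plot the counts as a barplot
barplot(table(diet$diet.type),
        xlab = "Diet Type", ylab = "Number of people")
```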
186 | 187 | - a histogram is commonly-used to examine the distribution of a particular variable 188 | 189 | ```{r} 190 | hist(diet$weight.loss) 191 | ``` 192 | 193 | - a boxplot is often used to compare distributions visually 194 | + if given a data-frame, each column will be shown as a separate box 195 | + otherwise the formula syntax `~` is used to define x and y variables 196 | 197 | ```{r} 198 | boxplot(diet$weight.loss~diet$diet.type) 199 | 200 | ``` 201 | 202 | - scatter plots can be constructed by given two vectors as arguments to `plot` 203 | 204 | ```{r} 205 | plot(diet$age,diet$initial.weight) 206 | ``` 207 | 208 | 209 | *Lots* of customisations are possible to enhance the appaerance of our plots. Not for the faint-hearted, the help pages `?plot` and `?par` give the full details. In short, 210 | 211 | - Axis labels, and titles can be specified as character strings. 212 | 213 | - R recognises many preset names as colours. To get a full list use `colours()`, or check this [online reference](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf). 214 | + can also use `*R*ed, *G*reen, *B*lue values; which you might get from a paint program 215 | - Plotting characters can be specified using a pre-defined number 216 | 217 | ```{r} 218 | boxplot(diet$weight.loss~diet$diet.type, 219 | ylab="Weight Loss", 220 | xlab="Diet Type", 221 | col=c("yellow","blue","red"), 222 | main="Weight Loss According to diet type") 223 | ``` 224 | 225 | You can get help on any of the functions that we will be using in this course by using the '?' or 'help()' commands. The help will appear in the help pane (usually bottom RH corner) . 226 | 227 | ```{r} 228 | ?lm 229 | ``` 230 | 231 | ```{r} 232 | help(lm) 233 | ``` 234 | -------------------------------------------------------------------------------- /simple_regression+.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Simple Regression with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | # Section 1: Correlation Coefficients 31 | 32 | We'll start by generating some synthetic data to investigate correlation coefficients. 
33 | 34 | Generate 50 random numbers in the range [0,50]: 35 | ```{r} 36 | x = runif(50,0,50) 37 | ``` 38 | 39 | Now let's generate some y-values that are linearly correlated with the x-values with gradient=1, applying a random Normal offset (with sd=5): 40 | ```{r} 41 | y = x + rnorm(50,0,5) 42 | ``` 43 | 44 | Plotting y against x, you'll observe a positive linear relationship: 45 | ```{r} 46 | plot(y~x) 47 | ``` 48 | 49 | This strong linear relationship is reflected in the correlation coefficient and in the coefficient of determination (R^2): 50 | ```{r} 51 | pearson_cor_coef = cor(x,y) 52 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 53 | ``` 54 | 55 | If the data exhibit a negative linear correlation then the correlation coefficient will become strong and negative, whilst the R^2 value will remain strong and positive: 56 | ```{r} 57 | y = -x + rnorm(50,0,5) 58 | plot(y~x) 59 | pearson_cor_coef = cor(x,y) 60 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 61 | ``` 62 | 63 | If data are uncorrelated then both the correlation coefficient and R^2 values will be close to zero: 64 | ```{r} 65 | y = rnorm(50,0,5) 66 | plot(y~x) 67 | pearson_cor_coef = cor(x,y) 68 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 69 | ``` 70 | 71 | The significance of a correlation can be tested using `cor.test()`, which also provides a 95% confidence interval on the correlation: 72 | ```{r} 73 | cor.test(x,y) 74 | ``` 75 | 76 | In this case, the value 0 is contained within the confidence interval, indivating that there is insufficient evidence to reject the null hypothesis that the true correlation is equal to zero. 77 | 78 | # Section 2: Simple Regression 79 | 80 | Now let's look at some real data. 81 | 82 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. 83 | 84 | We will now attempt to construct a simple linear model that uses `Girth` to predict `Volume`. 85 | ```{r} 86 | plot(Volume~Girth,data=trees) 87 | m1 = lm(Volume~Girth,data=trees) 88 | abline(m1) 89 | cor.test(trees$Volume,trees$Girth) 90 | ``` 91 | 92 | It is evident that `Volume` and `Girth` are highly correlated. 93 | 94 | The summary for the linear model provides information regarding the quality of the model: 95 | ```{r} 96 | summary(m1) 97 | ``` 98 | 99 | Model residuals can be readily accessed using the `residuals()` function: 100 | ```{r} 101 | hist(residuals(m1),breaks=10,col="light grey") 102 | ``` 103 | 104 | Diagnostic plots for the model can reveal whether or not modelling assumptions are reasonable. In this case, there is visual evidence to suggest that the assumptions are not satisfied - note in particular the trend observed in the plot of residuals vs fitted values: 105 | ```{r} 106 | plot(m1) 107 | ``` 108 | 109 | # Section 3: Assessing the quality of linear models 110 | 111 | Let's see what happens if we try to describe a non-linear relationship using a linear model. Consider the sine function in the range [0,1.5*pi): 112 | ```{r} 113 | z = seq(0,1.5*pi,0.2) 114 | plot(sin(z)~z) 115 | m2 = lm(sin(z)~z) 116 | abline(m2) 117 | ``` 118 | 119 | In this case, it is clear that a linear model is not appropriate for describing the relationship. 
However, we are able to fit a linear model, and the linear model summary does not identify any major concerns: 120 | ```{r} 121 | summary(m2) 122 | ``` 123 | Here we see that the overall p-value is low enough to suggest that the model has significant utility, and both terms (the intercept and the coefficient of `z`) are significantly different from zero. The R^2 value of 0.5422 is high enough to indicate that there is a reasonably strong correlation between `sin(z)` and `z` in this range. 124 | 125 | This information is misleading, as we know that a linear model is inappropriate in this case. Indeed, the linear model summary does not check whether the underlying model assumptions are satisfied. 126 | 127 | By observing strong patterns in the diagnostic plots, we can see that the modelling assumptions are not satisfied in this case. 128 | ```{r} 129 | plot(m2) 130 | ``` 131 | 132 | 133 | # Section 4: Modelling Non-Linear Relationships 134 | 135 | It is sometimes possible to use linear models to describe non-linear relationships (which is perhaps counterintuitive!). This can be achieved by applying transformations to the variable(s) in order to linearise the relationship, whilst ensuring that modelling assumptions are satisfied. 136 | 137 | Another in-built dataset `cars` provides the speeds and associated stopping distances of cars in the 1920s. 138 | 139 | Let's construct a linear model to predict stopping distance using speed: 140 | 141 | ```{r} 142 | plot(dist~speed,data=cars) 143 | m3 = lm(dist~speed,data=cars) 144 | abline(m3) 145 | summary(m3) 146 | ``` 147 | 148 | The model summary indicates that the intercept term does not have significant utility. So that term could/should be removed from the model. 149 | 150 | In addition, the plot of residuals versus fitted values indicates potential issues with variance stability: 151 | ```{r} 152 | plot(m3) 153 | ``` 154 | 155 | In this case, variance stability can be aided by a square-root transformation of the response variable: 156 | ```{r} 157 | plot(sqrt(dist)~speed,data=cars) 158 | m4 = lm(sqrt(dist)~speed,data=cars) 159 | abline(m4) 160 | plot(m4) 161 | summary(m4) 162 | ``` 163 | 164 | The R^2 value is improved over the previous model. 165 | Note again that the intercept term is not significant. 166 | 167 | We'll now try a log-log transformation, that is, applying a log transformation to both the predictor and response variables. This represents a power relationship between the two variables. 168 | ```{r} 169 | plot(log(dist)~log(speed),data=cars) 170 | m5 = lm(log(dist)~log(speed),data=cars) 171 | abline(m5) 172 | plot(m5) 173 | summary(m5) 174 | ``` 175 | 176 | The R^2 value is improved, and the diagnostic plots don't look too unreasonable. However, again the intercept term does not have significant utility. So we'll now remove it from the model: 177 | ```{r} 178 | m6 = lm(log(dist)~0+log(speed),data=cars) 179 | plot(m6) 180 | summary(m6) 181 | ``` 182 | 183 | This model seems reasonable. However, remember that R^2 values corresponding to models without an intercept aren't meaningful (or at least can't be compared against models with an intercept term).
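If we do want to compare the log-log fits with and without an intercept, two options are sketched below (an addition to the original text, reusing `m5` and `m6` from above). AIC is applicable here because both models share the same response, `log(dist)`; the squared correlation between observed and fitted values is another summary that can be computed for both models:

```{r}
# AIC comparison of the log-log models with (m5) and without (m6) an intercept
AIC(m5, m6)
# squared correlation between observed and fitted log(dist) for each model
cor(log(cars$dist), fitted(m5))^2
cor(log(cars$dist), fitted(m6))^2
```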
184 | 185 | We can now transform the model back, and display the regression curve on the plot: 186 | ```{r} 187 | plot(dist~speed,data=cars) 188 | x = order(cars$speed) 189 | lines(exp(fitted(m6))[x]~cars$speed[x]) 190 | ``` 191 | 192 | # Section 5: Relationship between the t-test, ANOVA and linear regression 193 | 194 | In the ANOVA session we looked at the `diet` dataset, and performed the t-test and ANOVA. Here's a recap: 195 | 196 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 197 | # import 198 | diet = read.csv("data/diet.csv",row.names=1) 199 | diet$weight.loss = diet$initial.weight - diet$final.weight 200 | diet$diet.type = factor(diet$diet.type,levels=c("A","B","C")) 201 | diet$gender = factor(diet$gender,levels=c("Female","Male")) 202 | # comparison 203 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="B",],var.equal = TRUE) 204 | summary(aov(weight.loss~diet.type,data=diet[diet$diet.type!="B",])) 205 | ``` 206 | 207 | Note that the p-values for both the t-test and ANOVA are the same. This is because these tests are equivalent (in the 2-sample case). They both test the same hypothesis. 208 | 209 | Also, the F-test statistic is equal to the square of the t-test statistic (-2.8348^2 = 8.036). Again, this is only true for the 2-sample case. 210 | 211 | Now let's use a different strategy. Instead of directly testing whether there is a difference between the two groups, let's attempt to create a linear model describing the relationship between `weight.loss` and `diet.type`. Indeed, it is possible to construct a linear model where the independent variable(s) are categorical - they do not have to be continuous or even ordinal! 212 | 213 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 214 | summary(lm(weight.loss~diet.type,data=diet[diet$diet.type!="B",])) 215 | ``` 216 | 217 | You can see that the p-value corresponding to the `diet.type` term is the same as the overall p-value of the linear model, which is also the same as the p-value from the t-test and ANOVA. Note also that the F-test statistic is the same as given by the ANOVA. 218 | 219 | So, we are also able to use the linear model to test the hypothesis that there is a difference between the two diet groups, as well as provide a more detailed description of the relationship between `weight.loss` and `diet.type`. 220 | 221 | # Section 6: Practical Exercises 222 | 223 | ## Old Faithful 224 | 225 | The inbuilt R dataset `faithful` pertains to the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. 226 | 227 | - Create a simple linear regression model that models the eruption duration `faithful$eruptions` using waiting time `faithful$waiting` as the independent variable, storing the model in a variable. Look at the summary of the model. 228 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 229 | m7 = lm(eruptions~waiting,data=faithful) 230 | summary(m7) 231 | ``` 232 | + What are the values of the estimates of the intercept and coefficient of 'waiting'? 233 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 234 | # intercept = -1.874016 235 | # coef of waiting = 0.075628 236 | ``` 237 | + What is the R^2 value? 238 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 239 | # R^2 = 0.8115 240 | ``` 241 | + Does the model have significant utility? 
242 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 243 | # Yes, the model does have significant utility 244 | ``` 245 | + Are neither, one, or both of the parameters significantly different from zero? 246 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 247 | # Both of the parameters are significantly different from zero 248 | ``` 249 | + Can you conclude that there is a linear relationship between the two variables? 250 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 251 | # In the absence of other information, this summary would indicate a linear relationship between the two variables. However, we cannot conclude that without first checking that the modelling assumptions have been satistified... 252 | ``` 253 | - Plot the eruption duration against waiting time. Is there anything noticeable about the data? 254 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 255 | plot(eruptions~waiting,data=faithful) 256 | # The observations appear to cluster in two groups. 257 | ``` 258 | - Draw the regression line corresponding to your model onto the plot. Based on this graphical representation, does the model seem reasonable? 259 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 260 | plot(eruptions~waiting,data=faithful) 261 | abline(m7) 262 | # At a glance, the model seems to describe the overall dependence of eruptions on waiting time reasonably well. However, this is misleading... 263 | ``` 264 | - Generate the four diagnostic plots corresponding to your model. Contemplate the appropriateness of the model for describing the relationship between eruption duration and waiting time. 265 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 266 | plot(m7) 267 | # There is strong systematic behaviour in the plot of residuals versus fitted values. This indicates that the relationship/dependence is different or more complicated than can be described with the simple linear model. 268 | # Specifically, it should be identified what causes observations to fall into one or other of the two groups. Differences between the two groups should be accounted for when modelling the relationship. It seems that the direct dependence of `eruptions` on `waiting` is not as strong as is indicated by the simple linear model. 269 | ``` 270 | 271 | ## Anscombe datasets 272 | 273 | Consider the inbuilt R dataset `anscombe`. This dataset contains four x-y datasets, contained in the columns: (x1,y1), (x2,y2), (x3,y3) and (x4,y4). 274 | 275 | - For each of the four datasets, calculate and test the correlation between the x and y variables. What do you conclude? 276 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 277 | cor(anscombe$x1,anscombe$y1) 278 | cor.test(anscombe$x1,anscombe$y1) 279 | cor(anscombe$x2,anscombe$y2) 280 | cor.test(anscombe$x2,anscombe$y2) 281 | cor(anscombe$x3,anscombe$y3) 282 | cor.test(anscombe$x3,anscombe$y3) 283 | cor(anscombe$x4,anscombe$y4) 284 | cor.test(anscombe$x4,anscombe$y4) 285 | # All four datasets seem to exhibit positive linear relationships, with the same correlation and the same p-value. 286 | ``` 287 | - For each of the four datasets, create a linear model that regresses y on x. Look at the summaries corresponding to these models. What do you conclude? 
288 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 289 | summary(lm(anscombe$y1~anscombe$x1)) 290 | summary(lm(anscombe$y2~anscombe$x2)) 291 | summary(lm(anscombe$y3~anscombe$x3)) 292 | summary(lm(anscombe$y4~anscombe$x4)) 293 | # The summaries are essentially identical for all four linear models. 294 | ``` 295 | - For each of the four datasets, create a plot of y against x. What do you conclude? 296 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 297 | plot(anscombe$y1~anscombe$x1) 298 | plot(anscombe$y2~anscombe$x2) 299 | plot(anscombe$y3~anscombe$x3) 300 | plot(anscombe$y4~anscombe$x4) 301 | # The four datasets are very different, with very different relationships between the x and y variables. 302 | # This demonstrates how very different datasets can appear to be very similar when looking solely at summary statistics. 303 | # We conclude that it is always important to peform exploratory data analysis, and look at the data before modelling. 304 | ``` 305 | 306 | 307 | ## Pharmacokinetics of Indomethacin 308 | 309 | Consider the inbuilt R dataset `Indometh`, which contains data on the pharmacokinetics of indometacin. 310 | 311 | - Plot `Indometh$time` versus `Indometh$conc` (concentration). What is the nature of the relationship 312 | between `time` and `conc`? 313 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 314 | plot(time~conc,data=Indometh) 315 | # There is a non-linear negative relationship between time and conc 316 | ``` 317 | - Apply monotonic transformations to the data so that a simple linear regression model can be used to model the relationship (ensure both linearity and stabilised variance, within reason). Create a plot of the transformed data, to confirm that the relationship seems linear. 318 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 319 | plot(log(time)~log(conc),data=Indometh) 320 | ``` 321 | - After creating the linear model, inspect the diagnostic plots to ensure that the 322 | assumptions are not violated (too much). Are there any outliers with large influence? What are the parameter estimates? Are both terms significant? 323 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 324 | m8 = lm(log(time)~log(conc),data=Indometh) 325 | plot(m8) 326 | # The diagnostic plots indicate that the residuals aren't perfectly Normally distributed, but the modelling assumptions aren't violated so much as to inhibit construction of a model. 327 | summary(m8) 328 | # Intercept = -0.4203 329 | # Coefficient of log(conc) = -0.9066 330 | # Both terms are significantly different from zero. 331 | ``` 332 | - Add a line to the plot showing the linear relationship between the transformed data. 333 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 334 | plot(log(time)~log(conc),data=Indometh) 335 | abline(m8) 336 | ``` 337 | - Now regenerate the original plot of `time` versus `conc` (i.e. the untransformed 338 | data). Using the `lines` function, add a curve to the plot corresponding to the 339 | fitted values of the model. 340 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 341 | plot(time~conc,data=Indometh) 342 | idx <- order(Indometh$conc) 343 | lines(exp(fitted(m8))[idx]~Indometh$conc[idx]) 344 | ``` 345 | -------------------------------------------------------------------------------- /simple_regression.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Simple Regression with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. 
Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | # Section 1: Correlation Coefficients 31 | 32 | We'll start by generating some synthetic data to investigate correlation coefficients. 33 | 34 | Generate 50 random numbers in the range [0,50]: 35 | ```{r} 36 | x = runif(50,0,50) 37 | ``` 38 | 39 | Now let's generate some y-values that are linearly correlated with the x-values with gradient=1, applying a random Normal offset (with sd=5): 40 | ```{r} 41 | y = x + rnorm(50,0,5) 42 | ``` 43 | 44 | Plotting y against x, you'll observe a positive linear relationship: 45 | ```{r} 46 | plot(y~x) 47 | ``` 48 | 49 | This strong linear relationship is reflected in the correlation coefficient and in the coefficient of determination (R^2): 50 | ```{r} 51 | pearson_cor_coef = cor(x,y) 52 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 53 | ``` 54 | 55 | If the data exhibit a negative linear correlation then the correlation coefficient will become strong and negative, whilst the R^2 value will remain strong and positive: 56 | ```{r} 57 | y = -x + rnorm(50,0,5) 58 | plot(y~x) 59 | pearson_cor_coef = cor(x,y) 60 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 61 | ``` 62 | 63 | If data are uncorrelated then both the correlation coefficient and R^2 values will be close to zero: 64 | ```{r} 65 | y = rnorm(50,0,5) 66 | plot(y~x) 67 | pearson_cor_coef = cor(x,y) 68 | list("cor"=pearson_cor_coef,"R^2"=pearson_cor_coef^2) 69 | ``` 70 | 71 | The significance of a correlation can be tested using `cor.test()`, which also provides a 95% confidence interval on the correlation: 72 | ```{r} 73 | cor.test(x,y) 74 | ``` 75 | 76 | In this case, the value 0 is contained within the confidence interval, indivating that there is insufficient evidence to reject the null hypothesis that the true correlation is equal to zero. 77 | 78 | # Section 2: Simple Regression 79 | 80 | Now let's look at some real data. 81 | 82 | The in-built dataset `trees` contains data pertaining to the `Volume`, `Girth` and `Height` of 31 felled black cherry trees. 83 | 84 | We will now attempt to construct a simple linear model that uses `Girth` to predict `Volume`. 85 | ```{r} 86 | plot(Volume~Girth,data=trees) 87 | m1 = lm(Volume~Girth,data=trees) 88 | abline(m1) 89 | cor.test(trees$Volume,trees$Girth) 90 | ``` 91 | 92 | It is evident that `Volume` and `Girth` are highly correlated. 93 | 94 | The summary for the linear model provides information regarding the quality of the model: 95 | ```{r} 96 | summary(m1) 97 | ``` 98 | 99 | Model residuals can be readily accessed using the `residuals()` function: 100 | ```{r} 101 | hist(residuals(m1),breaks=10,col="light grey") 102 | ``` 103 | 104 | Diagnostic plots for the model can reveal whether or not modelling assumptions are reasonable. 
In this case, there is visual evidence to suggest that the assumptions are not satisfied - note in particular the trend observed in the plot of residuals vs fitted values: 105 | ```{r} 106 | plot(m1) 107 | ``` 108 | 109 | # Section 3: Assessing the quality of linear models 110 | 111 | Let's see what happens if we try to describe a non-linear relationship using a linear model. Consider the sine function in the range [0,1.5*pi): 112 | ```{r} 113 | z = seq(0,1.5*pi,0.2) 114 | plot(sin(z)~z) 115 | m0 = lm(sin(z)~z) 116 | abline(m0) 117 | ``` 118 | 119 | In this case, it is clear that a linear model is not appropriate for describing the relationship. However, we are able to fit a linear model, and the linear model summary does not identify any major concerns: 120 | ```{r} 121 | summary(m0) 122 | ``` 123 | Here we see that the overall p-value is low enough to suggest that the model has significant utility, and both terms (the intercept and the coefficient of `z`) are significantly different from zero. The R^2 value of 0.5422 is high enough to indicate that there is a reasonably strong correlation between `sin(z)` and `z` in this range. 124 | 125 | This information is misleading, as we know that a linear model is inappropriate in this case. Indeed, the linear model summary does not check whether the underlying model assumptions are satisfied. 126 | 127 | By observing strong patterns in the diagnostic plots, we can see that the modelling assumptions are not satisified in this case. 128 | ```{r} 129 | plot(m0) 130 | ``` 131 | 132 | 133 | # Section 4: Modelling Non-Linear Relationships 134 | 135 | It is sometimes possible to use linear models to describe non-linear relationships (which is perhaps counterintuitive!). This can be achieved by applying transformations to the variable(s) in order to linearise the relationship, whilst ensuring that modelling assumptions are satisfied. 136 | 137 | Another in-built dataset `cars` provides the speeds and associated stopping distances of cars in the 1920s. 138 | 139 | Let's construct a linear model to predict stopping distance using speed: 140 | 141 | ```{r} 142 | plot(dist~speed,data=cars) 143 | m2 = lm(dist~speed,data=cars) 144 | abline(m2) 145 | summary(m2) 146 | ``` 147 | 148 | The model summary indicates that the intercept term does not have significant utility. So that term could/should be removed from the model. 149 | 150 | In addition, the plot of residuals versus fitted values indicates potential issues with variance stability: 151 | ```{r} 152 | plot(m2) 153 | ``` 154 | 155 | In this case, variance stability can be aided by a square-root transformation of the response variable: 156 | ```{r} 157 | plot(sqrt(dist)~speed,data=cars) 158 | m3 = lm(sqrt(dist)~speed,data=cars) 159 | abline(m3) 160 | plot(m3) 161 | summary(m3) 162 | ``` 163 | 164 | The R^2 value is improved over the previous model. 165 | Note that again that the intercept term is not significant. 166 | 167 | We'll now try a log-log transformation, that is applying a log transformation to the predictor and response variables. This represents a power relationship between the two variables. 168 | ```{r} 169 | plot(log(dist)~log(speed),data=cars) 170 | m4 = lm(log(dist)~log(speed),data=cars) 171 | abline(m4) 172 | plot(m4) 173 | summary(m4) 174 | ``` 175 | 176 | The R^2 value is improved, and the diagnostic plots don't look too unreasonable. However, again the intercept term does not have significant utility. 
So we'll now remove it from the model: 177 | ```{r} 178 | m5 = lm(log(dist)~0+log(speed),data=cars) 179 | plot(m5) 180 | summary(m5) 181 | ``` 182 | 183 | This model seems reasonable. However, remember that R^2 values corresponding to models without an intercept aren't meaningful (or at least can't be compared against models with an intercept term). 184 | 185 | We can now transform the model back, and display the regression curve on the plot: 186 | ```{r} 187 | plot(dist~speed,data=cars) 188 | x = order(cars$speed) 189 | lines(exp(fitted(m5))[x]~cars$speed[x]) 190 | ``` 191 | 192 | # Section 5: Relationship between the t-test, ANOVA and linear regression 193 | 194 | In the ANOVA session we looked at the `diet` dataset, and performed the t-test and ANOVA. Here's a recap: 195 | 196 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 197 | # import 198 | diet = read.csv("data/diet.csv",row.names=1) 199 | diet$weight.loss = diet$initial.weight - diet$final.weight 200 | diet$diet.type = factor(diet$diet.type,levels=c("A","B","C")) 201 | diet$gender = factor(diet$gender,levels=c("Female","Male")) 202 | # comparison 203 | t.test(weight.loss~diet.type,data=diet[diet$diet.type!="B",],var.equal = TRUE) 204 | summary(aov(weight.loss~diet.type,data=diet[diet$diet.type!="B",])) 205 | ``` 206 | 207 | Note that the p-values for both the t-test and ANOVA are the same. This is because these tests are equivalent (in the 2-sample case). They both test the same hypothesis. 208 | 209 | Also, the F-test statistic is equal to the square of the t-test statistic (-2.8348^2 = 8.036). Again, this is only true for the 2-sample case. 210 | 211 | Now let's use a different strategy. Instead of directly testing whether there is a difference between the two groups, let's attempt to create a linear model describing the relationship between `weight.loss` and `diet.type`. Indeed, it is possible to construct a linear model where the independent variable(s) are categorical - they do not have to be continuous or even ordinal! 212 | 213 | ```{r message = FALSE, warning = FALSE, echo = TRUE} 214 | summary(lm(weight.loss~diet.type,data=diet[diet$diet.type!="B",])) 215 | ``` 216 | 217 | You can see that the p-value corresponding to the `diet.type` term is the same as the overall p-value of the linear model, which is also the same as the p-value from the t-test and ANOVA. Note also that the F-test statistic is the same as given by the ANOVA. 218 | 219 | So, we are also able to use the linear model to test the hypothesis that there is a difference between the two diet groups, as well as provide a more detailed description of the relationship between `weight.loss` and `diet.type`. 220 | 221 | # Section 6: Practical Exercises 222 | 223 | ## Old Faithful 224 | 225 | The inbuilt R dataset `faithful` pertains to the waiting time between eruptions and 226 | the duration of the eruption for the Old Faithful geyser in Yellowstone National 227 | Park, Wyoming, USA. 228 | 229 | - Create a simple linear regression model that models the eruption duration `faithful$eruptions` using waiting time `faithful$waiting` as the independent variable, storing the model in a variable. Look at the summary of the model. 230 | + What are the values of the estimates of the intercept and coefficient of 'waiting'? 231 | + What is the R^2 value? 232 | + Does the model have significant utility? 233 | + Are neither, one, or both of the parameters significantly different from zero? 
234 | + Can you conclude that there is a linear relationship between the two variables? 235 | - Plot the eruption duration against waiting time. Is there anything noticeable 236 | about the data? 237 | - Draw the regression line corresponding to your model onto the plot. Based on this graphical representation, does the model seem reasonable? 238 | - Generate the four diagnostic plots corresponding to your model. Contemplate the appropriateness of the model for describing the relationship between eruption duration and waiting time. 239 | 240 | ## Anscombe datasets 241 | 242 | Consider the inbuilt R dataset `anscombe`. This dataset contains four x-y datasets, 243 | contained in the columns: (x1,y1), (x2,y2), (x3,y3) and (x4,y4). 244 | 245 | - For each of the four datasets, calculate and test the correlation between the x and y 246 | variables. What do you conclude? 247 | - For each of the four datasets, create a linear model that regresses y on x. Look 248 | at the summaries corresponding to these models. What do you conclude? 249 | - For each of the four datasets, create a plot of y against x. What do you 250 | conclude? 251 | 252 | ## Pharmacokinetics of Indomethacin 253 | 254 | Consider the inbuilt R dataset `Indometh`, which contains data on the pharmacokinetics of indometacin. 255 | 256 | - Plot `Indometh$time` versus `Indometh$conc` (concentration). What is the nature of the relationship 257 | between `time` and `conc`? 258 | - Apply monotonic transformations to the data so that a simple linear regression model can be used to model the relationship (ensure both linearity and stabilised variance, within reason). Create a plot of the transformed data, to confirm that the relationship seems linear. 259 | - After creating the linear model, inspect the diagnostic plots to ensure that the 260 | assumptions are not violated (too much). Are there any outliers with large influence? What are the parameter estimates? Are both terms significant? 261 | - Add a line to the plot showing the linear relationship between the transformed data. 262 | - Now regenerate the original plot of `time` versus `conc` (i.e. the untransformed 263 | data). Using the `lines` function, add a curve to the plot corresponding to the 264 | fitted values of the model. 265 | 266 | -------------------------------------------------------------------------------- /simple_regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/simple_regression.pdf -------------------------------------------------------------------------------- /time_series.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/time_series.pdf -------------------------------------------------------------------------------- /time_series_analysis.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | --- 5 | title: "Time Series Analysis with R" 6 | author: "D.-L. Couturier / R. Nicholls / C. 
Chilamakuri" 7 | date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' 8 | output: 9 | html_document: 10 | theme: united 11 | highlight: tango 12 | code_folding: show 13 | toc: true 14 | toc_depth: 2 15 | toc_float: true 16 | fig_width: 8 17 | fig_height: 6 18 | --- 19 | 20 | 21 | 22 | 23 | 24 | ```{r message = FALSE, warning = FALSE, echo = FALSE,eval=FALSE} 25 | # change working directory: should be the directory containg the Markdown files: 26 | #setwd("/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r/") 27 | 28 | ``` 29 | 30 | We will consider a dataset corresponding to the Monthly Southern Oscillation Index, measured as the difference in sea-surface air pressure between Darwin and Tahiti. 31 | 32 | ```{r} 33 | x=read.table("data/OscillationIndex.txt",header=TRUE) 34 | x$Index 35 | plot(x$Index,type="l",ylab="Oscillation Index") 36 | ``` 37 | 38 | Now let's look at the autocorrelation function corresponding to this dataset: 39 | ```{r} 40 | acf(x$Index,lag.max=70,main="") 41 | ``` 42 | 43 | There is clear long-range oscillatory behaviour in the autocorrelation function, indicating that the process is not stationary. 44 | 45 | We should consider an integrated (ARIMA) model, so let's calculate and plot the first differences, as well as the associated autocorrelation function: 46 | ```{r} 47 | plot(diff(x$Index),type="l",ylab="Oscillation Index (d=1)") 48 | acf(diff(x$Index),lag.max=70,main="") 49 | ``` 50 | 51 | That's more promising - there is one large negative peak at lag=1, after which the autocorrelation function decays rapidly and stays small. This indicates that this process is covariance stationary. This also indicates that the Moving Average (MA) part of the model may be of order 1. So an ARIMA(0,1,1) model might be a possibility. 52 | 53 | Now let's look at the partial autocorrelation function: 54 | ```{r} 55 | pacf(diff(x$Index),lag.max=70,main="") 56 | ``` 57 | 58 | This also looks promising. There are four negative peaks before the PACF decays below the significance threshold. That indicates that the AutoRegressive (AR) part of the model may have order up to 4. 59 | 60 | Now we'll try to create an ARIMA(0,1,1) model: 61 | ```{r} 62 | arima(x$Index,order=c(0,1,1)) 63 | ``` 64 | 65 | Note that the standard error of the coefficient indicates significance of the term. 66 | 67 | Now try creating other ARIMA models, and compare. 68 | 69 | There are a variety of time series datasets in the in-built R "datasets" package. Type `data()` to get a full list. For example, the datasets called `lh`, `ldeaths` and `presidents` are particularly appropriate for this type of analysis. Other datasets also contain time series data, including: `nhtemp`, `lynx`, `Nile`, `co2` and `WWWusage`. Explore such datasets - look at autocorrelation and partial autocorrelation functions, identify whether the datasets are suitable for time series analysis, and try fitting ARIMA models. 70 | -------------------------------------------------------------------------------- /timetable.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/linear-models-r/a6c1414d1dcba3cecc6b595e8b4081ec1f729e61/timetable.pdf --------------------------------------------------------------------------------