├── .RData ├── .Rhistory ├── README.md ├── data-eyetracking.csv ├── data-vocab-females.csv ├── data-vocab-individuals.csv ├── data-vocab.csv ├── independent problems │   ├── tutorial2-independent.R │   └── tutorial3-independent.R ├── old │   ├── README.md │   └── gca-tutorial.R ├── probabilities-refresher.R ├── tutorial1-introduction.Rmd ├── tutorial1-introduction.html ├── tutorial2-general-linear-models.Rmd ├── tutorial2-general-linear-models.html ├── tutorial3-generalized-linear-models.Rmd ├── tutorial3-generalized-linear-models.html ├── tutorial4-growth-curve-analysis.Rmd └── tutorial4-growth-curve-analysis.html /.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brockf/tutorial-GCA-R/364d56eae9b9e0fc767c5e79d561bf7177961fb8/.RData -------------------------------------------------------------------------------- /.Rhistory: -------------------------------------------------------------------------------- 1 | I have done some prep, though: 2 | * removed trackloss 3 | * converted GazeX and Y columns to AOI 1 or 0 columns 4 | * merged in Subject information 5 | # Analysis 1: By-subjects and By-items ANOVAs (i.e., F1 and F2) 6 | Pros: 7 | * simple 8 | * allows for generalization beyond sampled participants and sampled items. 9 | Cons: 10 | * lose all timing information 11 | * what do we do with ambiguities between analyses? 12 | We need to aggregate across trials by target within participants: 13 | ```{r} 14 | agg_subjects <- data %>% 15 | group_by(ParticipantName,Sex,Age,Target) %>% 16 | summarise(PropAnimal = mean(Animate)) %>% 17 | ungroup() 18 | ``` 19 | Visualize our aggregated data: 20 | ```{r} 21 | ggplot(agg_subjects, aes(x=Target, y=PropAnimal)) + 22 | geom_point(position=position_jitter(.3)) 23 | ``` 24 | Use our best practices from the `lm()` tutorial to prepare and model these data: 25 | ```{r} 26 | agg_subjects$TargetCoded <- ifelse(agg_subjects$Target == 'Artefact', -.5, .5) 27 | ``` 28 | There's no need to center here because these data are balanced (every subject is present in both conditions). But this is how we would center, anyway: 29 | ```{r} 30 | agg_subjects$TargetCoded <- scale(agg_subjects$TargetCoded, center=T, scale=F) 31 | ``` 32 | Here we use `aov()` because it allows for a repeated-measures Error() term. 33 | As we learned before, aov() uses Type I sums of squares, but with only one factor (i.e., no correlation issues), it's safe. 34 | ```{r} 35 | model <- aov(PropAnimal ~ TargetCoded + Error(ParticipantName/TargetCoded), data = agg_subjects) 36 | summary(model) 37 | ``` 38 | Looks good! That's our F1 "subjects" ANOVA. 39 | Now we can do an F2 "items" ANOVA as well. This just involves slightly changing our `group_by()` call: 40 | ```{r} 41 | agg_items <- data %>% 42 | group_by(Trial,Target) %>% 43 | summarise(PropAnimal = mean(Animate)) %>% 44 | ungroup() 45 | agg_items$TargetCoded <- ifelse(agg_items$Target == 'Artefact', -.5, .5) 46 | ``` 47 | Visualize effects by items: 48 | ```{r} 49 | ggplot(agg_items, aes(x=Target, y=PropAnimal, fill=Trial)) + 50 | geom_point(position=position_jitter(.3)) 51 | ``` 52 | Normally, in an F2 analysis, we would include an Error() term because we would have observed each condition within each item and thus have a sense of the size of the condition effect for each item. However, this was a study with infants and we couldn't do that. So we won't include an Error() term here, and will just do a between-items one-way ANOVA.
53 | ```{r} 54 | model <- aov(PropAnimal ~ TargetCoded, data = agg_items) 55 | summary(model) 56 | ``` 57 | These F1/F2 analyses are both crystal clear (reject the null!). But, in other cases, there can be ambiguity. For example, what if one is significant and the other is marginal? 58 | Ideally, we could have one test which allows us to control for random trial AND subject factors simultaneously. Enter `lmer()`... 59 | # Analysis 2: Simultaneous Trial and Subject Random Effects 60 | Aggregate data by Trials (Items) and Participants (i.e., one datapoint per participant per trial). 61 | ```{r} 62 | agg_sub_items <- data %>% 63 | group_by(ParticipantName,Trial,Target) %>% 64 | summarise(PropAnimal = mean(Animate)) %>% 65 | ungroup() 66 | agg_sub_items$TargetCoded <- ifelse(agg_sub_items$Target == 'Artefact', -.5, .5) 67 | ``` 68 | Fit a mixed-effects model allowing the intercept (represented by a "1") to vary by both Participants and Trials. By allowing the intercept to vary by subjects and items, we are allowing the model to estimate (and thus control for) each participant's mean tendency to look at the animal, and each trial's mean tendency to draw looks to the animal, regardless of condition. For example, Billy may just love animals while Sammy may hate animals. Importantly, we want to know whether the word they heard (represented here by TargetCoded) caused them to look more or less at the animal, above and beyond these baseline preferences. 69 | ```{r} 70 | model <- lmer(PropAnimal ~ TargetCoded + (1 | ParticipantName) + (1 | Trial), data = agg_sub_items) 71 | summary(model) 72 | ``` 73 | This looks good, and converges with our previous estimate of a significant effect. 74 | Note that, at the top of the summary output, we can see that there is very little variance in our random effect estimates. This means that the model is having trouble estimating them and is therefore keeping them all very near zero. 75 | This is likely the result of (a) small random differences between subjects and trials in this sample and (b) a relatively small dataset. We will want to continue to keep an eye on the variance of random effects -- when this is essentially 0, we may want to consider using a regular ANOVA. 76 | But in the meantime let's dive into this model a bit more. 77 | We can see the fixed effects: 78 | ```{r} 79 | fixef(model) 80 | ``` 81 | Here, because we centered the variable, `Intercept` is the overall mean looking to the animal. `TargetCoded` represents the difference in looking to the animal between our two conditions. If we subtract 1/2 of it from the intercept, we get our mean looking to the animal when the "Artefact" was named and, if we add 1/2 of it to the intercept, we get our mean looking to the animal when the "Animal" was named. 82 | We can also see random effects (some people use these to describe individual differences): 83 | ```{r} 84 | ranef(model) 85 | ``` 86 | Here they are nearly zero, which corresponds to our assessment of their variance earlier. 87 | We can also see what the model's predictions were, for each participant/item. Here we will get the mean prediction for each subject for each type of Target trial. 88 | ```{r} 89 | agg_sub_items$prediction <- predict(model, agg_sub_items) 90 | ggplot(agg_sub_items, aes(x=Target, y=prediction, color=ParticipantName)) + 91 | stat_summary(fun.y='mean', geom='point', position=position_jitter(.3)) 92 | ``` 93 | You can see that the model is making different predictions for each subject.
By allowing their intercepts to vary, the model is accounting for the fact that some kids just like looking at animals more than others. But it's being very conservative with its estimates because, frankly, we haven't given it much data to go on. 94 | Importantly, with these models, we aren't limited to varying just the intercept by specific grouping factors. We can also vary slopes (i.e., fixed effects) by participants. By adding "TargetCoded" as a random slope, we allow the model to vary the magnitude of the difference between Target conditions within participants. 95 | ```{r} 96 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items) 97 | summary(model) 98 | ``` 99 | We now see an additional random effect for the `ParticipantName` grouping: `TargetCoded`. We also see that it is actually varying between subjects, suggesting the model is becoming more confident in the differences between our subjects. 100 | Look at which participants responded more or less strongly to the Target manipulation. 101 | These random effects are centered with respect to the estimated fixed effect. Therefore, a negative effect represents a weaker effect of TargetCoded for that participant. This could be a cool way to look at individual differences -- who is responding most accurately to the words spoken? 102 | ```{r} 103 | ranef(model) 104 | subject_slopes <- ranef(model)$ParticipantName 105 | subject_slopes$subject <- factor(rownames(subject_slopes)) 106 | colnames(subject_slopes) <- c('Baseline','ConditionEffect','Subject') 107 | ggplot(subject_slopes, aes(x=Subject, y=ConditionEffect)) + 108 | geom_point() + 109 | geom_text(aes(label=Subject),hjust=0,vjust=0,size=3) + 110 | geom_hline(yintercept=0, linetype="dashed", alpha=.5) + 111 | theme(axis.text.x=element_blank()) 112 | ``` 113 | With this random slope included in the model, we see more variation now in our predictions by subjects. It's even picking up on the fact that participants consistently look to the animal and the real variance lies in whether they look away from the animal when the artefact is named. 114 | ```{r} 115 | agg_sub_items$prediction <- predict(model, agg_sub_items) 116 | ggplot(agg_sub_items, aes(x=Target, y=prediction, color=ParticipantName)) + 117 | stat_summary(fun.y='mean', geom='point', position=position_jitter(.3)) 118 | ``` 119 | # Two important questions: 120 | Although you've just played around with your first two mixed-effects models, you're likely already asking yourself two pressing questions... 121 | ## Question 1: Which random effect structure should we specify? 122 | If you are going to use mixed-effects models, you are going to require AT LEAST a random intercept for every natural "grouping" of data. However, beyond random intercepts, what random slopes should you include? 123 | ### Guidelines 124 | One way you can decide this is by adhering to guidelines. Unfortunately, you'll find guidelines pull you in opposing directions. 125 | For example, one school of thought (see Barr et al., 2013) is to "keep it maximal". That is, include every random slope that your experimental design permits (i.e., every factor that varied within subjects or within items). Another school of thought (see Bates et al., 2015) is to keep it parsimonious.
Don't overcomplicate your models with lots of random slopes, as this will make model estimates increasingly hard to reconcile with the data and risk overparameterizing your model: 126 | ```{r} 127 | # commented out because it kills knitr 128 | # model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded + Trial | ParticipantName) + (1 | Trial), data = agg_sub_items) 129 | ``` 130 | ### Model comparison 131 | A second way to decide is to think bottom-up from the data. Compare two models -- one with your random slope and another without your random slope -- and see if your random slope model is actually a better fit. 132 | ```{r} 133 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items) 134 | model_null <- lmer(PropAnimal ~ TargetCoded + (1 | ParticipantName) + (1 | Trial), data = agg_sub_items) 135 | anova(model,model_null) # -2 log-likelihood ratio test, gives you Chisq(df) = X and a p-value. 136 | ``` 137 | We'll talk more about model comparison -- a very powerful tool -- in just a minute. 138 | summary(model) 139 | summary(model) 140 | summary(model) 141 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items) 142 | model_null <- lmer(PropAnimal ~ 1 + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items) 143 | anova(model,model_null) 144 | log_model <- logLik(model) 145 | log_model 146 | log_null <- logLik(model_null) 147 | log_null 148 | exp(39) 149 | log_model 150 | log_null 151 | View(agg_sub_items) 152 | agg_sub_items$RandomFactor <- ifelse(rbinom(nrow(agg_sub_items),.5) == 1, 'On','Off') 153 | ?rbinom 154 | agg_sub_items$RandomFactor <- ifelse(rbinom(nrow(agg_sub_items),1,.5) == 1, 'On','Off') 155 | View(agg_sub_items) 156 | model1 <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F) 157 | model2 <- lmer(PropAnimal ~ RandomFactor + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F) 158 | anova(model1,model2) 159 | library(ggplot2) 160 | library(lme4) 161 | library(dplyr) 162 | data <- read.csv('data-eyetracking.csv') 163 | binned <- data %>% 164 | mutate(TimeFromSubphaseOnset = (TimeFromSubphaseOnset / 1000) %/% 50) %>% # re-scale, bin 165 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins 166 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>% 167 | ungroup() %>% 168 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit 169 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights 170 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt 171 | binned <- data %>% 172 | mutate(Bin = (TimeFromSubphaseOnset / 1000) %/% 50) %>% # re-scale, bin 173 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins 174 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>% 175 | ungroup() %>% 176 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit 177 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights 178 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt 179 | binned <- data %>% 180 | mutate(TimeS = TimeFromSubphaseOnset / 1000, 181 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin 182 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins 183 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>% 184 | ungroup() %>% 185 | mutate(elog = log( (y + .5) / (N - y
+ .5) ), # empirical logit 186 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights 187 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt 188 | View(binned) 189 | binned <- data %>% 190 | mutate(TimeS = TimeFromSubphaseOnset / 1000, 191 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin 192 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins 193 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>% 194 | #ungroup() %>% 195 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit 196 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights 197 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt 198 | View(binned) 199 | binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5) 200 | binned$TargetC <- scale(binned$TargetC, center=T, scale=F) 201 | model <- lmer(elog ~ TargetC*TimeS + (1 + TargetC + TimeS | ParticipantName), data = binned) 202 | summary(model) 203 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 204 | stat_summary(fun.y=mean, geom="point") + 205 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line") 206 | binned <- binned %>% 207 | mutate(TimeS_2 = TimeS^2, 208 | TimeS_3 = TimeS^3, 209 | TimeS_4 = TimeS^4) 210 | head(binned) 211 | plot(binned$TimeS, binned$TimeS_2) 212 | plot(binned$TimeS, binned$TimeS_3) 213 | plot(binned$TimeS, binned$TimeS_4) 214 | ggplot(binned, aes(x=TimeS, y=TimeS_2)) + 215 | geom_point() + 216 | geom_point(aes(y=TimeS_3)) + 217 | geom_point(aes(y=TimeS_4)) 218 | ggplot(binned, aes(x=TimeS, y=TimeS)) + 219 | geom_point() + 220 | geom_point(aes(y=TimeS_2), color='red') + 221 | geom_point(aes(y=TimeS_3), color='blue') + 222 | geom_point(aes(y=TimeS_4), color='green') 223 | model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3 + TimeS_4) + (1 + Target + TimeS + TimeS_2 + TimeS_3 + TimeS_4 | ParticipantName), data = binned) 224 | summary(model) 225 | ggplot(binned, aes(x=TimeS, y=elog)) + 226 | stat_summary(fun.y='mean', geom='line') 227 | ggplot(binned, aes(x=TimeS, y=elog)) + 228 | stat_smooth(method="loess") 229 | model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3) + (1 + Target + TimeS + TimeS_2 + TimeS_3 | ParticipantName), data = binned) 230 | summary(model) 231 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 232 | stat_summary(fun.y=mean, geom="point") + 233 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line") 234 | natural_model <- model 235 | cor(binned[, c('TimeS','TimeS_2','TimeS_3','TimeS_4')]) 236 | orthogonal_polynomials <- poly(sort(as.vector(unique(binned$TimeS))), 6) 237 | head(orthogonal_polynomials) 238 | ggplot(orthogonal_polynomials, aes(x=1, y=1)) + 239 | geom_point() + 240 | geom_point(aes(y=2), color='red') + 241 | geom_point(aes(y=3), color='blue') + 242 | geom_point(aes(y=4), color='green') 243 | ggplot(data.frame(orthogonal_polynomials), aes(x=1, y=1)) + 244 | geom_point() + 245 | geom_point(aes(y=2), color='red') + 246 | geom_point(aes(y=3), color='blue') + 247 | geom_point(aes(y=4), color='green') 248 | data.frame(orthogonal_polynomials) 249 | ggplot(data.frame(orthogonal_polynomials), aes(x=X1, y=X1)) + 250 | geom_point() + 251 | geom_point(aes(y=X2), color='red') + 252 | geom_point(aes(y=X3), color='blue') + 253 | geom_point(aes(y=X4), color='green') 254 | ggplot(data.frame(orthogonal_polynomials), aes(x=X1, y=X1)) + 255 | geom_point() + 256 | geom_point(aes(y=X2), color='red') + 257 | geom_point(aes(y=X3), color='blue') + 258 | geom_point(aes(y=X4), color='green') + 259 | 
geom_point(aes(y=X5), color='purple') + 260 | geom_point(aes(y=X6), color='yellow') 261 | cor(orthogonal_polynomials[, c(1:6)]) 262 | round(cor(orthogonal_polynomials[, c(1:6)]),5) 263 | time_codes <- data.frame( 264 | sort(as.vector(unique(binned$TimeS))), 265 | orthogonal_polynomials[, c(1:6)] 266 | ) 267 | colnames(time_codes) <- c('TimeS','ot1','ot2','ot3','ot4','ot5','ot6') 268 | binned <- merge(binned, time_codes, by='TimeS') 269 | model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned) 270 | summary(model) 271 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 272 | stat_summary(fun.y=mean, geom="point") + 273 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line") 274 | summary(natural_model) 275 | summary(model) 276 | drop1(model, ~., test="Chisq") 277 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned) 278 | summary(model_quartic) 279 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 280 | stat_summary(fun.y=mean, geom="point") + 281 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model 282 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model 283 | model <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 | ParticipantName), data = binned) 284 | anova(model, model_quartic) 285 | model_cubic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned) 286 | model_cubic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F) 287 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F) 288 | anova(model_cubic, model_quartic) 289 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 + ot5 | ParticipantName), data = binned, REML=F) 290 | model_quintic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4 + ot5) + (1 + TargetC + ot1 + ot2 + ot3 + ot4+ ot5 | ParticipantName), data = binned, REML=F) 291 | anova(model_quartic, model_quintic) 292 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 293 | stat_summary(fun.y=mean, geom="point") + 294 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed', color='black') + # 4th-order model 295 | stat_summary(aes(y=predict(model_quintic,binned,re.form=NA)), fun.y=mean, geom="line") # 5th-order model 296 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 297 | stat_summary(fun.y=mean, geom="point") + 298 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed', color='black') + # 3rd-order model 299 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model 300 | model <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 | ParticipantName), data = binned) 301 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 302 | stat_summary(fun.y=mean, geom="point") + 303 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed', color='black') + # 3rd-order model 304 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model 305 | 
ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 306 | stat_summary(fun.y=mean, geom="point") + 307 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line") + # 3rd-order model 308 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model 309 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 310 | stat_summary(fun.y=mean, geom="point") + 311 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model 312 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model 313 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 314 | stat_summary(fun.y=mean, geom="point") + 315 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 4th-order model 316 | stat_summary(aes(y=predict(model_quintic,binned,re.form=NA)), fun.y=mean, geom="line") # 5th-order model 317 | model_linear <- lmer(elog ~ Target*(ot1) + (1 + Target + ot1 | ParticipantName), data = binned) 318 | summary(model_linear) 319 | model_linear <- lmer(elog ~ TargetC*(ot1) + (1 + TargetC + ot1 | ParticipantName), data = binned) 320 | summary(model_linear) 321 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) + 322 | stat_summary(fun.y=mean, geom="point") + 323 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model 324 | stat_summary(aes(y=predict(model_linear,binned,re.form=NA)), fun.y=mean, geom="line") # 2nd-order model 325 | new_condition <- binned %>% 326 | filter(Target == 'Animal') %>% 327 | mutate(Target = 'Neutral', 328 | y = y + round(rnorm(length(y),0,2)), 329 | y = ifelse(y > N, N, y), 330 | y = ifelse(y < 1, 1, y), 331 | PropAnimal = y / N, 332 | elog = log(y / (N-y+.5)), 333 | ArcSin = asin(sqrt(PropAnimal))) 334 | new_condition <- binned %>% 335 | filter(Target == 'Animal') 336 | View(binned) 337 | new_condition <- binned 338 | new_condition <- binned %>% 339 | filter(Target == 'Animal') 340 | new_condition <- binned %>% 341 | filter(Target == 'Animal') 342 | new_condition <- data.frame(binned) %>% 343 | filter(Target == 'Animal') 344 | summary(binned) 345 | binned[Target == 'Animal'] 346 | binned[which(Target == 'Animal'), ] 347 | binned[which(binned$Target == 'Animal'), ] 348 | new_condition <- binned %>% 349 | filter(Target == 'Animal') 350 | new_condition <- binned %>% 351 | group_by(ParticipantName) %>% 352 | summarise(MeanTest=mean(PropAnimal)) 353 | View(new_condition) 354 | new_condition <- binned %>% 355 | filter(Target == 'Animal') 356 | colnames(binned) 357 | new_condition <- binned %>% 358 | filter(Target = 'Animal') 359 | filter(binned, Target == 'Animal') 360 | new_condition <- binned %>% 361 | ungroup() %>% 362 | filter(Target == 'Animal') 363 | new_condition <- binned %>% 364 | filter(TargetC == .5) 365 | new_condition <- binned %>% 366 | #filter(Target == 'Animal') %>% 367 | mutate(Target = 'Neutral', 368 | y = y + round(rnorm(length(y),0,2)), 369 | y = ifelse(y > N, N, y), 370 | y = ifelse(y < 1, 1, y), 371 | PropAnimal = y / N, 372 | elog = log(y / (N-y+.5)), 373 | ArcSin = asin(sqrt(PropAnimal))) 374 | summary(binned) 375 | detach("package:plyr",detach=T) 376 | detach("package:plyr",unload=T) 377 | binned <- data %>% 378 | mutate(TimeS = TimeFromSubphaseOnset / 1000, 379 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin 380 | group_by(ParticipantName,Target,Bin) %>% # aggregate 
within bins 381 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>% 382 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit 383 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights 384 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt %>% 385 | ungroup() 386 | binned <- data %>% 387 | mutate(TimeS = TimeFromSubphaseOnset / 1000, 388 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin 389 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins 390 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>% 391 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit 392 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights 393 | Arcsin = asin(sqrt(PropAnimal))) %>% # arcsin-sqrt 394 | ungroup() 395 | binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5) 396 | binned$TargetC <- scale(binned$TargetC, center=T, scale=F) 397 | model <- lmer(elog ~ TargetC*TimeS + (1 + TargetC + TimeS | ParticipantName), data = binned) 398 | summary(model) 399 | new_condition <- binned %>% 400 | filter(Target == 'Animal') %>% 401 | mutate(Target = 'Neutral', 402 | y = y + round(rnorm(length(y),0,2)), 403 | y = ifelse(y > N, N, y), 404 | y = ifelse(y < 1, 1, y), 405 | PropAnimal = y / N, 406 | elog = log(y / (N-y+.5)), 407 | ArcSin = asin(sqrt(PropAnimal))) 408 | new_condition <- binned %>% 409 | filter(Target == 'Animal') %>% 410 | mutate(Target = 'Neutral', 411 | y = y + round(rnorm(length(y),0,2)), 412 | y = ifelse(y > N, N, y), 413 | y = ifelse(y < 1, 1, y), 414 | PropAnimal = y / N, 415 | elog = log(y / (N-y+.5)), 416 | ArcSin = asin(sqrt(PropAnimal))) 417 | binned_3levels <- rbind(binned,new_condition) 418 | binned_3levels$Target <- factor(binned_3levels$Target) 419 | View(new_condition) 420 | View(binned) 421 | binned <- data %>% 422 | mutate(TimeS = TimeFromSubphaseOnset / 1000, 423 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin 424 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins 425 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>% 426 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit 427 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights 428 | Arcsin = asin(sqrt(PropAnimal))) %>% # arcsin-sqrt 429 | ungroup() 430 | binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5) 431 | binned$TargetC <- scale(binned$TargetC, center=T, scale=F) 432 | binned <- binned %>% 433 | mutate(TimeS_2 = TimeS^2, 434 | TimeS_3 = TimeS^3, 435 | TimeS_4 = TimeS^4) 436 | orthogonal_polynomials <- poly(sort(as.vector(unique(binned$TimeS))), 6) 437 | time_codes <- data.frame( 438 | sort(as.vector(unique(binned$TimeS))), 439 | orthogonal_polynomials[, c(1:6)] 440 | ) 441 | colnames(time_codes) <- c('TimeS','ot1','ot2','ot3','ot4','ot5','ot6') 442 | binned <- merge(binned, time_codes, by='TimeS') 443 | new_condition <- binned %>% 444 | filter(Target == 'Animal') %>% 445 | mutate(Target = 'Neutral', 446 | y = y + round(rnorm(length(y),0,2)), 447 | y = ifelse(y > N, N, y), 448 | y = ifelse(y < 1, 1, y), 449 | PropAnimal = y / N, 450 | elog = log(y / (N-y+.5)), 451 | ArcSin = asin(sqrt(PropAnimal))) 452 | summary(binned) 453 | data.frame(as.matrix(binned)) 454 | binned2 <- data.frame(as.matrix(binned)) 455 | View(binned2) 456 | new_condition <- binned2 %>% 457 | filter(Target == 'Animal') %>% 458 | mutate(Target = 'Neutral', 459 | y = y + round(rnorm(length(y),0,2)), 460 | y = ifelse(y 
> N, N, y), 461 | y = ifelse(y < 1, 1, y), 462 | PropAnimal = y / N, 463 | elog = log(y / (N-y+.5)), 464 | ArcSin = asin(sqrt(PropAnimal))) 465 | View(new_condition) 466 | new_condition <- binned2 %>% 467 | filter(Target == 'Animal') %>% 468 | mutate(Target = 'Neutral', 469 | y = y + round(rnorm(length(y),0,2))) 470 | summary(binned2) 471 | binned2 <- as.matrix(binned) 472 | new_condition <- binned2 %>% 473 | filter(Target == 'Animal') 474 | View(binned) 475 | new_condition <- binned[which(binned$Target == 'Animal'), ] 476 | new_condition$Target <- 'Neutral' 477 | #new_condition$y <- new_condition$y - round(new_condition$N / 3) 478 | new_condition$y <- new_condition$y + round(rnorm(length(new_condition$y),0,2)) 479 | new_condition$y <- ifelse(new_condition$y > new_condition$N,new_condition$N,new_condition$y) 480 | new_condition[which(new_condition$y < 1), 'y'] <- 1 481 | new_condition$PropAnimal <- new_condition$y / new_condition$N 482 | new_condition$elog <- log( (new_condition$y) / (new_condition$N - new_condition$y + .5) ) 483 | new_condition$wts <- 1/(new_condition$y + .5) + 1/(new_condition$N - new_condition$y + .5) 484 | new_condition$Arcsin <- asin(sqrt(new_condition$PropAnimal)) 485 | binned_3levels <- rbind(binned,new_condition) 486 | binned_3levels$Target <- factor(binned_3levels$Target) 487 | ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) + 488 | stat_summary(fun.y=mean, geom="point") 489 | new_condition <- binned[which(binned$Target == 'Animal'), ] 490 | new_condition$Target <- 'Neutral' 491 | #new_condition$y <- new_condition$y - round(new_condition$N / 3) 492 | new_condition$y <- new_condition$y + round(rnorm(length(new_condition$y),-.5,2)) 493 | new_condition$y <- ifelse(new_condition$y > new_condition$N,new_condition$N,new_condition$y) 494 | new_condition[which(new_condition$y < 1), 'y'] <- 1 495 | new_condition$PropAnimal <- new_condition$y / new_condition$N 496 | new_condition$elog <- log( (new_condition$y) / (new_condition$N - new_condition$y + .5) ) 497 | new_condition$wts <- 1/(new_condition$y + .5) + 1/(new_condition$N - new_condition$y + .5) 498 | new_condition$Arcsin <- asin(sqrt(new_condition$PropAnimal)) 499 | binned_3levels <- rbind(binned,new_condition) 500 | binned_3levels$Target <- factor(binned_3levels$Target) 501 | ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) + 502 | stat_summary(fun.y=mean, geom="point") 503 | model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels) 504 | summary(model) 505 | ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) + 506 | stat_summary(fun.y=mean, geom="point") + 507 | stat_summary(aes(y=predict(model,binned_3levels,re.form=NA)), fun.y=mean, geom="line") 508 | summary(binned_3levels$Target) 509 | levels(binned_3levels$Target) 510 | binned_3levels$Target <- factor(binned_3levels$Target, levels=c('Neutral','Animal','Artefact')) 511 | model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels) 512 | summary(model) 513 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # R Workshop on Using Linear Models, Logistic Regression, and Growth Curve Analyses to Analyze Eye-tracking Data 2 | 3 | A 4-part series culminating in using growth-curve analyses to model eye-tracking data. 4 | 5 | ## Introduction to R 6 | What are dataframes and vectors? 
How do R functions work? How do statistical tests in R work? How can I import and export data? 7 | 8 | ## General Linear Models 9 | How can I fit linear models in R? When should I use aov() and when should I use lm()? How can I interpret parameter estimates (without the help of SPSS...)? 10 | 11 | ## Generalized Linear Models 12 | How can I use generalized linear models (e.g., logistic regression) to do time-based eye-tracking analyses? How can I use empirical logit regression to the same end? And the arcsin-root transformation? How do mixed-effects models' random effects (intercepts and slopes) work in lmer()? 13 | 14 | ## Growth Curve Analyses 15 | How do I look at non-linear change over time? What are the differences between natural and orthogonal polynomials? How can I interpret estimates in a growth curve model versus an empirical logit model? How can I visualize my raw data and model fits simultaneously? 16 | 17 | ### Acknowledgments 18 | 19 | * Dan Mirman for GCA techniques 20 | * Dale Barr for empirical logit regression 21 | * Florian Jaeger for mixed-effects models 22 | * Mike Frank and the Wordbank team for vocabulary data in the first two tutorials 23 | -------------------------------------------------------------------------------- /data-vocab-females.csv: -------------------------------------------------------------------------------- 1 | "","age","median" 2 | "1",16,50.5 3 | "3",17,75 4 | "5",18,121 5 | "7",19,148 6 | "9",20,221 7 | "11",21,213 8 | "13",22,309 9 | "15",23,379 10 | "17",24,350 11 | "19",25,435 12 | "21",26,471 13 | "23",27,531.5 14 | "25",28,513 15 | "27",29,579 16 | "29",30,575 17 | -------------------------------------------------------------------------------- /data-vocab.csv: -------------------------------------------------------------------------------- 1 | "","age","gender","median","ci.l","ci.h","n" 2 | "1",16,"F",50.5,8.5,17.5,65 3 | "2",16,"M",31,16,40,59 4 | "3",17,"F",75,31.05,37,35 5 | "4",17,"M",56.5,14.5,25.0874999999999,36 6 | "5",18,"F",121,50,48.5999999999999,34 7 | "6",18,"M",51,15,40.5999999999999,47 8 | "7",19,"F",148,54.5,71,38 9 | "8",19,"M",139,62,52,35 10 | "9",20,"F",221,76,40,45 11 | "10",20,"M",117,33,73.5,48 12 | "11",21,"F",213,73,114,31 13 | "12",21,"M",182,70,103,39 14 | "13",22,"F",309,82.2125,73.5,40 15 | "14",22,"M",190,53.625,104,32 16 | "15",23,"F",379,71,77,41 17 | "16",23,"M",314,69,32,41 18 | "17",24,"F",350,63,47,59 19 | "18",24,"M",270,52.1375,82,48 20 | "19",25,"F",435,53,44,45 21 | "20",25,"M",363,89,27,31 22 | "21",26,"F",471,101,70,39 23 | "22",26,"M",420,116,48,39 24 | "23",27,"F",531.5,66.5,36.5,38 25 | "24",27,"M",416,85,75,43 26 | "25",28,"F",513,45,30,29 27 | "26",28,"M",398.5,78.0875,77.0125,30 28 | "27",29,"F",579,84,49,28 29 | "28",29,"M",444,80,39,31 30 | "29",30,"F",575,34,42,35 31 | "30",30,"M",540,60,48,31 32 | -------------------------------------------------------------------------------- /independent problems/tutorial2-independent.R: -------------------------------------------------------------------------------- 1 | vocab_inds <- read.csv('data-vocab-individuals.csv') 2 | vocab_inds$ageC <- scale(vocab_inds$age, center=T,scale=F) 3 | vocab_inds$birth_orderC <- scale(vocab_inds$birth_order, center=T, scale=F) 4 | vocab_inds$mom_edC <- scale(vocab_inds$mom_ed, center=T, scale=F) 5 | vocab_inds$genderContrast <- -.5 6 | vocab_inds[which(vocab_inds$gender == 'M'), 'genderContrast'] <- .5 7 | 8 | model <- lm(productive ~ ageC + genderContrast + birth_orderC + mom_edC, data = vocab_inds) 9 | summary(model)
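# (hypothetical addition, not in the original script: because the predictors are
# centered / contrast-coded, each estimate in summary(model) is interpretable at the
# average of the other predictors; confint() turns those point estimates into 95% CIs)
confint(model)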
10 | 11 | library(ggplot2) 12 | 13 | # visualize gender and age effects 14 | ggplot(vocab_inds, aes(x=age, y=productive, color=gender)) + 15 | geom_point() + 16 | stat_smooth(method="lm") + 17 | scale_x_continuous(name="Age (Months)") + 18 | scale_y_continuous(name="Vocabulary") 19 | 20 | # visualize birth order effects 21 | # ugly median split: 22 | vocab_inds$birth_orderSplit <- 'Early' 23 | vocab_inds[which(vocab_inds$birth_order > median(vocab_inds$birth_order, na.rm=T)), 'birth_orderSplit'] <- 'Late' 24 | ggplot(vocab_inds, aes(x=age, y=productive, color=birth_orderSplit)) + 25 | geom_point() + 26 | stat_smooth(method="lm") + 27 | scale_x_continuous(name="Age (Months)") + 28 | scale_y_continuous(name="Vocabulary") 29 | 30 | # right now, our model's intercept (292.37) represents the estimated productive vocabulary 31 | # for a genderless baby at 22.91 months born as the 1.59th child to a mom with 14.98 years 32 | # of education 33 | mean(vocab_inds$age) 34 | mean(vocab_inds$birth_order, na.rm=T) 35 | mean(vocab_inds$mom_ed, na.rm=T) 36 | 37 | # this is neither the overall mean, nor the mean of any particular cell 38 | mean(vocab_inds$productive, na.rm=T) 39 | # 292.64 40 | 41 | # but if we center the genderContrast variable, it should be the overall mean 42 | vocab_inds$genderContrastC <- scale(vocab_inds$genderContrast, center=T, scale=F) 43 | model <- lm(productive ~ ageC + genderContrastC + birth_orderC + mom_edC, data = vocab_inds) 44 | summary(model) 45 | 46 | # replicates our model 47 | drop1(model,~.,test="F") 48 | 49 | # visualize our model's predictions for our important factors 50 | vocab_inds$predicted <- predict(model, vocab_inds) 51 | 52 | vocab_inds$group <- paste(vocab_inds$gender, vocab_inds$birth_orderSplit, sep=' / ') 53 | vocab_inds$group <- factor(vocab_inds$group) 54 | 55 | ggplot(vocab_inds, aes(x=age, y=predicted, color=group)) + 56 | geom_point() + 57 | stat_smooth(method="lm", se=F) + 58 | scale_x_continuous(name="Age (Months)") + 59 | scale_y_continuous(name="Predicted Vocabulary") 60 | 61 | # fit a crazy interaction model 62 | crazy_model <- lm(productive ~ (ageC + genderContrastC + birth_orderC + mom_edC)^3, data = vocab_inds) 63 | summary(crazy_model) 64 | 65 | vocab_inds$predicted <- predict(crazy_model, vocab_inds) 66 | 67 | ggplot(vocab_inds, aes(x=age, y=predicted, color=group)) + 68 | geom_point() + 69 | stat_smooth(method="lm", se=F) + 70 | scale_x_continuous(name="Age (Months)") + 71 | scale_y_continuous(name="Predicted Vocabulary") 72 | 73 | # is the crazy model worth keeping? 74 | anova(crazy_model, model, test="F") 75 | -------------------------------------------------------------------------------- /independent problems/tutorial3-independent.R: -------------------------------------------------------------------------------- 1 | # tutorial 3 - independent work 2 | 3 | library(ggplot2) 4 | library(lme4) 5 | library(plyr) 6 | library(ez) 7 | 8 | # load our data 9 | data <- read.csv('data-eyetracking.csv') 10 | 11 | data$TargetCoded <- -.5 12 | data[which(data$Target == 'Artefact'), 'TargetCoded'] <- .5 13 | 14 | data$SexCoded <- -.5 15 | data[which(data$Sex == 'F'), 'SexCoded'] <- .5 16 | 17 | data$TimeS <- data$TimeFromSubphaseOnset / 1000; 18 | 19 | model <- glmer(Animate ~ (TargetCoded + SexCoded + TrialNumber + Age + TimeS)^2 + (1 + TargetCoded + TimeS + TrialNumber | ParticipantName) + (1 | Trial), family="binomial", data=data) 20 | # takes forever to fit...
I stopped after 10 minutes 21 | 22 | data$Bin <- data$TimeFromSubphaseOnset %/% 50 23 | 24 | binned <- ddply(data, .(ParticipantName,Age,TargetCoded,SexCoded,Bin), summarize, PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) 25 | 26 | # calculate the empirical logit 27 | binned$elog <- log( (binned$y + .5) / (binned$N - binned$y + .5) ) 28 | binned$wts <- 1/(binned$y + .5) + 1/(binned$N - binned$y + .5) 29 | 30 | model <- lmer(elog ~ (TargetCoded + SexCoded + Age + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned, weights=1/wts) 31 | summary(model) 32 | # weird, no main effect of TargetCoded anymore... 33 | # why? 34 | # because Age isn't centered, and so it's estimating main effects at Age 0 (newborn...) 35 | 36 | model <- lmer(elog ~ (TargetCoded + SexCoded + scale(Age,center=T,scale=F) + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned, weights=1/wts) 37 | summary(model) 38 | # fixes it... 39 | 40 | # let's center Age for good, now 41 | binned$AgeC <- scale(binned$Age, scale=F, center=T) 42 | 43 | binned$Arcsin <- asin(sqrt(binned$PropAnimal)) 44 | 45 | model <- lmer(Arcsin ~ (TargetCoded + SexCoded + AgeC + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned) 46 | summary(model) 47 | 48 | drop1(model,~.,test="Chisq") 49 | 50 | # visualize a model's predictions 51 | # note I use PropAnimal here for interpretable DV's 52 | model <- lmer(PropAnimal ~ (TargetCoded + AgeC + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned) 53 | summary(model) 54 | 55 | predictions <- data.frame( 56 | rep(as.vector(unique(binned$TimeS)),times=4), 57 | rep(c(-1,+1),each=(length(unique(binned$TimeS))*2)), # Age, centered (younger/older) 58 | rep(c(-.5,.5,-.5,.5),each=length(unique(binned$TimeS))), 59 | rep(c('Animal','Artefact','Animal','Artefact'),each=length(unique(binned$TimeS))) 60 | ) 61 | colnames(predictions) <- c('TimeS','AgeC','TargetCoded','Target') 62 | 63 | predictions$prediction <- predict(model, re.form=NA, predictions) 64 | # note: re.form = NA sets all random effects to 0 so that 65 | # it works on new data!! 66 | 67 | ggplot(predictions, aes(x=TimeS, y=prediction, color=Target)) + 68 | geom_point() + 69 | geom_line() + 70 | facet_grid(.~AgeC) 71 | 72 | # this is how a linear-only model reconstructs our data... 73 | # could be better... 74 | 75 | # what happens when we mess with the Age column? 76 | 77 | binned$AgeC <- factor(binned$AgeC) 78 | 79 | model <- lmer(Arcsin ~ (TargetCoded + SexCoded + AgeC + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned) 80 | summary(model) 81 | 82 | binned$AgeC <- as.numeric(as.character(binned$AgeC)) 83 | 84 | model <- lmer(Arcsin ~ (TargetCoded + SexCoded + AgeC + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned) 85 | summary(model) 86 | # better... 87 | 88 | -------------------------------------------------------------------------------- /old/README.md: -------------------------------------------------------------------------------- 1 | tutorial-GCA-R 2 | ============== 3 | 4 | This tutorial uses simulated data to compare mixed-effects growth curve analyses (GCAs) to 5 | traditional ANOVA and natural polynomial mixed-effects modelling approaches. 6 | 7 | Everything I know about GCA, I learned from Dan Mirman (http://www.danmirman.org/gca) 8 | and by playing around in R. Here's a good Mirman paper on the topic: 9 | 10 | Mirman, D., Dixon, J., & Magnuson, J. S. (2008). 
Statistical and computational models of the 11 | visual world paradigm: Growth curves and individual differences. Journal of Memory and 12 | Language, 59, 474–494. 13 | 14 | Requires: install.packages(c('lme4','ggplot2','plyr','ez')) 15 | 16 | @author: Brock Ferguson 17 | @website: brockferguson.com 18 | 19 | @modified June 29, 2014 20 | -------------------------------------------------------------------------------- /old/gca-tutorial.R: -------------------------------------------------------------------------------- 1 | # 2 | # Tutorial: Growth Curve Analyses in R 3 | # 4 | # This tutorial uses simulated data to compare mixed-effects growth curve analyses (GCAs) to 5 | # traditional ANOVA and natural polynomial mixed-effects modelling approaches. 6 | # 7 | # Everything I know about GCA, I learned from Dan Mirman (http://www.danmirman.org/gca) 8 | # and by playing around in R. Here's a good Mirman paper on the topic: 9 | # 10 | # Mirman, D., Dixon, J., & Magnuson, J. S. (2008). Statistical and computational models of the 11 | # visual world paradigm: Growth curves and individual differences. Journal of Memory and 12 | # Language, 59, 474–494. 13 | # 14 | # Requires: install.packages(c('lme4','ggplot2','plyr','ez')) 15 | # 16 | # @author: Brock Ferguson 17 | # @website: brockferguson.com 18 | # 19 | # @modified June 29, 2014 20 | # 21 | 22 | ##################################################### 23 | # (1) Simulate Data 24 | # 25 | # Generate a random dataset for an experiment that has 2 conditions (Practice and Control). 26 | # Each participant is assessed for their "accuracy" at 10 timepoints. In the Practice condition, 27 | # they rapidly become more accurate but then taper off around ~63%. In the Control condition, they get better 28 | # at a constant rate, and also max out around ~63%. Thus the groups do not differ in their 29 | # linear rate of improvement, maximum improvement, or starting point (~50%), but in their 30 | # growth trajectory. This is an ideal problem for Growth Curve Analyses, because we are interested 31 | # in the "shape" or functional form of their improvement over time, and we have multiple measurements 32 | # from the same individuals (i.e., it's longitudinal, not cross-sectional). 33 | ##################################################### 34 | 35 | # set parameters for Subjects and Timepoints 36 | # these are configured at 240/10 by default, and things may look different 37 | # if you change them 38 | num_subjects <- 240 39 | num_timepoints <- 10 40 | 41 | # create dataframe and begin populating it 42 | data <- data.frame(matrix(nrow=(num_subjects*num_timepoints), ncol=4)) 43 | colnames(data) <- c('Subject','Condition','Time','Accuracy') 44 | data$Subject <- paste('SUB',rep(1:num_subjects, each=num_timepoints),sep="") 45 | data$Condition <- rep(c('Control','Practice'), each=((num_subjects*num_timepoints)/2)) 46 | data$Time <- rep(1:num_timepoints,times=num_subjects) 47 | 48 | # generate data for each participant...
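# (aside -- a small hypothetical illustration, not part of the original script, of the
# two rep() patterns used to build the scaffold above:)
# rep(1:2, each = 3)  returns 1 1 1 2 2 2  -> subject IDs repeat over their timepoints
# rep(1:3, times = 2) returns 1 2 3 1 2 3  -> timepoints cycle within each subject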
49 | 50 | # control condition gets slightly better over time 51 | data[which(data$Condition == 'Control'), 'Accuracy'] <- seq(.5,.64,by=(.15/num_timepoints)) 52 | 53 | # but the Practice condition improves logarithmically 54 | data[which(data$Condition == 'Practice'), 'Accuracy'] <- .5 + log(1 + ((data[which(data$Condition == 'Practice'), 'Time'] - 1)*.55), 450000) 55 | 56 | # and make sure our data has the proper column formats 57 | data$Subject <- factor(data$Subject) 58 | data$Condition <- factor(data$Condition) 59 | 60 | ##################################################### 61 | # (2) Visualization and summaries 62 | ##################################################### 63 | 64 | # first, let's take a look at our "perfect" data (before we add some random noise) 65 | 66 | library(ggplot2) 67 | 68 | # this is what we want our models to eventually discover in the noise... 69 | ggplot(data=data, aes(x=Time, y=Accuracy, group=Condition, colour=Condition)) + geom_line() + geom_point() 70 | 71 | # add said noise 72 | data$Accuracy <- data$Accuracy + rnorm(num_subjects*num_timepoints,0,.05) 73 | data[which(data$Accuracy > 1), 'Accuracy'] <- 1 74 | data[which(data$Accuracy < 0), 'Accuracy'] <- 0 75 | 76 | # visualize with noise 77 | ggplot(data=data, aes(x=Time, y=Accuracy, group=Condition, colour=Condition)) + geom_point() 78 | 79 | # take some summary statistics 80 | 81 | library(plyr) 82 | 83 | # just grab total accuracy by condition 84 | summary <- ddply(data, .(Condition), summarize, MeanAccuracy = mean(Accuracy), sd = sd(Accuracy), N = length(Accuracy)) 85 | summary$se <- summary$sd / sqrt(summary$N) 86 | 87 | # hold up though, those N's are way too big. collapse by subjects first. 88 | subjects <- ddply(data, .(Subject,Condition), summarize, SubjectAccuracy = mean(Accuracy)) 89 | 90 | summary <- ddply(subjects, .(Condition), summarize, MeanAccuracy = mean(SubjectAccuracy), sd = sd(SubjectAccuracy), N = length(SubjectAccuracy)) 91 | summary$se <- summary$sd / sqrt(summary$N) 92 | 93 | # effects are looking strong 94 | 95 | # now break down by timepoint... 96 | summary_time <- ddply(data, .(Condition,Time), summarize, MeanAccuracy = mean(Accuracy), sd = sd(Accuracy), N = length(Accuracy)) 97 | summary_time$se <- summary_time$sd / sqrt(summary_time$N) 98 | 99 | ##################################################### 100 | # (3) Standard ANOVA Analysis 101 | ##################################################### 102 | 103 | library(ez) 104 | 105 | # ANOVA (Type-III) predicting Accuracy from Condition (between) and Time (within, nested under Subject) 106 | model <- ezANOVA(data, type = 3, dv = 'Accuracy', wid = Subject, between = .(Condition), within = .(Time)) 107 | model 108 | # effects: 109 | # Condition, p < .001 110 | # Time, p < .001 111 | # Interaction, ??? 112 | # most of the time, we won't see an interaction here (depends on your random data) 113 | # - if it's reliable, we would conclude that Practice participants improved more quickly over time (wrong) 114 | # - if it's insignificant, we would conclude that they improved at the same rate (wrong) 115 | # this type of analysis just can't capture the form of the growth, and will lead us in wrong directions. 116 | 117 | ##################################################### 118 | # (4) Hierarchical Linear Regression 119 | ##################################################### 120 | 121 | library(lme4) 122 | 123 | # let's just replicate our ANOVA analysis first...
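# (added note: in lmer's formula syntax, "(1 | Subject)" fits a by-subject random
# intercept -- each subject gets their own baseline Accuracy -- while the Time and
# Condition fixed effects are estimated across all subjects)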
124 | model <- lmer(Accuracy ~ Time*Condition + (1 | Subject), data = data) 125 | summary(model) 126 | # weird! now our main effect is less reliable... 127 | # what happened? 128 | # ezANOVA centered our predictors automatically, and so the main effect in the ANOVA 129 | # was captured at t==0 or, theoretically speaking, timepoint 5.5 130 | 131 | # center variables and try again... 132 | data$TimeC <- data$Time - mean(data$Time) 133 | 134 | model <- lmer(Accuracy ~ TimeC*Condition + (1 | Subject), data = data) 135 | summary(model) 136 | # better, but still suffering from the same problems as the ANOVA because we are only including 137 | # a term that can inherently only capture linear growth (i.e., linear time) 138 | 139 | ##################################################### 140 | # (5) Growth Curve Analysis 141 | ##################################################### 142 | 143 | # first, let's review what we've been doing in the previous 2 analyses... 144 | # essentially: correlating a perfectly linear variable (TimeC) with our DV (Accuracy) 145 | # and seeing if these correlations differed by an IV (Condition) 146 | # this would isolate differences in linear growth (i.e., slope) 147 | 148 | # let's re-create that manually... (of course, breaking all DF rules) 149 | practice <- data[which(data$Condition == 'Practice'), ] 150 | control <- data[which(data$Condition == 'Control'), ] 151 | 152 | cor.test(control$TimeC, control$Accuracy) 153 | # r ~ .67 154 | 155 | cor.test(practice$TimeC, practice$Accuracy) 156 | # r ~ .62 157 | 158 | # the linear models we used compared these slopes and (probably) told us they were not different 159 | # there, slopes == linear coefficients 160 | # (note: the main effect of condition in prior analyses is because the Time slope in the practice condition 161 | # is lower than the slope in the control condition... when extrapolating to timepoint=0, 162 | # this makes the Practice condition slightly higher) 163 | 164 | # in GCA, we do the same basic thing except we estimate linear and non-linear coefficients 165 | 166 | # let's continue to do this manually just to begin 167 | # TimeC is linear, let's create a quadratic vector (we know our data is quadratic) 168 | data$TimeQuad <- data$Time^2 169 | data$TimeQuadC <- scale(data$TimeQuad, center = T, scale = F) 170 | practice <- data[which(data$Condition == 'Practice'), ] 171 | control <- data[which(data$Condition == 'Control'), ] 172 | 173 | # take a look... 174 | data[1:10, 'Time'] 175 | data[1:10, 'TimeQuad'] 176 | 177 | cor.test(control$TimeQuadC, control$Accuracy) 178 | # r ~ .66 179 | 180 | cor.test(practice$TimeQuadC, practice$Accuracy) 181 | # r ~ .58 182 | 183 | # these coefficients are very similar to the linear TimeC 184 | # is the data linear or quadratic? or both? and do they differ by condition? 185 | # in order to answer these questions, we can fit them simultaneously to see which one 186 | # is the best predictor of Accuracy (as we would with any 2 variables of interest) 187 | 188 | model <- lmer(Accuracy ~ Condition + TimeC + TimeQuadC + TimeC:Condition + TimeQuadC:Condition + (1 | Subject), data = data) 189 | summary(model) 190 | # we should now see lots of significant effects, giving us some hint of form. 
191 | # however, this look at form -- using natural polynomials -- is not ideal 192 | # because natural polynomials correlate with one another and, just as with any 193 | # 2 variables, when there is significant correlation between two IV's, our ability to 194 | # attribute variance to them is hindered. 195 | 196 | # to demonstrate: TimeC and TimeQuadC are strongly correlated. 197 | cor.test(data[1:10, 'TimeC'], data[1:10, 'TimeQuadC']) 198 | # r ~ .97 199 | 200 | # so when our linear model is attempting to parcel variance, it has no idea where it goes. 201 | # solution: replace our time codes with *orthogonal* polynomial contrast codes 202 | # these are, by definition, uncorrelated vectors of length N which perfectly fit linear, 203 | # quadratic, cubic, quartic, etc. functions 204 | 205 | # let's confirm what I said by generating 3: linear, quadratic, and cubic 206 | polycodes <- poly(1:10, 3) 207 | 208 | # linear code 209 | polycodes[, 1] 210 | 211 | # quadratic code 212 | polycodes[, 2] 213 | 214 | # cubic code 215 | polycodes[, 3] 216 | 217 | # visualize these polycodes... 218 | poly_df <- data.frame(matrix(nrow=30, ncol=3)) 219 | poly_df[, 1] <- paste('Poly', rep(1:3, each=10), sep='') 220 | poly_df[, 2] <- rep(1:10, times=3) 221 | poly_df[1:10, 3] <- polycodes[, 1] 222 | poly_df[11:20, 3] <- polycodes[, 2] 223 | poly_df[21:30, 3] <- polycodes[, 3] 224 | colnames(poly_df) <- c('Code','Timepoint','Value') 225 | 226 | ggplot(data=poly_df, aes(x=Timepoint, y=Value, group=Code, colour=Code)) + geom_line() + geom_point() 227 | 228 | # they better not be correlated... or we're going to run into the same problem. 229 | cor(polycodes) 230 | round(cor(polycodes), 2) 231 | # not at all, nice! 232 | 233 | # the linear code correlates 100% with our previous linear time code 234 | cor.test(polycodes[, 1], data[1:10, 'TimeC']) 235 | 236 | # the quadratic code correlates only partially with our old squared-time code, because 237 | # the orthogonal quadratic has the linear component removed, while raw Time^2 does not 238 | cor.test(polycodes[, 2], data[1:10, 'TimeQuadC']) 239 | 240 | # but because they don't correlate with each other, we can substitute them into a model 241 | # and see how they predict our DV 242 | # we will also simultaneously compare the estimated coefficients for each Condition (i.e., the 243 | # interactions) to see if they differ 244 | 245 | # add them into our dataset 246 | data$ot1 <- rep(polycodes[, 1],times=num_subjects) 247 | data$ot2 <- rep(polycodes[, 2],times=num_subjects) 248 | data$ot3 <- rep(polycodes[, 3],times=num_subjects) 249 | 250 | # now fit a GCA model with these new orthogonal polynomials 251 | model <- lmer(Accuracy ~ ot1*ot2*Condition + (1 | Subject), data = data) 252 | summary(model) 253 | 254 | # this is good! but we need to add a couple of things: 255 | # to this point, all we have done is vary the intercept by subject. 256 | # when we are looking at a within-subjects variable (time, or our orthogonal timecodes) 257 | # we should allow the model to vary those by subject too 258 | 259 | # fit another model...
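# (added note: in the model below, "(1 + ot1 + ot2 | Subject)" gives each subject their
# own intercept *and* their own linear and quadratic time coefficients -- i.e., random
# slopes for the within-subject time terms)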
260 | model <- lmer(Accuracy ~ ot1*ot2*Condition + (1 + ot1 + ot2 | Subject), data = data) 261 | summary(model) 262 | 263 | # in our world of perfect data, it doesn't make much difference, but it's good practice 264 | 265 | # visualize our raw data with our model fit on top 266 | ggplot(data=data, aes(x=Time, y=Accuracy, group=Condition, colour=Condition)) + geom_point() + stat_summary(aes(y=fitted(model)), fun.y=mean, geom="line") 267 | # we recovered the original form pretty well! 268 | 269 | # what happens when we add in the cubic parameter? 270 | model <- lmer(Accuracy ~ ot1*ot2*ot3*Condition + (1 + ot1 + ot2 + ot3 | Subject), data = data) 271 | summary(model) 272 | 273 | # pretty messy... 274 | # we can use model comparison to tell us if this parameter is justified 275 | 276 | model_null <- lmer(Accuracy ~ ot1*ot2*Condition + (1 + ot1 + ot2 + ot3 | Subject), data = data) 277 | anova(model,model_null) 278 | # p ~ .78... nope 279 | 280 | -------------------------------------------------------------------------------- /probabilities-refresher.R: -------------------------------------------------------------------------------- 1 | # Let's say we start with a probability of .8 2 | prob <- .8 3 | 4 | # we can calculate the odds that we get a "hit" 5 | # .8 / .2 6 | odds <- prob / (1-prob) 7 | 8 | # we can then calculate the log-odds, used in logistic regression 9 | log_odds <- log(odds) 10 | log_odds <- log(prob / (1-prob)) 11 | 12 | # and if we want to convert the log-odds back to probability 13 | prob2 <- exp(log_odds) / (1+exp(log_odds)) 14 | 15 | # empirical logit is just like log-odds except we add a constant 16 | # to keep the log-odds from going to -Inf (or +Inf) 17 | hits <- 8 18 | total <- 10 19 | empirical_logit <- log( (hits + .5) / (total - hits + .5) ) 20 | 21 | # compare probabilities to log-odds 22 | probs <- seq(.1,.9,by=.1) 23 | probs_logodds <- log(probs / (1-probs)) 24 | probs_arcsin <- asin(sqrt(probs)) 25 | 26 | # both Arcsin and log-odds do a nice job of extending the tails of the probability 27 | # distribution 28 | plot(probs,probs_logodds) 29 | plot(probs,probs_arcsin) -------------------------------------------------------------------------------- /tutorial1-introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 1: Introduction to R" 3 | author: "Brock Ferguson" 4 | date: "July 1, 2015" 5 | output: 6 | html_document: 7 | toc: true 8 | theme: readable 9 | --- 10 | 11 | # Dependencies 12 | 13 | Install the packages we'll be using today, and load ggplot -- which we'll use in this first tutorial. 14 | 15 | ```{r quiet=T} 16 | # commented out because I already have these 17 | # install.packages(c('reshape2','dplyr','ggplot2','lme4','lmerTest')) 18 | 19 | library(ggplot2) 20 | ``` 21 | 22 | # Working with dataframes 23 | 24 | We are going to use a dataset from Wordbank (http://wordbank.stanford.edu/) for these 25 | early tutorials. This dataset tracks boys' and girls' productive vocabularies 26 | through toddlerhood. To make things easier, I have aggregated the dataset across 27 | subjects to yield a median vocabulary size by gender and age, with confidence intervals. 28 | 29 | ```{r} 30 | vocab <- read.csv('data-vocab.csv') 31 | ``` 32 | 33 | This dataset is loaded as a "dataframe" -- a series of columns. 34 | You can see this dataframe now in our "Environment" window if you are using RStudio. 35 | 36 | Each column in a dataframe can be of a unique type.
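For example, we can ask for each column's type directly (a quick check that isn't in the original tutorial; with the `read.csv()` defaults of this era, `gender` comes in as a factor and the other columns as numeric or integer):

```{r}
sapply(vocab, class)
```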
37 | 
38 | Let's see what types of columns we have:
39 | 
40 | ```{r}
41 | summary(vocab)
42 | ```
43 | 
44 | We don't need the X column (it just indicates the row number) so we will get rid of it
45 | by selecting the columns of the dataframe using a vector representing the range 2 to 7.
46 | 
47 | ```{r}
48 | vocab <- vocab[, c(2:7)]
49 | summary(vocab)
50 | ```
51 | 
52 | We can select one column and assign it to a new vector variable.
53 | 
54 | ```{r}
55 | medians <- vocab[, 'median']
56 | ```
57 | 
58 | This vector is now listed in the Environment as well -- under "values".
59 | 
60 | We can also select rows by number. Because we haven't assigned their values to a new variable, they will just be printed to the screen.
61 | 
62 | ```{r}
63 | vocab[c(1:5), ]
64 | vocab[c(1,2,3,4,5), ]
65 | ```
66 | 
67 | ... Or combine row and column selections...
68 | 
69 | ```{r}
70 | vocab[5, 'gender']
71 | vocab[c(1:5), c('age','gender')]
72 | ```
73 | 
74 | Want to search through a dataframe for matching rows? We can do that too.
75 | 
76 | ```{r}
77 | vocab[which(vocab$gender == 'F'), ]
78 | vocab[which(vocab$gender == 'F' & vocab$age == 17), ]
79 | vocab[which(vocab$age == 18 | vocab$age == 17), ]
80 | vocab[which(vocab$age > 16 & vocab$age < 19), ]
81 | ```
82 | 
83 | Using assignment, we can overwrite values in a dataframe.
84 | For example, let's set the median vocab for 17-month-old females to 1000.
85 | 
86 | ```{r}
87 | vocab[which(vocab$gender == 'F' & vocab$age == 17), 'median'] <- 1000
88 | vocab[which(vocab$gender == 'F' & vocab$age == 17), ] # now shows 1000
89 | 
90 | vocab[which(vocab$gender == 'F' & vocab$age == 17), c('median','ci.h')] <- c(1000,NA)
91 | 
92 | # reset it now...
93 | vocab[which(vocab$gender == 'F' & vocab$age == 17), c('median','ci.h')] <- c(75,37)
94 | ```
95 | 
96 | # Vectors
97 | 
98 | R is a vectorized language, meaning that we can perform operations on entire vectors
99 | in parallel and not have to operate on each cell one-at-a-time.
100 | 
101 | Take our vector of medians, for example:
102 | 
103 | ```{r}
104 | medians
105 | ```
106 | 
107 | Let's perform some mathematical operations on it:
108 | 
109 | ```{r}
110 | medians + 2
111 | medians*4
112 | (medians^2) / 4
113 | log(medians)
114 | exp(medians)
115 | sqrt(medians)
116 | ```
117 | 
118 | Those vectorized functions return new vectors that have been transformed in some way.
119 | 
120 | We can also use summary functions to boil our medians vector down to a single meaningful metric:
121 | 
122 | ```{r}
123 | mean(medians)
124 | var(medians)
125 | sqrt(var(medians))
126 | sd(medians)
127 | length(medians)
128 | 
129 | length(vocab)
130 | nrow(vocab)
131 | ncol(vocab)
132 | ```
133 | 
134 | Or combine summary functions and vector transformations to do useful things, like
135 | calculate z-scores:
136 | 
137 | ```{r}
138 | z_scores <- (medians - mean(medians)) / sd(medians)
139 | z_scores
140 | 
141 | # compare to R's functions for z-score scaling...
142 | z_scores <- scale(medians,center=T,scale=T)
143 | ```
144 | 
145 | Almost all functions can apply to vectors, even non-mathematical functions:
146 | 
147 | For example, `paste()` combines strings together:
148 | 
149 | ```{r}
150 | paste('this','is','a','sentence',sep=' ')
151 | paste('median: ',medians,sep='')
152 | ```
153 | 
154 | And, of course, we can perform operations involving multiple vectors:
155 | 
156 | ```{r}
157 | # vector * scalar
158 | medians * rnorm(1, 0, 1)
159 | 
160 | # vector * vector
161 | medians * rnorm(length(medians), 0, 1)
162 | ```
163 | 
164 | # Visualization
165 | 
166 | R includes basic functions for visualizing your data:
167 | 
168 | ```{r}
169 | plot(medians)
170 | hist(medians)
171 | plot(vocab$age, vocab$median)
172 | ```
173 | 
174 | ...but I prefer using ggplot2, which gives us more intuitive options for visualization
175 | and greater control over the appearance:
176 | 
177 | ```{r}
178 | ggplot(vocab, aes(x=age, y=median, color=gender)) +
179 |   geom_point()
180 | 
181 | ggplot(vocab, aes(x=age, y=median, color=gender)) +
182 |   geom_pointrange(aes(ymin=median-ci.l, ymax=median+ci.h), position=position_dodge(.4)) +
183 |   labs(x="Age (Months)", y="Productive Vocabulary")
184 | 
185 | ggplot(vocab, aes(x=age, y=median, fill=gender)) +
186 |   geom_bar(stat='identity', position=position_dodge()) +
187 |   geom_errorbar(aes(ymax=median+ci.h, ymin=median-ci.l), position=position_dodge(.9), width=0.25) +
188 |   labs(x="Age (Months)", y="Productive Vocabulary")
189 | ```
190 | 
191 | We'll gradually use more and more advanced aspects of ggplot throughout these tutorials. For now, we are simply plotting single variables from a dataframe (much like we could with Excel). However, by the 4th tutorial, we'll be using ggplot to group, transform, and summarize our (almost) raw data and visualize it in a single shot.
192 | 
193 | # Statistical tests and models
194 | 
195 | Finally, R provides a host of functions to run statistical tests and/or fit models to your data.
196 | 
197 | Let's do a basic correlation just to see how these functions work.
198 | 
199 | First, get just the female data:
200 | 
201 | ```{r}
202 | female_vocab <- vocab[which(vocab$gender == 'F'), c('age','median')]
203 | female_vocab
204 | ```
205 | 
206 | ## Does age correlate with productive vocabulary?
207 | 
208 | Plot using basic R plots.
209 | 
210 | ```{r}
211 | plot(female_vocab$age, female_vocab$median)
212 | 
213 | # add a line of best fit
214 | abline(lm(female_vocab$median ~ female_vocab$age))
215 | ```
216 | 
217 | We'll get more into the syntax of `lm()` later, but this function uses a linear model
218 | to predict the 'median' vector from the 'age' vector, then feeds this to `abline()`
219 | to plot this line of best fit.
220 | 
221 | We can do the same thing with ggplot, though it is nice enough to give us the SE around our linear model fit:
222 | 
223 | ```{r}
224 | ggplot(female_vocab, aes(x=age, y=median)) +
225 |   geom_point() +
226 |   stat_smooth(method="lm")
227 | ```
228 | 
229 | What is the Pearson's r for this correlation?
230 | 
231 | ```{r}
232 | cor(female_vocab$age, female_vocab$median)
233 | ```
234 | 
235 | Is this significant?
236 | 
237 | ```{r}
238 | cor.test(female_vocab$age, female_vocab$median)
239 | ```
240 | 
241 | We can access specific variables in this statistical test's output directly:
242 | 
243 | ```{r}
244 | cor_output <- cor.test(female_vocab$age, female_vocab$median)
245 | 
246 | # useful command to show the available data in any variable
247 | str(cor_output)
248 | 
249 | cor_output$p.value
250 | cor_output$statistic
251 | 
252 | # also:
253 | names(cor_output)
254 | cor_output$conf.int
255 | 
256 | # round the p-value:
257 | cor_output$p.value
258 | round(cor_output$p.value, 3)
259 | 
260 | # how much variance does this account for (e.g., R-squared, R^2)?
261 | cor_output$estimate^2
262 | ```
263 | 
264 | When we run statistical tests like this, the value we get back is not only a collection of variables
265 | (e.g., from which we can select the p-value, estimate, etc.) but also an *object*
266 | on which we can run various *class* methods.
267 | 
268 | ```{r}
269 | class(cor_output)
270 | ```
271 | 
272 | Here we can see we have an "htest" (i.e., "hypothesis test") object.
273 | 
274 | This class is limited to the "print" method.
275 | 
276 | ```{r}
277 | print(cor_output)
278 | ```
279 | 
280 | ... but other classes include other methods like 'plot', 'summary', etc.
281 | 
282 | Let's run another statistical test on some random data. A Welch's t-test:
283 | 
284 | ```{r}
285 | t.test(rnorm(30,5,5), rnorm(30,10,5))
286 | ```
287 | 
288 | We can pass specific options to these tests. One common option for 't.test' is var.equal=T so that we run a standard Student t-test rather than a Welch's t-test.
289 | 
290 | ```{r}
291 | t.test(rnorm(30,5,5), rnorm(30,10,5), var.equal=T)
292 | ```
293 | 
294 | We can also use `paired` to run a dependent, paired t-test, or use `alt` (short for `alternative`) to specify whether we want a
295 | one-sided or two-sided p-value.
296 | 
297 | ```{r}
298 | t.test(rnorm(30,5,5), rnorm(30,10,5), paired=T)
299 | t.test(rnorm(30,5,5), rnorm(30,10,5), alt='less',var.equal=T) # by default: 'two.sided'
300 | ```
301 | 
302 | What class is a `t.test()` object?
303 | 
304 | ```{r}
305 | class(t.test(rnorm(30,5,5), rnorm(30,10,5), var.equal=T))
306 | ```
307 | 
308 | It's the same as `cor.test()`, which is why the output is similar.
309 | 
310 | Finally, we loaded our data from a CSV; now let's save a dataset to a CSV file.
311 | 
312 | ```{r}
313 | write.csv(female_vocab, 'data-vocab-females.csv')
314 | ```
315 | 
316 | Clean up our workspace.
317 | 
318 | ```{r}
319 | ls()
320 | rm(list=ls())
321 | ```
--------------------------------------------------------------------------------
/tutorial2-general-linear-models.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 2: General Linear Models"
3 | author: "Brock Ferguson"
4 | date: "July 1, 2015"
5 | output:
6 |   html_document:
7 |     toc: true
8 |     theme: readable
9 | ---
10 | 
11 | Load packages we'll be using:
12 | 
13 | ```{r}
14 | library(ggplot2)
15 | ```
16 | 
17 | Load in our vocabulary data again.
18 | 
19 | ```{r}
20 | vocab <- read.csv('data-vocab.csv')
21 | ```
22 | 
23 | # Introduction to lm()
24 | 
25 | Re-run our correlation as a basic linear model:
26 | 
27 | ```{r}
28 | model <- lm(median ~ age, data=subset(vocab, gender == 'F'))
29 | ```
30 | 
31 | What methods are available to us on this object?
32 | 
33 | ```{r}
34 | class(model)
35 | 
36 | model
37 | print(model)
38 | 
39 | summary(model)
40 | 
41 | plot(model)
42 | ```
43 | 
44 | # lm() and aov() comparison
45 | 
46 | Let's do some analyses with our vocab dataset to compare the different linear modeling
47 | methods made available by R.
48 | 
49 | First, let's begin by visualizing the raw data again:
50 | 
51 | ```{r}
52 | ggplot(vocab, aes(x=age, y=median, color=gender)) +
53 |   geom_pointrange(aes(ymin=median-ci.l, ymax=median+ci.h), position=position_dodge(.4)) +
54 |   scale_x_continuous(name="Age (Months)") +
55 |   scale_y_continuous(name="Productive Vocabulary")
56 | ```
57 | 
58 | ## Single factor models
59 | 
60 | Let's compare `lm()` (Linear Model) and `aov()` (Analysis of Variance) in modeling these data.
61 | 
62 | Note: For simplicity, we are going to treat each datapoint here as if it came from a different subject -- that is, as if there were 1 subject of each gender at each age.
63 | 
64 | ```{r}
65 | summary(lm(median ~ age, data=subset(vocab, gender == 'F')))
66 | 
67 | summary(aov(median ~ age, data=subset(vocab, gender == 'F')))
68 | ```
69 | 
70 | What we expect is that a single-term linear model should give us the exact same results as a one-way ANOVA. This is actually what we see when we compare these models. Although the ANOVA did not return a precise estimate of the slope (like `lm()`), their p-values are identical and the ANOVA's F-value is exactly the linear model's t, squared:
71 | 
72 | ```{r}
73 | sqrt(630.2)
74 | ```
75 | 
76 | ## Two factor models, no interaction
77 | 
78 | What happens when we add in a second main effect (but no interaction)?
79 | 
80 | ```{r}
81 | summary(lm(median ~ age + gender, data=vocab))
82 | summary(aov(median ~ age + gender, data=vocab))
83 | ```
84 | 
85 | Perfect again! Exact same results using both methods, which is what we would expect when we have a two-way, non-interaction ANOVA and a two-parameter linear model.
86 | 
87 | ## Two factor models, with interactions
88 | 
89 | What happens when we allow our two factors to interact in predicting vocabulary?
90 | 
91 | ```{r}
92 | summary(lm(median ~ age*gender, data=vocab))
93 | summary(aov(median ~ age*gender, data=vocab))
94 | ```
95 | 
96 | Now we see (a) vastly different results between the models and (b) strange results, especially within the linear model. Let's break this down.
97 | 
98 | In our linear model, we still have our significant effect of age (which is good) but now no reliable effect of gender (which is bad) and a positive estimate of being male (which is very bad).
99 | 
100 | In our ANOVA, we see a wildly significant effect of age, and a wildly significant effect of gender.
101 | 
102 | What's going on?
103 | 
104 | ## Why are lm() and aov() different?
105 | 
106 | The answer to this question lies in (a) the way the models are fitted, and (b) the way our parameters are coded before running the model. We normally don't have to think about either of these things in SPSS, for example, because they are done for us automatically.
107 | 
108 | ### lm()
109 | 
110 | Let's first look at the linear model which, at first glance, seems the most questionable.
111 | 
112 | Specifically, (1) why did it not return a main effect of gender?
113 | (2) why is its estimate of the effect of being male positive, when each male "subject" is lower than the female at that same age?
114 | 
115 | Let's re-fit our model and save its results in a variable.
116 | 
117 | ```{r}
118 | model <- lm(median ~ age*gender, data=vocab)
119 | ```
120 | 
121 | We want to know what this model fit looks like. We can do this by retrieving the predictions of our model and storing them in our dataframe.
122 | 
123 | ```{r}
124 | vocab$predictions <- predict(model, vocab)
125 | ```
126 | 
127 | Rather than comparing our predictions and actual values line-by-line, it's best to visualize them on a single plot.
128 | 
129 | Here we will plot the data like we did before; however, we are now going to add a new "layer" to the plot which will include our predicted values.
130 | 
131 | ```{r}
132 | ggplot(vocab, aes(x=age, y=median, color=gender)) +
133 |   geom_point() + # the raw data, plots the 'y' aesthetic above
134 |   geom_line(aes(y=predictions)) + # our predictions, plotted as a line
135 |   scale_x_continuous(name="Age (Months)") +
136 |   scale_y_continuous(name="Predicted Vocabulary")
137 | ```
138 | 
139 | Looks like a good fit! Indeed, our model fit maps exactly onto the line of best fit we can generate using ggplot:
140 | 
141 | ```{r}
142 | ggplot(vocab, aes(x=age, y=median, color=gender)) +
143 |   geom_point() + # the raw data, plots the 'y' aesthetic above
144 |   stat_smooth(method="lm", se = FALSE) + # add a line of best fit, without SE
145 |   scale_x_continuous(name="Age (Months)") +
146 |   scale_y_continuous(name="Predicted Vocabulary")
147 | ```
148 | 
149 | So what's wrong with our model?
150 | 
151 | Let's think about what we want to know with each parameter:
152 | 
153 | For Age, we want to know what the effect of Age is while controlling for Gender.
154 | 
155 | For Gender, we want to know what the effect of Gender is while controlling for Age.
156 | 
157 | When we say "controlling" in a linear model, what we actually mean is "while holding the other variable constant at zero."
158 | 
159 | If we set Age to 0 in our current model, as the variables are currently coded, we are essentially asking for the effect of Gender on a newborn's vocabulary.
160 | 
161 | We can extrapolate our model's predictions down to a newborn to take the model's perspective on this question...
162 | 
163 | ```{r}
164 | vocab_extrapolations <- data.frame(
165 |   c(rep(0:30,each=2)),
166 |   c(rep(c('M','F'),times=31)),
167 |   c(rep(0,62))
168 | )
169 | colnames(vocab_extrapolations) <- c('age','gender','prediction')
170 | 
171 | vocab_extrapolations$prediction <- predict(model, vocab_extrapolations)
172 | 
173 | ggplot(vocab_extrapolations, aes(x=age, y=prediction, color=gender)) +
174 |   geom_line() +
175 |   scale_x_continuous(name="Age (Months)") +
176 |   scale_y_continuous(name="Predicted Vocabulary")
177 | ```
178 | 
179 | Now everything should begin to make sense. Our non-significant, positive effect of being a Male is the model telling us that it thinks male newborns have a slight advantage over female newborns. This obviously doesn't make sense.
180 | 
181 | This wasn't a problem for us before adding the interaction term because, without the interaction, its estimate of the effect of Gender is constant and therefore equally reliable at all ages.
182 | 
183 | To fix this problem, let's *center* the age variable so that when we tell the model to give us the effect of gender while controlling for Age, we are holding Age constant at the *mean* age for our sample.
184 | 
185 | ```{r}
186 | vocab$ageC <- scale(vocab$age, center=T, scale=F)
187 | 
188 | summary(lm(median ~ ageC*gender, data=vocab))
189 | ```
190 | 
191 | This is better: a significant, negative effect for gender.
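Before we move on, it's worth peeking at how R is coding the `gender` factor behind the scenes. (A quick aside; `contrasts()` is base R and works on any factor column.)

```{r}
contrasts(vocab$gender)
# 'F' is the reference level (coded 0); 'M' is coded 1
```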
192 | 
193 | Now, we have to think about our effect of Age. As things stand, when we hold Gender constant at zero, we are actually fixing the gender parameter at "Female." This "treatment coding" of variables is the default for factor columns in R dataframes.
194 | 
195 | We will fix this in the exact same way as before: centering. To center a non-numeric factor variable with only 2 levels, we will re-code it as a numeric variable where our two groups differ by a total of 1 unit. This is called deviation coding:
196 | 
197 | ```{r}
198 | vocab$genderC <- ifelse(vocab$gender == 'F', -.5, .5)
199 | vocab$genderC <- scale(vocab$genderC, center=T, scale=F) # make sure it's centered
200 | 
201 | summary(lm(median ~ ageC*genderC, data=vocab))
202 | ```
203 | 
204 | Now we have a proper model!
205 | 
206 | The moral here is that we need to pay close attention to the way our variables are treated (e.g., centered, coded, standardized) because these choices change the way the model's effects are estimated and interpreted. It's more of an issue in R than elsewhere because, unlike SPSS and other packages which do all of this automatically in the background, R won't assume you want these transformations done for you.
207 | 
208 | I see mistakes with this all of the time. Here are my recommendations when fitting any linear model in R:
209 | 
210 | (1) Deviation code and center all variables, by default.
211 | 
212 | (2) Ask yourself, do the model's estimates make sense intuitively?
213 | 
214 | (3) Visualize your raw data and model predictions.
215 | 
216 | (4) Try to replicate each main-effect parameter's estimate using arithmetic. For example:
217 | 
218 | ```{r}
219 | # our intercept should be equivalent to our total mean
220 | mean(vocab$median)
221 | 
222 | # our gender effect should equal the difference between our male and female groups
223 | mean(vocab[which(vocab$gender=='F'), 'median']) - mean(vocab[which(vocab$gender=='M'), 'median'])
224 | ```
225 | 
226 | The reason we have to be so careful with `lm()` is that what it's giving us is *simple* effects -- estimates of the effect of a single variable while holding other independent variables constant at one level.
227 | 
228 | What we usually want are *main effects* -- estimates of the effect of a single variable while holding all other independent variables at their average.
229 | 
230 | By centering and using deviation-coding, we can get main effects from `lm()`.
231 | 
232 | ### aov()
233 | 
234 | But what about `aov()`? If you remember, it wasn't so bad. It already knew there was a reliable main effect of gender (though we don't know which direction, because we haven't plotted its predictions).
235 | 
236 | Is it better than `lm()`? No!
237 | 
238 | Here's why:
239 | 
240 | ```{r}
241 | # generate three random vectors
242 | random_dv <- rnorm(30,5,5)
243 | random_iv1 <- rnorm(30,5,5)
244 | random_iv2 <- rnorm(30,5,5)
245 | 
246 | # fit a model using these random vectors
247 | summary(aov(random_dv ~ random_iv1 + random_iv2))
248 | 
249 | # now enter them in the opposite order:
250 | summary(aov(random_dv ~ random_iv2 + random_iv1))
251 | ```
252 | 
253 | We got different answers depending on the order we entered the variables. This is because `aov()` uses Type I (sequential) sums of squares. No amount of re-coding or centering can overcome this property and give us the answers we (probably) want.
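As an aside: if you need order-independent tests from an ANOVA-style table, one common route (not used elsewhere in these tutorials, and it assumes you have the `car` package installed) is `car::Anova()`, which computes Type II or Type III sums of squares:

```{r}
library(car)

# same random vectors as above; Type II tests don't depend on entry order
Anova(lm(random_dv ~ random_iv1 + random_iv2), type="II")
Anova(lm(random_dv ~ random_iv2 + random_iv1), type="II")
```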
254 | 
255 | Notice, however, that changing the order of age and gender in our model has no effect:
256 | 
257 | ```{r}
258 | summary(aov(median ~ age + gender, data=vocab))
259 | 
260 | summary(aov(median ~ gender + age, data=vocab))
261 | ```
262 | 
263 | This is because these predictors are perfectly uncorrelated (because we have a balanced design):
264 | 
265 | ```{r}
266 | cor.test(vocab$age, as.numeric(vocab$gender))
267 | ```
268 | 
269 | With even the slightest correlation, our estimates are going to change based on order of entry in the model.
270 | 
271 | ## When should I use aov()?
272 | 
273 | There is only one time you should use `aov()` -- repeated-measures ANOVAs. In tutorial 3, we will begin using mixed-effects models which, by way of random effects, serve much the same function. However, sometimes you just want a good old-fashioned repeated-measures ANOVA and, *provided your design is balanced and everything is coded properly*, you will get an appropriate answer from `aov()`.
274 | 
275 | Here's a quick example:
276 | 
277 | ```{r}
278 | # build a quick dataframe
279 | dv <- c(1,3,4,2,2,3,2,5,6,3,4,4,3,5,6)
280 | subject <- factor(rep(paste0('s',1:5),each=3))
281 | myfactor <- factor(rep(paste0('f',1:3),times=5))
282 | mydata <- data.frame(dv, subject, myfactor)
283 | 
284 | model <- aov(dv ~ myfactor + Error(subject/myfactor), data=mydata)
285 | summary(model)
286 | ```
287 | 
288 | ## Other helpful lm() functions
289 | 
290 | Here are some other helpful functions for linear models:
291 | 
292 | ```{r}
293 | model <- lm(median ~ ageC*genderC, data=vocab)
294 | 
295 | # confint(): get confidence intervals for estimates
296 | confint(model)
297 | 
298 | # pairwise.t.test(): run pairwise t-tests between groups (ignoring other factors)
299 | pairwise.t.test(vocab$median, vocab$genderC)
300 | # insignificant, because we aren't controlling for Age
301 | 
302 | # drop1(): drop each parameter and retrieve an F, for reporting like an ANOVA
303 | model <- lm(median ~ ageC*genderC, data=vocab)
304 | drop1(model,~.,test="F")
305 | ```
306 | 
307 | Clean up our workspace.
308 | 
309 | ```{r}
310 | ls()
311 | rm(list=ls())
312 | ```
--------------------------------------------------------------------------------
/tutorial3-generalized-linear-models.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 3: Generalized Linear Models"
3 | author: "Brock Ferguson"
4 | date: "July 1, 2015"
5 | output:
6 |   html_document:
7 |     toc: true
8 |     theme: readable
9 | ---
10 | 
11 | Load required packages.
12 | 
13 | ```{r}
14 | library(ggplot2)
15 | library(lme4)
16 | library(dplyr)
17 | ```
18 | 
19 | Load our eyetracking dataset. These data are from the familiar world trials for 19-month-olds reported in Ferguson, Graf, & Waxman (2014, Cognition).
20 | 
21 | ```{r}
22 | data <- read.csv('data-eyetracking.csv')
23 | ```
24 | 
25 | What kind of columns do we need for these analyses?
26 | 
27 | ```{r}
28 | summary(data)
29 | ```
30 | 
31 | This dataset is only a few steps removed from what comes out of the standard eyetracker.
32 | 
33 | I have done some prep, though:
34 | 
35 | * removed trackloss
36 | * converted GazeX and Y columns to AOI 1 or 0 columns
37 | * merged in Subject information
38 | 
39 | # Analysis 1: By-subjects and By-items ANOVAs (i.e., F1 and F2)
40 | 
41 | Pros:
42 | * simple
43 | * allows for generalization beyond sampled participants and sampled items.
44 | 
45 | Cons:
46 | * lose all timing information
47 | * what do we do with ambiguities between analyses?
48 | 
49 | We need to aggregate across trials by target within participants:
50 | 
51 | ```{r}
52 | agg_subjects <- data %>%
53 |   group_by(ParticipantName,Sex,Age,Target) %>%
54 |   summarise(PropAnimal = mean(Animate)) %>%
55 |   ungroup()
56 | ```
57 | 
58 | Visualize our aggregated data:
59 | 
60 | ```{r}
61 | ggplot(agg_subjects, aes(x=Target, y=PropAnimal)) +
62 |   geom_point(position=position_jitter(.3))
63 | ```
64 | 
65 | Use our best practices from the `lm()` tutorial to prepare and model these data:
66 | 
67 | ```{r}
68 | agg_subjects$TargetCoded <- ifelse(agg_subjects$Target == 'Artefact', -.5, .5)
69 | ```
70 | 
71 | There's no need to center here because these data are balanced (every subject is present in both conditions). But this is how we would center, anyways:
72 | 
73 | ```{r}
74 | agg_subjects$TargetCoded <- scale(agg_subjects$TargetCoded, center=T, scale=F)
75 | ```
76 | 
77 | Here we use `aov()` because it allows for a repeated-measures Error() term.
78 | 
79 | As we learned before, aov() uses Type I sums of squares but, with only one factor (i.e., no correlation issues), it's safe.
80 | 
81 | ```{r}
82 | model <- aov(PropAnimal ~ TargetCoded + Error(ParticipantName/TargetCoded), data = agg_subjects)
83 | summary(model)
84 | ```
85 | 
86 | Looks good! That's our F1 "subjects" ANOVA.
87 | 
88 | Now we can do an F2 "items" ANOVA as well. This just involves slightly changing our `group_by()` call:
89 | 
90 | ```{r}
91 | agg_items <- data %>%
92 |   group_by(Trial,Target) %>%
93 |   summarise(PropAnimal = mean(Animate)) %>%
94 |   ungroup()
95 | 
96 | agg_items$TargetCoded <- ifelse(agg_items$Target == 'Artefact', -.5, .5)
97 | ```
98 | 
99 | Visualize effects by items:
100 | 
101 | ```{r}
102 | ggplot(agg_items, aes(x=Target, y=PropAnimal, fill=Trial)) +
103 |   geom_point(position=position_jitter(.3))
104 | ```
105 | 
106 | Normally, in an F2 analysis, we would include an Error() term because we would have observed each condition within each item and thus have a sense about the size of the condition effect for each item. However, this was a study with infants and we couldn't do that. So we won't include an Error() term here, and just do a between-items one-way ANOVA.
107 | 
108 | ```{r}
109 | model <- aov(PropAnimal ~ TargetCoded, data = agg_items)
110 | summary(model)
111 | ```
112 | 
113 | These F1/F2 analyses are both crystal clear (reject the null!). But, in other cases, there can be ambiguity. For example, what if one is significant and the other is marginal?
114 | 
115 | Ideally, we could have one test which allows us to control for random trial AND subject factors simultaneously. Enter `lmer()`...
116 | 
117 | # Analysis 2: Simultaneous Trial and Subject Random Effects
118 | 
119 | Aggregate data by Trials (Items) and Participants (i.e., one datapoint for each trial).
120 | 
121 | ```{r}
122 | agg_sub_items <- data %>%
123 |   group_by(ParticipantName,Trial,Target) %>%
124 |   summarise(PropAnimal = mean(Animate)) %>%
125 |   ungroup()
126 | 
127 | agg_sub_items$TargetCoded <- ifelse(agg_sub_items$Target == 'Artefact', -.5, .5)
128 | agg_sub_items$TargetCoded <- scale(agg_sub_items$TargetCoded, center=T, scale=F)
129 | ```
130 | 
131 | Fit a mixed-effects model allowing the intercept (represented by a "1") to vary by both Participants and Trials.
By allowing the intercept to vary by subjects and items, we are allowing the model to estimate (and thus control for) each participant's and each trial's mean tendency to draw looks to the animal regardless of condition. For example, Billy may just love animals while Sammy may hate animals. Importantly, we want to know whether the word they heard (represented here by TargetCoded) caused them to look more/less to the animal above and beyond these baseline preferences.
132 | 
133 | ```{r}
134 | model <- lmer(PropAnimal ~ TargetCoded + (1 | ParticipantName) + (1 | Trial), data = agg_sub_items)
135 | summary(model)
136 | ```
137 | 
138 | This looks good, and converges with our previous estimate of a significant effect.
139 | 
140 | Note that, at the top of the summary output, we can see that there is very little variance in our random effect estimates. This means that the model is having trouble estimating them and is therefore keeping them all very near zero.
141 | 
142 | This is likely the result of (a) small random differences between subjects and trials in this sample and (b) a relatively small dataset. We will want to continue to keep an eye on the variance of random effects -- when this is essentially 0, we may want to consider using a regular ANOVA.
143 | 
144 | But in the meantime let's dive into this model a bit more.
145 | 
146 | We can see the fixed effects:
147 | 
148 | ```{r}
149 | fixef(model)
150 | ```
151 | 
152 | Here, because we centered the variable, `Intercept` is the overall mean looking to the animal. `TargetCoded` represents the difference in looking to the animal between our two conditions. If we subtract 1/2 of it from the intercept, we get our mean looking to the animal when the "Artefact" was named and, if we add 1/2 of it to the intercept, we get our mean looking to the animal when the "Animal" was named.
153 | 
154 | We can also see random effects (some people use these to describe individual differences):
155 | 
156 | ```{r}
157 | ranef(model)
158 | ```
159 | 
160 | Here they are nearly zero, which corresponds to our assessment of their variance earlier.
161 | 
162 | We can also see what the model's predictions were, for each participant/item. Here we will get the mean prediction for each subject for each type of Target trial.
163 | 
164 | ```{r}
165 | agg_sub_items$prediction <- predict(model, agg_sub_items)
166 | 
167 | ggplot(agg_sub_items, aes(x=Target, y=prediction, color=ParticipantName)) +
168 |   stat_summary(fun.y='mean', geom='point', position=position_jitter(.3))
169 | ```
170 | 
171 | You can see that the model is making different predictions for each subject. By allowing their intercepts to vary, the model is accounting for the fact that some kids just like looking at animals more than others. But it's being very conservative with its estimates because, frankly, we haven't given it much data to go on.
172 | 
173 | Importantly, with these models, we aren't limited to varying just the intercept by specific grouping factors. We can also vary slopes (i.e., fixed effects) by participants. By adding "TargetCoded" as a random slope, we allow the model to vary the magnitude of the difference between Target conditions within participants.
174 | 
175 | ```{r}
176 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
177 | summary(model)
178 | ```
179 | 
180 | We now see an additional random effect by the `ParticipantName` group: `TargetCoded`.
We also see that it is actually varying between subjects, suggesting the model is becoming more confident in the differences between our subjects.
181 | 
182 | Look at which participants responded more or less strongly to the Target manipulation.
183 | 
184 | These random effects are centered with respect to the estimated fixed effect. Therefore, a negative value represents a weaker effect of TargetCoded for that participant. This could be a cool way to look at individual differences -- who is responding most accurately to the words spoken?
185 | 
186 | ```{r}
187 | ranef(model)
188 | 
189 | subject_slopes <- ranef(model)$ParticipantName
190 | subject_slopes$subject <- factor(rownames(subject_slopes))
191 | colnames(subject_slopes) <- c('Baseline','ConditionEffect','Subject')
192 | 
193 | ggplot(subject_slopes, aes(x=Subject, y=ConditionEffect)) +
194 |   geom_point() +
195 |   geom_text(aes(label=Subject),hjust=0,vjust=0,size=3) +
196 |   geom_hline(yintercept=0, linetype="dashed", alpha=.5) +
197 |   theme(axis.text.x=element_blank())
198 | ```
199 | 
200 | With this random slope included in the model, we see more variation now in our predictions by subjects. It's even picking up on the fact that participants consistently look to the animal and the real variance lies in whether they look away from the animal when the artefact is named.
201 | 
202 | ```{r}
203 | agg_sub_items$prediction <- predict(model, agg_sub_items)
204 | 
205 | ggplot(agg_sub_items, aes(x=Target, y=prediction, color=ParticipantName)) +
206 |   stat_summary(fun.y='mean', geom='point', position=position_jitter(.3))
207 | ```
208 | 
209 | # Two important questions:
210 | 
211 | Although you've just played around with your first two mixed-effects models, you're likely already asking yourself two pressing questions...
212 | 
213 | ## Question 1: Which random effect structure should we specify?
214 | 
215 | If you are going to use mixed-effects models, you are going to require AT LEAST a random intercept for every natural "grouping" of data. However, beyond random intercepts, what random slopes should you include?
216 | 
217 | ### Guidelines
218 | 
219 | One way you can decide this is by adhering to guidelines. Unfortunately, you'll find guidelines pull you in opposing directions.
220 | 
221 | For example, one school of thought (see Barr et al., 2013) is to "keep it maximal". That is, include every random effect that your experimental design permits (i.e., every factor that appeared across subjects or trials). Another school of thought (see Bates et al., 2015) is to keep it parsimonious. Don't overcomplicate your models with lots of random slopes, as this will make model estimates increasingly hard to reconcile with the data and risks overparameterizing your model:
222 | 
223 | ```{r, error=TRUE}
224 | # Will return error: too many random effects
225 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded + Trial | ParticipantName) + (1 | Trial), data = agg_sub_items)
226 | ```
227 | 
228 | ### Model comparison
229 | 
230 | A second way to decide is to think bottom-up from the data. Compare two models -- one with your random slope and another without your random slope -- and see if your random slope model is actually a better fit.
231 | 
232 | ```{r}
233 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
234 | 
235 | model_null <- lmer(PropAnimal ~ TargetCoded + (1 | ParticipantName) + (1 | Trial), data = agg_sub_items)
236 | 
237 | anova(model,model_null) # -2 log-likelihood ratio test, gives you Chisq(df) = X and a p-value.
238 | ```
239 | 
240 | We'll talk more about model comparison -- a very powerful tool -- in just a minute.
241 | 
242 | ### Variance of random effects
243 | 
244 | Another bottom-up approach is to look at the variance of the random effect estimates (like we did before) rather than thinking dichotomously about the "significance" of the random slope.
245 | 
246 | ```{r}
247 | summary(model)
248 | ```
249 | 
250 | ### Design limitations
251 | 
252 | A final approach -- and my preferred approach -- is to combine the bottom-up and top-down approaches to ask yourself, what random effects can my design *actually* allow me to estimate with reasonable precision? Here, by "precision", I mean precision with respect to the subject's true random effect.
253 | 
254 | For example, if I had only two trials of each type (one in which I named an animal, and the other in which I named an artefact), I could calculate each subject's *exact* random slope for the effect of Target. This is a very precise estimate, but it's artificially precise -- there's no chance I've accurately captured this subject's responsiveness to which target was named. With a handful of trials of each type, I'm less precise in accounting for my exact data but I'm more precise in estimating their true responsiveness.
255 | 
256 | Think about your design and what it can estimate. Include those variables that it estimates well as random slopes. Ignore those variables that you can't estimate well and even consider collapsing across those observations which you aren't distinguishing with a random effect.
257 | 
258 | ## Question 2: How do we get p-values for specific factors?
259 | 
260 | Unlike `lm()` and `aov()`, `lmer()` doesn't generate p-values... It's unclear, in these kinds of models, how to calculate the degrees of freedom -- required by the 't' statistic -- in order to get a p-value.
261 | 
262 | Nevertheless, we have several options for doing so:
263 | 
264 | ### Use model comparison
265 | 
266 | Fit your model with the factor of interest:
267 | 
268 | ```{r}
269 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
270 | ```
271 | 
272 | Fit a second model without this factor:
273 | 
274 | ```{r}
275 | model_null <- lmer(PropAnimal ~ 1 + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
276 | anova(model,model_null)
277 | ```
278 | 
279 | In this test, the degrees of freedom represent the difference between the number of parameters in the models. Note that this will be [NUMBER OF LEVELS - 1] for a removed factor, i.e., it can be >1 for a single factor.
280 | 
281 | For many factors, you can save time by automating the removal of each factor with `drop1()`:
282 | 
283 | ```{r}
284 | drop1(model,~.,test="Chisq")
285 | ```
286 | 
287 | So what's really going on when we compare these models?
288 | 
289 | The test we are actually running is called a **-2 log-likelihood ratio test**. The gist of this approach is that it compares the likelihood of the data under the full model (including our parameter) to the likelihood of the data under our null model (without our parameter).
If the difference is great enough relative to the difference in the number of parameters between the models (the degrees of freedom in the test), then we can say the difference is significant.
290 | 
291 | Let's do one by hand. We begin by re-fitting both our models with the new option `REML=FALSE`. `anova()` and `drop1()` do this automatically for us but, because we are going to do this by hand, we need to do it manually.
292 | 
293 | ```{r}
294 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
295 | 
296 | model_null <- lmer(PropAnimal ~ 1 + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
297 | ```
298 | 
299 | Second, let's extract the log-likelihood of the data under the full model and null model:
300 | 
301 | ```{r}
302 | log_model <- logLik(model)
303 | log_model
304 | 
305 | log_null <- logLik(model_null)
306 | log_null
307 | ```
308 | 
309 | Higher is better for log-likelihoods. Higher is more likely.
310 | 
311 | Now calculate -2 times the log of the ratio of the two likelihoods, after converting the log-likelihoods back to likelihoods with `exp()`:
312 | 
313 | ```{r}
314 | -2*log(exp(log_null)/exp(log_model))
315 | ```
316 | 
317 | We are now left with a log-likelihood ratio test statistic of `20.77924`. We can now see whether this is significant by checking the Chi-square distribution with degrees of freedom equivalent to the difference in parameters between our two models. The df's can be retrieved by the `logLik()` function, or calculated as `(factors removed)*(levels per factor) - (factors removed)`.
318 | 
319 | ```{r}
320 | log_model
321 | log_null
322 | # 1 df
323 | 
324 | pchisq(20.77924, df=1, lower.tail=F)
325 | ```
326 | 
327 | Note that this technique is only appropriate for nested models -- i.e., comparing a null model which is some subset of a larger model. To compare non-nested models, you should use another criterion such as AIC or BIC (also reported in `drop1()`).
328 | 
329 | For example, is what word the subjects heard a better predictor than a random factor I create?
330 | 
331 | ```{r}
332 | agg_sub_items$RandomFactor <- ifelse(rbinom(nrow(agg_sub_items),1,.5) == 1, 'On','Off')
333 | 
334 | model1 <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
335 | 
336 | model2 <- lmer(PropAnimal ~ RandomFactor + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
337 | 
338 | anova(model1,model2)
339 | ```
340 | 
341 | For BIC and AIC, lower is better (unlike log-likelihoods). Each is a score for the fit of the model after penalizing it for its number of parameters. BIC carries a stricter penalty. Interpreting the degree of difference in fit involves agreed-upon standards: for example, a <2 point BIC difference isn't strong evidence, a 2-6 point difference is positive evidence, 6-10 is strong, and >10 is very strong.
342 | 
343 | ### Treat the t's as z's
344 | 
345 | This is not as dangerous as it sounds with sufficient sample sizes.
346 | 
347 | ```{r}
348 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
349 | ts <- fixef(model)/sqrt(diag(vcov(model)))
350 | ts
351 | 2*pnorm(abs(ts),lower.tail=FALSE)
352 | ```
353 | 
354 | ### Just use confidence intervals (new statistics!)
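The logic here: if the 95% interval around a fixed-effect estimate excludes zero, we can report a reliable effect without ever computing a p-value. (Note that `confint()` profiles the likelihood by default, which can be slow for complex models.)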
355 | 
356 | ```{r}
357 | confint(model)
358 | ```
359 | 
360 | ### Use an available technique to approximate the degrees of freedom
361 | 
362 | Kenward-Roger approximation:
363 | 
364 | ```{r}
365 | library(pbkrtest)
366 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
367 | model_null <- lmer(PropAnimal ~ 1 + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
368 | KRmodcomp(model,model_null)
369 | detach('package:pbkrtest',unload=TRUE)
370 | ```
371 | 
372 | Satterthwaite approximation:
373 | 
374 | ```{r}
375 | library(lmerTest)
376 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
377 | summary(model)
378 | detach('package:lmerTest',unload=TRUE)
379 | ```
380 | 
381 | # Analysis 3: Linear time analyses using logistic regression
382 | 
383 | All of the analyses thus far have immediately disregarded time as a variable of interest. Yet we know that looking changes over time and, moreover, that this change over time can offer insight into the cognitive processes underlying participants' behaviour. So how can we look at behaviour over time?
384 | 
385 | One way to do this is to use logistic regression (family="binomial") on our raw dataset, predicting the binomial value of looking to the Animate or not (which is either a 1 or 0). This model estimates the log-odds (i.e., probability in logit space) of looking at the animal as a function of the fixed and random factors specified, including a factor for `TimeFromSubphaseOnset`.
386 | 
387 | Logistic regression solves a problem we are now going to start taking seriously. Until now, we were predicting bounded, non-linear proportional data using a linear model. With logistic regression, and future linear models in which we employ transformations, we are going to correct for this.
388 | 
389 | When we do logistic regression, we receive our slope estimates in logits. These represent the log-odds of getting a 1 versus a 0 based on some factor. Here's a plot showing how these relate to proportions (how we typically think of probabilities):
390 | 
391 | ```{r}
392 | probabilities <- seq(0.1,0.9,by=.1)
393 | log_odds <- log(probabilities/(1-probabilities))
394 | plot(probabilities, log_odds)
395 | ```
396 | 
397 | You can see that a probability of .8 roughly equals a log-odds or logit of 1.39. You can take the exponent of this number to calculate the odds when the probability is .8...
398 | 
399 | ```{r}
400 | exp(1.39)
401 | ```
402 | 
403 | ... and you get roughly 4:1 odds, as you would expect.
404 | 
405 | If probabilities/proportions are so easily transformed into logits/log-odds, then why bother doing it? Well, because we are doing LINEAR modeling and log-odds are linear but proportions are not. That is, proportions have a restricted range of 0-1, causing ceiling and floor effects. Log-odds, on the other hand, do not.
406 | 
407 | To illustrate: A 2-to-1 increase in odds has a constant, linear effect on the log-odds regardless of where you are in logit space.
408 | 
409 | ```{r}
410 | log(8/1) - log(4/1)
411 | 
412 | log(16/1) - log(8/1)
413 | ```
414 | 
415 | In contrast, a .1 increase in probabilities has a non-constant, non-linear effect in probability space:
416 | 
417 | ```{r}
418 | probabilities <- seq(0.1,0.9,by=.1)
419 | odds <- (probabilities/(1-probabilities))
420 | plot(probabilities, odds)
421 | ```
422 | 
423 | This is why we see a bend at the tails when comparing proportions to log-odds -- floor/ceiling effects mean that small differences in proportions correspond to larger differences in actual odds.
424 | 
425 | Okay, that said, let's run a logistic regression model:
426 | 
427 | ```{r}
428 | data$TargetC <- ifelse(data$Target == 'Animal', .5, -.5)
429 | data$TargetC <- scale(data$TargetC, center=T, scale=F)
430 | 
431 | # note: we are not going to center TimeFromSubphaseOnset for now
432 | 
433 | model <- glmer(Animate ~ TargetC*TimeFromSubphaseOnset + (1 | Trial) + (1 | ParticipantName), data = data, family="binomial")
434 | ```
435 | 
436 | We are getting some warnings here. R suggests we should rescale our variables (likely because TargetC ranges from -.5 to .5 while TimeFromSubphaseOnset ranges from 0 to 5500), and reports a convergence error.
437 | 
438 | Let's try fixing the scale issue...
439 | 
440 | ```{r}
441 | data$TimeS <- data$TimeFromSubphaseOnset / 1000
442 | 
443 | model <- glmer(Animate ~ TargetC*TimeS + (1 | Trial) + (1 | ParticipantName), data = data, family="binomial")
444 | summary(model)
445 | ```
446 | 
447 | No errors! We'll hold off on adding random slopes to this model for now (it will be really slow).
448 | 
449 | Let's interpret these effects - what do main effects and interactions mean? Note that we centered and deviation-coded our Target variable but *not* our Time variable.
450 | 
451 | Therefore, the TargetC estimate represents the increase in log-odds of looking to the animal between our Artefact and Animal conditions and, because it's positive, they look more to the animal when it is named than when the artefact is named. Because we did not center TimeS before fitting this model, this effect of TargetC is the model's estimate at the *beginning* of the trial. Although we want a main effect, we know from the raw data that we should not be seeing this effect at the beginning of the trial. Instead, it should emerge through the trial as a Time*Target interaction.
452 | 
453 | The effect of TimeS represents the main effect of looking away from the Animal over time.
454 | 
455 | The Intercept represents the log-odds of looking at the animal at Time==0 (regardless of condition).
456 | 
457 | Visualize the fitted model:
458 | 
459 | ```{r}
460 | data$prediction <- predict(model, data)
461 | 
462 | # in logit space...
463 | ggplot(data, aes(x=TimeS, y=prediction, color=Target)) +
464 |   stat_summary(fun.y='mean', geom='line')
465 | 
466 | # as proportions...
467 | data$prediction_prob <- exp(data$prediction) / (1 + exp(data$prediction) )
468 | ggplot(data, aes(x=TimeS, y=prediction_prob, color=Target)) +
469 |   stat_summary(fun.y='mean', geom='line')
470 | ```
471 | 
472 | At first glance, three things are worth mentioning:
473 | 
474 | (1) The proportion plot is slightly warped. This represents the warping that occurs in the transformation between probabilities and log-odds.
475 | 
476 | (2) The model predictions look noisy. Why? Model predictions should be clean and constant given that we only have two factors and an interaction.
477 | 
478 | The problem is that we generated predictions for our datapoints that included the random effects.
So, taking the mean of the predictions included idiosyncratic estimates for our participants. To remove these, we can simply pass the `re.form=NA` parameter to the `predict()` function:
479 | 
480 | ```{r}
481 | data$prediction <- predict(model, data, re.form=NA)
482 | 
483 | ggplot(data, aes(x=TimeS, y=prediction, color=Target)) +
484 |   stat_summary(fun.y='mean', geom='line')
485 | ```
486 | 
487 | (3) These are a horrible fit to our data! Let's convert these predictions to probabilities and then plot them alongside our raw data:
488 | 
489 | ```{r}
490 | ggplot(data, aes(x=TimeS, y=Animate, color=Target)) +
491 |   stat_summary(fun.y='mean', geom='line') +
492 |   stat_summary(aes(y=exp(prediction)/(1+exp(prediction))), fun.y='mean', geom='line')
493 | ```
494 | 
495 | We fit two straight lines to two curves which clearly change in a non-linear fashion over time. This not only results in this poor visual fit but also, for example, in seeing a main effect of TargetC at timepoint 0 when we know the groups to be very similar at the start of the trial.
496 | 
497 | Therefore, we need a model that will let us capture this non-linear growth over time, and tell us how our two conditions differ with respect to it (growth curves!).
498 | 
499 | # A note about empirical logits, arc-sine roots, and other transformations
500 | 
501 | Sometimes you just need to aggregate your data in a way where you *must* predict proportions even though your raw measure was binomial (i.e., logistic) in nature. For time analyses of eye-tracking data, this is especially common. If you aggregate across trials/items/subjects in any way, you are going to lose the sample-by-sample binomial responses collected from subjects.
502 | 
503 | When this occurs, you'll want to implement a correction which takes the proportional data as input and returns a linear-corrected DV as output.
504 | 
505 | Here we'll walk through a couple of transformations, after binning our data by subjects across time.
506 | 
507 | ```{r}
508 | # bin data into 50ms bins
509 | data$Bin <- data$TimeFromSubphaseOnset %/% 50
510 | # equivalent: data$Bin <- floor(data$TimeFromSubphaseOnset / 50)
511 | 
512 | head(data[, c('TimeFromSubphaseOnset','Bin')])
513 | ```
514 | 
515 | Calculate the proportion of looking to the animal (PropAnimal), as well as the number of samples looking to the Animal (y), and the total number of samples (N), and the start of the Bin in ms (Time).
516 | 
517 | Of course, we'll lose the Trial grouping variable/random effect, because we aggregate across trials.
518 | 
519 | ```{r}
520 | binned <- data %>%
521 |   group_by(ParticipantName,Target,Bin) %>%
522 |   summarise(PropAnimal = mean(Animate), y=sum(Animate), N=length(Animate), TimeS=min(TimeS))
523 | ```
524 | 
525 | One transformation that will work for us is called the *empirical logit* (calculated as the log-odds with a small constant added to keep from getting -Infinity):
526 | 
527 | ```{r}
528 | binned$elog <- log( (binned$y + .5) / (binned$N - binned$y + .5) )
529 | ```
530 | 
531 | Fit an unweighted empirical logit model:
532 | 
533 | ```{r}
534 | model <- lmer(elog ~ Target*TimeS + (1 | ParticipantName), data = binned)
535 | summary(model)
536 | ```
537 | 
538 | Notice the empirical logit estimates are very similar to our actual logits from the previous logistic model.
539 | 
540 | We can also account for `Target` and `TimeS` random slopes by Participant. Accounting for Time as a random effect is always recommended -- lots of individual noise here.
541 | 
542 | ```{r}
543 | model <- lmer(elog ~ Target*TimeS + (1 + Target + TimeS | ParticipantName), data = binned)
544 | summary(model)
545 | ```
546 | 
547 | Another transformation we can use, besides the empirical logit, is the Arcsine-root transformation. Just like logits and empirical logits, it "stretches out" the ceiling and floor of the distribution, transforming the natural logistic curve inherent to probabilities to be appropriate for linear models.
548 | 
549 | ```{r}
550 | asin(sqrt(.5))
551 | asin(sqrt(.6))
552 | asin(sqrt(.95))
553 | asin(sqrt(.99))
554 | 
555 | plot(seq(.0,1,by=.1),asin(sqrt(seq(0,1,by=.1))))
556 | 
557 | binned$Arcsin <- asin(sqrt(binned$PropAnimal))
558 | head(binned)
559 | 
560 | model <- lmer(Arcsin ~ Target*TimeS + (1 + Target + TimeS | ParticipantName), data = binned)
561 | summary(model)
562 | ```
563 | 
564 | Clean up our workspace.
565 | 
566 | ```{r}
567 | ls()
568 | rm(list=ls())
569 | ```
--------------------------------------------------------------------------------
/tutorial4-growth-curve-analysis.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 4: Growth Curve Analyses"
3 | author: "Brock Ferguson"
4 | date: "July 1, 2015"
5 | output:
6 |   html_document:
7 |     toc: true
8 |     theme: readable
9 | ---
10 | 
11 | Load required packages.
12 | 
13 | ```{r}
14 | library(ggplot2)
15 | library(lme4)
16 | library(dplyr)
17 | ```
18 | 
19 | Load our eyetracking data.
20 | 
21 | ```{r}
22 | data <- read.csv('data-eyetracking.csv')
23 | ```
24 | 
25 | Note: In this tutorial, we will do by-subjects time analyses only (aggregating across trials within subjects) for simplicity, but you can do by-items/trials analyses in the same way if you aggregate within trials instead of subjects.
26 | 
27 | # Bin data within participants by time
28 | 
29 | Let's prep these data for analysis much like we prepped for the empirical logit analysis.
30 | 
31 | ```{r}
32 | # rescale, bin, aggregate, and transform our DV's all in one step
33 | binned <- data %>%
34 |   mutate(TimeS = TimeFromSubphaseOnset / 1000,
35 |          Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin
36 |   group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
37 |   summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
38 |   mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
39 |          wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
40 |          Arcsin = asin(sqrt(PropAnimal))) %>% # arcsin-sqrt
41 |   ungroup()
42 | ```
43 | 
44 | # Fit our old linear model
45 | 
46 | Here's a model similar to the linear model we had before:
47 | 
48 | ```{r}
49 | binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5)
50 | binned$TargetC <- scale(binned$TargetC, center=T, scale=F)
51 | 
52 | model <- lmer(elog ~ TargetC*TimeS + (1 + TargetC + TimeS | ParticipantName), data = binned)
53 | summary(model)
54 | ```
55 | 
56 | Visualize the data and model fit:
57 | 
58 | ```{r}
59 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
60 |   stat_summary(fun.y=mean, geom="point") +
61 |   stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
62 | ```
63 | 
64 | We already know that this model isn't a great fit to our data. By assuming linear growth, it gives us weird estimates, one major one being that it thinks our two groups differ at timepoint 0 (which we know not to be true -- the differences emerge over time).
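To put a number on that timepoint-0 problem, we can compare the model's `TargetC` estimate (which, with Time left uncentered, is the predicted condition difference at TimeS = 0) against the raw condition means in the very first bin. (A quick sketch using the objects defined above.)

```{r}
# model's estimated condition difference in elog units at TimeS = 0
fixef(model)['TargetC']

# raw condition means in the first 50ms bin, for comparison
binned %>%
  filter(Bin == min(Bin)) %>%
  group_by(Target) %>%
  summarise(elog = mean(elog))
```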
65 | 
66 | # Natural polynomial growth curve analysis
67 | 
68 | Let's do our first stab at examining non-linear growth by creating and entering natural polynomials into the model.
69 | 
70 | ```{r}
71 | binned <- binned %>%
72 |   mutate(TimeS_2 = TimeS^2,
73 |          TimeS_3 = TimeS^3,
74 |          TimeS_4 = TimeS^4)
75 | 
76 | head(binned)
77 | 
78 | ggplot(binned, aes(x=TimeS, y=TimeS)) +
79 |   geom_point() +
80 |   geom_point(aes(y=TimeS_2), color='red') +
81 |   geom_point(aes(y=TimeS_3), color='blue') +
82 |   geom_point(aes(y=TimeS_4), color='green')
83 | ```
84 | 
85 | Each natural polynomial accelerates at a different rate. These have an interesting property in that, when combined, they capture the estimated slope of the data at successive "bends". That is, the 2nd (quadratic) polynomial will capture the first bend, the 3rd (cubic) polynomial will capture the second bend, and so on.
86 | 
87 | In this way, models that include these natural polynomials can model non-linear growth.
88 | 
89 | ```{r}
90 | model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3 + TimeS_4) + (1 + Target + TimeS + TimeS_2 + TimeS_3 + TimeS_4 | ParticipantName), data = binned)
91 | summary(model)
92 | ```
93 | 
94 | We have some convergence issues caused by overparameterization, so let's scale back on these polynomials to something that seems more reasonable. A good rule of thumb is to count the number of "bends" in the data before breaking it down by condition.
95 | 
96 | How many bends do we see?
97 | 
98 | ```{r}
99 | ggplot(binned, aes(x=TimeS, y=elog)) +
100 |   stat_smooth(method="loess")
101 | ```
102 | 
103 | There appear to be two bends in the data which, because an Nth-order polynomial can capture N-1 bends, means we should include 3 polynomial terms.
104 | 
105 | ```{r}
106 | model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3) + (1 + Target + TimeS + TimeS_2 + TimeS_3 | ParticipantName), data = binned)
107 | summary(model)
108 | ```
109 | 
110 | There is still a convergence error, likely because the scales of our variables are too different, like before. We will ignore this for now, because our final GCA approach will fix this.
111 | 
112 | In the meantime, let's see what this model did for us. Looking at the estimates above, it seems to have gotten rid of our timepoint 0 main effect of Target (yes!), and instead shows some strong interactions over time. This seems promising... but let's visualize.
113 | 
114 | ```{r}
115 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
116 |   stat_summary(fun.y=mean, geom="point") +
117 |   stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
118 | ```
119 | 
120 | Awesome! Much better.
121 | 
122 | Let's store this model because we are going to use it for comparison later.
123 | 
124 | ```{r}
125 | natural_model <- model
126 | ```
127 | 
128 | # Orthogonal polynomial growth curve analysis
129 | 
130 | ## Solving the multicollinearity of natural polynomials
131 | 
132 | We just did our first non-linear growth curve analysis, but it was sub-optimal for two reasons:
133 | 
134 | (1) these polynomial terms we generated are highly correlated with one another, and multicollinearity in linear models is always bad
135 | (2) our model had trouble converging because of the very different scales of our time predictors
136 | 
137 | Thankfully, we have something that will help: *orthogonal polynomials*.
138 | 
139 | Let's first consider problem (1): our natural polynomials are highly correlated with each other. Here you can see this in a standard correlation matrix.
# Orthogonal polynomial growth curve analysis

## Solving the multicollinearity of natural polynomials

We just did our first non-linear growth curve analysis, but it was sub-optimal for two reasons:

(1) the polynomial terms we generated are highly correlated with one another, and multicollinearity makes it hard to attribute variance to any one predictor
(2) our model had trouble converging because our predictors are on such different scales

Thankfully, we have something that will help: *orthogonal polynomials*.

Let's first consider problem (1): our natural polynomials are highly correlated with each other. You can see this in a standard correlation matrix.

```{r}
cor(binned[, c('TimeS','TimeS_2','TimeS_3','TimeS_4')])
```

This is not a good thing when we are trying to attribute variance to each factor independently. And it isn't unique to time-based models -- any linear model with correlated predictors suffers from multicollinearity.

So, what we can do is create replacement timecodes for linear, quadratic, cubic, etc. change over time that we *know*, by design, will be uncorrelated.

`poly()` will generate higher-order polynomials for us, with a vector length equivalent to the length of our original time vector.

We'll go up to 6th-order polynomials, but we'll stick to the first 3 for most of our models.

```{r}
orthogonal_polynomials <- poly(sort(as.vector(unique(binned$TimeS))), 6)
head(orthogonal_polynomials)
```

Column 1 grows linearly. Column 2 grows quadratically. Column 3 grows cubically... etc.

Visualize this, and verify that they are indeed uncorrelated.

```{r}
ggplot(data.frame(orthogonal_polynomials), aes(x=X1, y=X1)) +
  geom_point() +
  geom_point(aes(y=X2), color='red') +
  geom_point(aes(y=X3), color='blue') +
  geom_point(aes(y=X4), color='green') +
  geom_point(aes(y=X5), color='purple') +
  geom_point(aes(y=X6), color='yellow')

cor(orthogonal_polynomials[, c(1:6)])
round(cor(orthogonal_polynomials[, c(1:6)]),5)
```

Perfect!

I like to merge them into the original dataframe using this technique, which allows for missing data from any given participant.

```{r}
time_codes <- data.frame(
  sort(as.vector(unique(binned$TimeS))),
  orthogonal_polynomials[, c(1:6)]
)
colnames(time_codes) <- c('TimeS','ot1','ot2','ot3','ot4','ot5','ot6')

binned <- merge(binned, time_codes, by='TimeS')
```

## Orthogonal modeling

Now let's model our data exactly like we did before but using these orthogonal polynomials:

```{r}
model <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 | ParticipantName), data = binned)
summary(model)
```

Great fit! No errors.

Interestingly, we are back to seeing a main effect of `TargetC`, though... Why? The reason is simple: our natural polynomials all started at timepoint 0, meaning that main effects represented differences at the *start* of the time window. In contrast, orthogonal polynomials are, by default, centered at 0, meaning that main effects represent *average* differences (across the time window) between levels of a factor.

Let's visualize our data and model fit:

```{r}
ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
```

Compare this model to our natural polynomial model.

```{r}
summary(natural_model)
summary(model)
```

We can use the same methods as before to get confidence intervals, test for Type III significance, etc.

```{r}
# confint(model)
# this takes a long time with a model this complex...

drop1(model, ~., test="Chisq")
```

`drop1()` suggests that all of our parameters are reliable predictors.
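If profile confidence intervals are too slow, one common shortcut is Wald intervals, which are approximate (and only available for the fixed effects) but essentially instant. A sketch:

```{r}
# Wald intervals: fast normal-approximation CIs for the fixed effects.
# (Rows for the random-effect variance parameters come back as NA.)
confint(model, method = "Wald")
```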
Let's try manually adding the 4th and 5th orthogonal polynomials and see their effects on the model.

```{r}
model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned)
summary(model_quartic)

ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model
  stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model

anova(model, model_quartic)
```

Despite the very underwhelming difference in the visualized model fits, model comparison says that `ot4` is a significant predictor.

One problem: this comparison isn't just about `ot4` as a fixed effect, because we also added `ot4` to the random effect structure. When examining the influence of a fixed effect, it's best to keep your random effect structure constant. Because we changed the random structure too, `anova()` reports that our models differ by 8 degrees of freedom, when we only wanted to test the main effect of `ot4` and its interaction with `TargetC` (i.e., two parameters, or a 2-df difference).

```{r}
model_cubic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F)

model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F)

anova(model_cubic, model_quartic)
```

With the random structure held constant, model comparison still says the quartic model is a significantly better fit, despite the underwhelming difference in the visualized fits.

What about a 5th polynomial?

```{r}
model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 + ot5 | ParticipantName), data = binned, REML=F)

model_quintic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4 + ot5) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 + ot5 | ParticipantName), data = binned, REML=F)

ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 4th-order model
  stat_summary(aes(y=predict(model_quintic,binned,re.form=NA)), fun.y=mean, geom="line") # 5th-order model

anova(model_quartic, model_quintic)
```

Adding a 5th polynomial did not improve our fit.

Here's a final reminder of how bad our linear model was:

```{r}
model_linear <- lmer(elog ~ TargetC*(ot1) + (1 + TargetC + ot1 | ParticipantName), data = binned)
summary(model_linear)

ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 4th-order model
  stat_summary(aes(y=predict(model_linear,binned,re.form=NA)), fun.y=mean, geom="line") # linear (1st-order) model
```
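To put numbers on that comparison, we can line up fit indices for the linear, cubic, and quartic models side by side. The cubic and quartic models were fit with ML (`REML=F`) above; here we refit the linear model with ML too so the indices are comparable. A bookkeeping sketch, assuming those models are still in the workspace (`model_linear_ml` is just an illustrative name):

```{r}
# Refit the linear model with ML so its AIC/BIC can be compared with
# model_cubic and model_quartic (both fit with REML = FALSE above).
model_linear_ml <- update(model_linear, REML = FALSE)
AIC(model_linear_ml, model_cubic, model_quartic)
BIC(model_linear_ml, model_cubic, model_quartic)
```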
## Growth curve analyses with 3+ levels of a factor

I like to design experiments with only 2 levels per factor for simplicity, but sometimes a factor has 3 levels and, then, main effects no longer equal simple effects.

To demonstrate, let's add a third "Neutral" Target level that will follow a similar trajectory to the "Animal" level but shifted down towards 50% chance.

```{r}
new_condition <- binned[which(binned$Target == 'Animal'), ]
new_condition$Target <- 'Neutral'
# shift looking down slightly, with random noise
new_condition$y <- new_condition$y + round(rnorm(length(new_condition$y),-.5,2))
# clip y to stay within [1, N]
new_condition$y <- ifelse(new_condition$y > new_condition$N,new_condition$N,new_condition$y)
new_condition[which(new_condition$y < 1), 'y'] <- 1
new_condition$PropAnimal <- new_condition$y / new_condition$N
new_condition$elog <- log( (new_condition$y + .5) / (new_condition$N - new_condition$y + .5) )
new_condition$wts <- 1/(new_condition$y + .5) + 1/(new_condition$N - new_condition$y + .5)
new_condition$Arcsin <- asin(sqrt(new_condition$PropAnimal))

binned_3levels <- rbind(binned,new_condition)
binned_3levels$Target <- factor(binned_3levels$Target)

ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point")
```

Fit a model with Target treatment-coded.

```{r}
model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
summary(model)

ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model,binned_3levels,re.form=NA)), fun.y=mean, geom="line")
```

In order to interpret these simple effects, we need to remember which level is our reference level (by default, the first alphabetically -- here, "Animal").

Get omnibus effects (here, the Target-by-ot3 interaction) via model comparison...

```{r}
model_null <- lmer(elog ~ Target*(ot1 + ot2) + ot3 + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
summary(model_null)

anova(model,model_null)
```

Get simple effects by re-ordering the factor levels. With "Neutral" as the reference level, each `Target` coefficient now compares one of the other conditions to "Neutral".

```{r}
levels(binned_3levels$Target)
binned_3levels$Target <- factor(binned_3levels$Target,c('Neutral','Animal','Artefact'))
levels(binned_3levels$Target)

model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
summary(model)
```

Clean up our workspace.

```{r}
ls()
rm(list=ls())
```
--------------------------------------------------------------------------------