├── .RData
├── .Rhistory
├── README.md
├── data-eyetracking.csv
├── data-vocab-females.csv
├── data-vocab-individuals.csv
├── data-vocab.csv
├── independent problems
    ├── tutorial2-independent.R
    └── tutorial3-independent.R
├── old
    ├── README.md
    └── gca-tutorial.R
├── probabilities-refresher.R
├── tutorial1-introduction.Rmd
├── tutorial1-introduction.html
├── tutorial2-general-linear-models.Rmd
├── tutorial2-general-linear-models.html
├── tutorial3-generalized-linear-models.Rmd
├── tutorial3-generalized-linear-models.html
├── tutorial4-growth-curve-analysis.Rmd
└── tutorial4-growth-curve-analysis.html


/.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brockf/tutorial-GCA-R/364d56eae9b9e0fc767c5e79d561bf7177961fb8/.RData


--------------------------------------------------------------------------------
/.Rhistory:
--------------------------------------------------------------------------------
  1 | I have done some prep, though:
  2 | * removed trackloss
  3 | * converted GazeX and Y columns to AOI 1 or 0 columns
  4 | * merged in Subject information
  5 | # Analysis 1: By-subjects and By-items ANOVAs (i.e., F1 and F2)
  6 | Pros:
  7 | * simple
  8 | * allows for generalization beyond samples participants and sampled items.
  9 | Cons:
 10 | * lose all timing information
 11 | * what do we do with ambiguities between analyses?
 12 | We need to aggregate across trials by target within participants:
 13 | ```{r}
 14 | agg_subjects <- data %>%
 15 | group_by(ParticipantName,Sex,Age,Target) %>%
 16 | summarise(PropAnimal = mean(Animate)) %>%
 17 | ungroup()
 18 | ```
 19 | Visualize our aggregated data:
 20 | ```{r}
 21 | ggplot(agg_subjects, aes(x=Target, y=PropAnimal)) +
 22 | geom_point(position=position_jitter(.3))
 23 | ```
 24 | Use our best practices from the `lm()` tutorial to prepare and model these data:
 25 | ```{r}
 26 | agg_subjects$TargetCoded <- ifelse(agg_subjects$Target == 'Artefact', -.5, .5)
 27 | ```
 28 | There's no need to center here because these data are balanced (every subject is present in both conditions). But this is how we would center, anyways:
 29 | ```{r}
 30 | agg_subjects$TargetCoded <- scale(agg_subjects$TargetCoded, center=T, scale=F)
 31 | ```
 32 | Here we use `aov()` because it allows for a repeated-measures Error() term.
 33 | As we learned before, aov() uses Type I sums of squares, but with only one factor (i.e., no correlation issues), it's safe.
 34 | ```{r}
 35 | model <- aov(PropAnimal ~ TargetCoded + Error(ParticipantName/TargetCoded), data = agg_subjects)
 36 | summary(model)
 37 | ```
 38 | Looks good! That's our F2 "subjects" ANOVA.
 39 | Now we can do an F1 "items" ANOVA as well. This just involves slightly changing our `group_by()` call:
 40 | ```{r}
 41 | agg_items <- data %>%
 42 | group_by(Trial,Target) %>%
 43 | summarise(PropAnimal = mean(Animate)) %>%
 44 | ungroup()
 45 | agg_items$TargetCoded <- ifelse(agg_items$Target == 'Artefact', -.5, .5)
 46 | ```
 47 | Visualize effects by items:
 48 | ```{r}
 49 | ggplot(agg_items, aes(x=Target, y=PropAnimal, fill=Trial)) +
 50 | geom_point(position=position_jitter(.3))
 51 | ```
 52 | Normally, in an F2 analysis, we would include an Error() term because we would have observed each condition within each item and thus have a sense about the size of the condition effect for each item. However, this was a study with infants and we couldn't do that. So we won't include an Error() term here, and just do a between-subjects one-way ANOVA.
 53 | ```{r}
 54 | model <- aov(PropAnimal ~ TargetCoded, data = agg_items)
 55 | summary(model)
 56 | ```
 57 | These F1/F2 analyses are both crystal clear (reject the null!). But, in other cases, there can be ambiguit. For example, what if one is significant and the other is marginal?
 58 | Ideally, we could have one test which allows us to control for random trial AND subject factors simultaneously. Enter `lmer()`...
 59 | # Analysis 2: Simultaneous Trial and Subject Random Effects
 60 | Aggregate data by Trials (Items) and Participants (i.e., one datapoint for each trial).
 61 | ```{r}
 62 | agg_sub_items <- data %>%
 63 | group_by(ParticipantName,Trial,Target) %>%
 64 | summarise(PropAnimal = mean(Animate)) %>%
 65 | ungroup()
 66 | agg_sub_items$TargetCoded <- ifelse(agg_sub_items$Target == 'Artefact', -.5, .5)
 67 | ```
 68 | Fit a mixed-effects model allowing the intercept (represented by a "1") to vary by both Participants and Trials. By allowing the intercept to vary by subjects and items, we are allowing the model to estimate (and thus control for) each participants' and trials' mean tendency to look (cause participants to) look at the animal regardless of condition. For example, Billy may just love animals while Sammy may hate animals. Importantly, we want to know whether they word they heard (represented here by TargetCoded) caused them to look more/less to the animal above and beyond these baseline preferences.
 69 | ```{r}
 70 | model <- lmer(PropAnimal ~ TargetCoded + (1 | ParticipantName) + (1 | Trial), data = agg_sub_items)
 71 | summary(model)
 72 | ```
 73 | This looks good, and converges with our previous estimate of a significant effect.
 74 | Note that, at the top of the summary output, we can see that there is very little variance in our random effect estimates. This means that the model is having trouble estimating them and is therefore keeping them all very near zero.
 75 | This is likely the result of (a) small random differences between subjects and trials in this sample and, (b) a relatively small dataset. We will want to continue keep an eye on the variance of random effects -- when this is essentially 0, we may want to consider using a regular ANOVA.
 76 | But in the meantime let's dive into this model a bit more.
 77 | We can see the fixed effects:
 78 | ```{r}
 79 | fixef(model)
 80 | ```
 81 | Here, because we centered the variable, `Intercept` is the overall mean looking to animal. `TargetCoded` represents the difference between looking to the animal between our two conditions. If we subtract 1/2 of it from the intercept, we get our mean looking to the animal when the "Artefact" was named and, if we add 1/2 of it to the intercept, we get our mean looking to the animal when the "Animal" was named.
 82 | We can also see random effects (some people use these to describe individual differences):
 83 | ```{r}
 84 | ranef(model)
 85 | ```
 86 | Here they are nearly zero, which corresponds to our assessment of their variance earlier.
 87 | We can also see what the model's predictions were, for each participant/item. Here we will get the mean prediction for each subject for each type of Target trial.
 88 | ```{r}
 89 | agg_sub_items$prediction <- predict(model, agg_sub_items)
 90 | ggplot(agg_sub_items, aes(x=Target, y=prediction, color=ParticipantName)) +
 91 | stat_summary(fun.y='mean', geom='point', position=position_jitter(.3))
 92 | ```
 93 | You can see that the model is making different predictions for each subject. By allowing their intercepts to vary, the model is accounting for the fact that some kids just like looking at animals more than others. But it's being very conservative with its estimates because, frankly, we haven't given it much data to go on.
 94 | Importantly, with these models, we aren't limited to varying just the intercept by specific grouping factors. We can also vary slopes (i.e., fixed effects) by participants as well. By adding "TargetCoded" as a random slope, we allow the model to vary the magnitude of the difference between Target conditions within participants.
 95 | ```{r}
 96 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
 97 | summary(model)
 98 | ```
 99 | We now see an additional random effect by the `ParticipantName` group: `TargetCoded`. We also see that it is actually varying between subjects, suggesting the model is become more confident in the differences between our subjects.
100 | Look at which participants responded stronger/weaker to the Target manipulation.
101 | These random effects are centered with respect to the fixed effect estimated. Therefore, negative effects represented a weaker effect of TargetCoded for this participant. This could be a cool way to look at individual differences -- who is responding most accurately to the words spoken?
102 | ```{r}
103 | ranef(model)
104 | subject_slopes <- ranef(model)$ParticipantName
105 | subject_slopes$subject <- factor(rownames(subject_slopes))
106 | colnames(subject_slopes) <- c('Baseline','ConditionEffect','Subject')
107 | ggplot(subject_slopes, aes(x=Subject, y=ConditionEffect)) +
108 | geom_point() +
109 | geom_text(aes(label=Subject),hjust=0,vjust=0,size=3) +
110 | geom_hline(yint=0, linetype="dashed", alpha=.5) +
111 | theme(axis.text.x=element_blank())
112 | ```
113 | With this random slope included in the model, we see more variation now in our predictions by subjects. It's even picking up on the fact that participants consistently look to the animal and the real variance lies in whether they look away from the animal when the artefact is named.
114 | ```{r}
115 | agg_sub_items$prediction <- predict(model, agg_sub_items)
116 | ggplot(agg_sub_items, aes(x=Target, y=prediction, color=ParticipantName)) +
117 | stat_summary(fun.y='mean', geom='point', position=position_jitter(.3))
118 | ```
119 | # Two important questions:
120 | Although you've just played around with your first two mixed-effects models, you're likely already asking yourself two pressing questions...
121 | ## Question 1: Which random effect structure should we specify?
122 | If you are going to use mixed-effects models, you are going to require AT LEAST a random intercept for every natural "grouping" of data. However, beyond random intercepts, what random slopes should you include?
123 | ### Guidelines
124 | One way you can decide this is by adhering to guidelines. Unfortunately, you'll find guidelines pull you in opposing directions.
125 | For example, one school of thought (see Barr et al., 2013) is to "keep it maximal". That is, include every random effect that your experimental design permits (i.e., every factor that appeared across subjects or trials). Another school of thought (see Bates et al., 2015) is to keep it parsimonious. Don't overcomplexify your models with lots of random slopes, as this will make model estimates increasingly hard to reconcile with the data and risk overparamterizing your model:
126 | ```{r}
127 | # commented out because it kills knitr
128 | # model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded + Trial | ParticipantName) + (1 | Trial), data = agg_sub_items)
129 | ```
130 | ### Model comparison
131 | A second way to decide is to think bottom-up from the data. Compare two models -- one with your random slope and another without your random slope -- and see if your random slope model is actually a better fit.
132 | ```{r}
133 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
134 | model_null <- lmer(PropAnimal ~ TargetCoded + (1 | ParticipantName) + (1 | Trial), data = agg_sub_items)
135 | anova(model,model_null) # -2 log-likelihood ratio test, gives you Chisq(df) = X and a p-value.
136 | ```
137 | We'll talk more about model comparison -- a very powerful tool -- in just a minute.
138 | summary(model)
139 | summary(model)
140 | summary(model)
141 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
142 | model_null <- lmer(PropAnimal ~ 1 + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
143 | anova(model,model_null)
144 | log_model <- logLik(model)
145 | log_model
146 | log_null <- logLik(model_null)
147 | log_null
148 | exp(39)
149 | log_model
150 | log_null
151 | View(agg_sub_items)
152 | agg_sub_items$RandomFactor <- ifelse(rbinom(nrow(agg_sub_items),.5) == 1, 'On','Off')
153 | ?rbinom
154 | agg_sub_items$RandomFactor <- ifelse(rbinom(nrow(agg_sub_items),1,.5) == 1, 'On','Off')
155 | View(agg_sub_items)
156 | model1 <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
157 | model2 <- lmer(PropAnimal ~ RandomFactor + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
158 | anova(model1,model2)
159 | library(ggplot2)
160 | library(lme4)
161 | library(dplyr)
162 | data <- read.csv('data-eyetracking.csv')
163 | binned <- data %>%
164 | mutate(TimeFromSubphaseOnset = (TimeFromSubphaseOnset / 1000) %/% 50) %>% # re-scale, bin
165 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
166 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
167 | ungroup() %>%
168 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
169 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
170 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt
171 | binned <- data %>%
172 | mutate(Bin = (TimeFromSubphaseOnset / 1000) %/% 50) %>% # re-scale, bin
173 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
174 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
175 | ungroup() %>%
176 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
177 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
178 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt
179 | binned <- data %>%
180 | mutate(TimeS = TimeFromSubphaseOnset / 1000,
181 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin
182 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
183 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
184 | ungroup() %>%
185 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
186 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
187 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt
188 | View(binned)
189 | binned <- data %>%
190 | mutate(TimeS = TimeFromSubphaseOnset / 1000,
191 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin
192 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
193 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
194 | #ungroup() %>%
195 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
196 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
197 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt
198 | View(binned)
199 | binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5)
200 | binned$TargetC <- scale(binned$TargetC, center=T, scale=F)
201 | model <- lmer(elog ~ TargetC*TimeS + (1 + TargetC + TimeS | ParticipantName), data = binned)
202 | summary(model)
203 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
204 | stat_summary(fun.y=mean, geom="point") +
205 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
206 | binned <- binned %>%
207 | mutate(TimeS_2 = TimeS^2,
208 | TimeS_3 = TimeS^3,
209 | TimeS_4 = TimeS^4)
210 | head(binned)
211 | plot(binned$TimeS, binned$TimeS_2)
212 | plot(binned$TimeS, binned$TimeS_3)
213 | plot(binned$TimeS, binned$TimeS_4)
214 | ggplot(binned, aes(x=TimeS, y=TimeS_2)) +
215 | geom_point() +
216 | geom_point(aes(y=TimeS_3)) +
217 | geom_point(aes(y=TimeS_4))
218 | ggplot(binned, aes(x=TimeS, y=TimeS)) +
219 | geom_point() +
220 | geom_point(aes(y=TimeS_2), color='red') +
221 | geom_point(aes(y=TimeS_3), color='blue') +
222 | geom_point(aes(y=TimeS_4), color='green')
223 | model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3 + TimeS_4) + (1 + Target + TimeS + TimeS_2 + TimeS_3 + TimeS_4 | ParticipantName), data = binned)
224 | summary(model)
225 | ggplot(binned, aes(x=TimeS, y=elog)) +
226 | stat_summary(fun.y='mean', geom='line')
227 | ggplot(binned, aes(x=TimeS, y=elog)) +
228 | stat_smooth(method="loess")
229 | model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3) + (1 + Target + TimeS + TimeS_2 + TimeS_3 | ParticipantName), data = binned)
230 | summary(model)
231 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
232 | stat_summary(fun.y=mean, geom="point") +
233 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
234 | natural_model <- model
235 | cor(binned[, c('TimeS','TimeS_2','TimeS_3','TimeS_4')])
236 | orthogonal_polynomials <- poly(sort(as.vector(unique(binned$TimeS))), 6)
237 | head(orthogonal_polynomials)
238 | ggplot(orthogonal_polynomials, aes(x=1, y=1)) +
239 | geom_point() +
240 | geom_point(aes(y=2), color='red') +
241 | geom_point(aes(y=3), color='blue') +
242 | geom_point(aes(y=4), color='green')
243 | ggplot(data.frame(orthogonal_polynomials), aes(x=1, y=1)) +
244 | geom_point() +
245 | geom_point(aes(y=2), color='red') +
246 | geom_point(aes(y=3), color='blue') +
247 | geom_point(aes(y=4), color='green')
248 | data.frame(orthogonal_polynomials)
249 | ggplot(data.frame(orthogonal_polynomials), aes(x=X1, y=X1)) +
250 | geom_point() +
251 | geom_point(aes(y=X2), color='red') +
252 | geom_point(aes(y=X3), color='blue') +
253 | geom_point(aes(y=X4), color='green')
254 | ggplot(data.frame(orthogonal_polynomials), aes(x=X1, y=X1)) +
255 | geom_point() +
256 | geom_point(aes(y=X2), color='red') +
257 | geom_point(aes(y=X3), color='blue') +
258 | geom_point(aes(y=X4), color='green') +
259 | geom_point(aes(y=X5), color='purple') +
260 | geom_point(aes(y=X6), color='yellow')
261 | cor(orthogonal_polynomials[, c(1:6)])
262 | round(cor(orthogonal_polynomials[, c(1:6)]),5)
263 | time_codes <- data.frame(
264 | sort(as.vector(unique(binned$TimeS))),
265 | orthogonal_polynomials[, c(1:6)]
266 | )
267 | colnames(time_codes) <- c('TimeS','ot1','ot2','ot3','ot4','ot5','ot6')
268 | binned <- merge(binned, time_codes, by='TimeS')
269 | model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned)
270 | summary(model)
271 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
272 | stat_summary(fun.y=mean, geom="point") +
273 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
274 | summary(natural_model)
275 | summary(model)
276 | drop1(model, ~., test="Chisq")
277 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned)
278 | summary(model_quartic)
279 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
280 | stat_summary(fun.y=mean, geom="point") +
281 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model
282 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model
283 | model <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 | ParticipantName), data = binned)
284 | anova(model, model_quartic)
285 | model_cubic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned)
286 | model_cubic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F)
287 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F)
288 | anova(model_cubic, model_quartic)
289 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 + ot5 | ParticipantName), data = binned, REML=F)
290 | model_quintic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4 + ot5) + (1 + TargetC + ot1 + ot2 + ot3 + ot4+ ot5  | ParticipantName), data = binned, REML=F)
291 | anova(model_quartic, model_quintic)
292 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
293 | stat_summary(fun.y=mean, geom="point") +
294 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed', color='black') + # 4th-order model
295 | stat_summary(aes(y=predict(model_quintic,binned,re.form=NA)), fun.y=mean, geom="line") # 5th-order model
296 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
297 | stat_summary(fun.y=mean, geom="point") +
298 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed', color='black') + # 3rd-order model
299 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model
300 | model <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 | ParticipantName), data = binned)
301 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
302 | stat_summary(fun.y=mean, geom="point") +
303 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed', color='black') + # 3rd-order model
304 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model
305 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
306 | stat_summary(fun.y=mean, geom="point") +
307 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line") + # 3rd-order model
308 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model
309 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
310 | stat_summary(fun.y=mean, geom="point") +
311 | stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model
312 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model
313 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
314 | stat_summary(fun.y=mean, geom="point") +
315 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 4th-order model
316 | stat_summary(aes(y=predict(model_quintic,binned,re.form=NA)), fun.y=mean, geom="line") # 5th-order model
317 | model_linear <- lmer(elog ~ Target*(ot1) + (1 + Target + ot1 | ParticipantName), data = binned)
318 | summary(model_linear)
319 | model_linear <- lmer(elog ~ TargetC*(ot1) + (1 + TargetC + ot1 | ParticipantName), data = binned)
320 | summary(model_linear)
321 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
322 | stat_summary(fun.y=mean, geom="point") +
323 | stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model
324 | stat_summary(aes(y=predict(model_linear,binned,re.form=NA)), fun.y=mean, geom="line") # 2nd-order model
325 | new_condition <- binned %>%
326 | filter(Target == 'Animal') %>%
327 | mutate(Target = 'Neutral',
328 | y = y + round(rnorm(length(y),0,2)),
329 | y = ifelse(y > N, N, y),
330 | y = ifelse(y < 1, 1, y),
331 | PropAnimal = y / N,
332 | elog = log(y / (N-y+.5)),
333 | ArcSin = asin(sqrt(PropAnimal)))
334 | new_condition <- binned %>%
335 | filter(Target == 'Animal')
336 | View(binned)
337 | new_condition <- binned
338 | new_condition <- binned %>%
339 | filter(Target == 'Animal')
340 | new_condition <- binned %>%
341 | filter(Target == 'Animal')
342 | new_condition <- data.frame(binned) %>%
343 | filter(Target == 'Animal')
344 | summary(binned)
345 | binned[Target == 'Animal']
346 | binned[which(Target == 'Animal'), ]
347 | binned[which(binned$Target == 'Animal'), ]
348 | new_condition <- binned %>%
349 | filter(Target == 'Animal')
350 | new_condition <- binned %>%
351 | group_by(ParticipantName) %>%
352 | summarise(MeanTest=mean(PropAnimal))
353 | View(new_condition)
354 | new_condition <- binned %>%
355 | filter(Target == 'Animal')
356 | colnames(binned)
357 | new_condition <- binned %>%
358 | filter(Target = 'Animal')
359 | filter(binned, Target == 'Animal')
360 | new_condition <- binned %>%
361 | ungroup() %>%
362 | filter(Target == 'Animal')
363 | new_condition <- binned %>%
364 | filter(TargetC == .5)
365 | new_condition <- binned %>%
366 | #filter(Target == 'Animal') %>%
367 | mutate(Target = 'Neutral',
368 | y = y + round(rnorm(length(y),0,2)),
369 | y = ifelse(y > N, N, y),
370 | y = ifelse(y < 1, 1, y),
371 | PropAnimal = y / N,
372 | elog = log(y / (N-y+.5)),
373 | ArcSin = asin(sqrt(PropAnimal)))
374 | summary(binned)
375 | detach("package:plyr",detach=T)
376 | detach("package:plyr",unload=T)
377 | binned <- data %>%
378 | mutate(TimeS = TimeFromSubphaseOnset / 1000,
379 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin
380 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
381 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
382 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
383 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
384 | Arcsin = asin(sqrt(PropAnimal))) # arcsin-sqrt %>%
385 | ungroup()
386 | binned <- data %>%
387 | mutate(TimeS = TimeFromSubphaseOnset / 1000,
388 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin
389 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
390 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
391 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
392 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
393 | Arcsin = asin(sqrt(PropAnimal))) %>% # arcsin-sqrt
394 | ungroup()
395 | binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5)
396 | binned$TargetC <- scale(binned$TargetC, center=T, scale=F)
397 | model <- lmer(elog ~ TargetC*TimeS + (1 + TargetC + TimeS | ParticipantName), data = binned)
398 | summary(model)
399 | new_condition <- binned %>%
400 | filter(Target == 'Animal') %>%
401 | mutate(Target = 'Neutral',
402 | y = y + round(rnorm(length(y),0,2)),
403 | y = ifelse(y > N, N, y),
404 | y = ifelse(y < 1, 1, y),
405 | PropAnimal = y / N,
406 | elog = log(y / (N-y+.5)),
407 | ArcSin = asin(sqrt(PropAnimal)))
408 | new_condition <- binned %>%
409 | filter(Target == 'Animal') %>%
410 | mutate(Target = 'Neutral',
411 | y = y + round(rnorm(length(y),0,2)),
412 | y = ifelse(y > N, N, y),
413 | y = ifelse(y < 1, 1, y),
414 | PropAnimal = y / N,
415 | elog = log(y / (N-y+.5)),
416 | ArcSin = asin(sqrt(PropAnimal)))
417 | binned_3levels <- rbind(binned,new_condition)
418 | binned_3levels$Target <- factor(binned_3levels$Target)
419 | View(new_condition)
420 | View(binned)
421 | binned <- data %>%
422 | mutate(TimeS = TimeFromSubphaseOnset / 1000,
423 | Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin
424 | group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
425 | summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
426 | mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
427 | wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
428 | Arcsin = asin(sqrt(PropAnimal))) %>% # arcsin-sqrt
429 | ungroup()
430 | binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5)
431 | binned$TargetC <- scale(binned$TargetC, center=T, scale=F)
432 | binned <- binned %>%
433 | mutate(TimeS_2 = TimeS^2,
434 | TimeS_3 = TimeS^3,
435 | TimeS_4 = TimeS^4)
436 | orthogonal_polynomials <- poly(sort(as.vector(unique(binned$TimeS))), 6)
437 | time_codes <- data.frame(
438 | sort(as.vector(unique(binned$TimeS))),
439 | orthogonal_polynomials[, c(1:6)]
440 | )
441 | colnames(time_codes) <- c('TimeS','ot1','ot2','ot3','ot4','ot5','ot6')
442 | binned <- merge(binned, time_codes, by='TimeS')
443 | new_condition <- binned %>%
444 | filter(Target == 'Animal') %>%
445 | mutate(Target = 'Neutral',
446 | y = y + round(rnorm(length(y),0,2)),
447 | y = ifelse(y > N, N, y),
448 | y = ifelse(y < 1, 1, y),
449 | PropAnimal = y / N,
450 | elog = log(y / (N-y+.5)),
451 | ArcSin = asin(sqrt(PropAnimal)))
452 | summary(binned)
453 | data.frame(as.matrix(binned))
454 | binned2 <- data.frame(as.matrix(binned))
455 | View(binned2)
456 | new_condition <- binned2 %>%
457 | filter(Target == 'Animal') %>%
458 | mutate(Target = 'Neutral',
459 | y = y + round(rnorm(length(y),0,2)),
460 | y = ifelse(y > N, N, y),
461 | y = ifelse(y < 1, 1, y),
462 | PropAnimal = y / N,
463 | elog = log(y / (N-y+.5)),
464 | ArcSin = asin(sqrt(PropAnimal)))
465 | View(new_condition)
466 | new_condition <- binned2 %>%
467 | filter(Target == 'Animal') %>%
468 | mutate(Target = 'Neutral',
469 | y = y + round(rnorm(length(y),0,2)))
470 | summary(binned2)
471 | binned2 <- as.matrix(binned)
472 | new_condition <- binned2 %>%
473 | filter(Target == 'Animal')
474 | View(binned)
475 | new_condition <- binned[which(binned$Target == 'Animal'), ]
476 | new_condition$Target <- 'Neutral'
477 | #new_condition$y <- new_condition$y - round(new_condition$N / 3)
478 | new_condition$y <- new_condition$y + round(rnorm(length(new_condition$y),0,2))
479 | new_condition$y <- ifelse(new_condition$y > new_condition$N,new_condition$N,new_condition$y)
480 | new_condition[which(new_condition$y < 1), 'y'] <- 1
481 | new_condition$PropAnimal <- new_condition$y / new_condition$N
482 | new_condition$elog <- log( (new_condition$y) / (new_condition$N - new_condition$y + .5) )
483 | new_condition$wts <- 1/(new_condition$y + .5) + 1/(new_condition$N - new_condition$y + .5)
484 | new_condition$Arcsin <- asin(sqrt(new_condition$PropAnimal))
485 | binned_3levels <- rbind(binned,new_condition)
486 | binned_3levels$Target <- factor(binned_3levels$Target)
487 | ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
488 | stat_summary(fun.y=mean, geom="point")
489 | new_condition <- binned[which(binned$Target == 'Animal'), ]
490 | new_condition$Target <- 'Neutral'
491 | #new_condition$y <- new_condition$y - round(new_condition$N / 3)
492 | new_condition$y <- new_condition$y + round(rnorm(length(new_condition$y),-.5,2))
493 | new_condition$y <- ifelse(new_condition$y > new_condition$N,new_condition$N,new_condition$y)
494 | new_condition[which(new_condition$y < 1), 'y'] <- 1
495 | new_condition$PropAnimal <- new_condition$y / new_condition$N
496 | new_condition$elog <- log( (new_condition$y) / (new_condition$N - new_condition$y + .5) )
497 | new_condition$wts <- 1/(new_condition$y + .5) + 1/(new_condition$N - new_condition$y + .5)
498 | new_condition$Arcsin <- asin(sqrt(new_condition$PropAnimal))
499 | binned_3levels <- rbind(binned,new_condition)
500 | binned_3levels$Target <- factor(binned_3levels$Target)
501 | ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
502 | stat_summary(fun.y=mean, geom="point")
503 | model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
504 | summary(model)
505 | ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
506 | stat_summary(fun.y=mean, geom="point") +
507 | stat_summary(aes(y=predict(model,binned_3levels,re.form=NA)), fun.y=mean, geom="line")
508 | summary(binned_3levels$Target)
509 | levels(binned_3levels$Target)
510 | binned_3levels$Target <- factor(binned_3levels$Target, levels=c('Neutral','Animal','Artefact'))
511 | model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
512 | summary(model)
513 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # R Workshop on Using Linear Models, Logistic Regression, and Growth Curve Analyses to Analyze Eye-tracking Data
 2 | 
 3 | A 4-part series culminating in using growth-curve analyses to model eye-tracking data.
 4 | 
 5 | ## Introduction to R
 6 | What are dataframes and vectors? How do R functions work? How do statistical tests in R work? How can I import and export data?
 7 |  
 8 | ## General Linear Models
 9 | How can I fit linear models in R? When should I use aov() and when should I use lm()? How can I interpret parameter estimates (without the help of SPSS...)?
10 |  
11 | ## Generalized Linear Models
12 | How can I use generalized linear models (e.g., logistic regression) to do time-based eye-tracking analyses? How can I use empirical logit regression to the same end? And the arcsin-root transformation? How do mixed-effects models' random effects (intercepts and slopes) work in lmer()?
13 |  
14 | ## Growth Curve Analyses
15 | How do I look at non-linear change over time? What are the differences between natural and orthogonal polynomials? How can interpret estimates in a growth curve model versus an empirical logit model? How can I visualize my raw data and model fits simultaneously?
16 | 
17 | ### Acknowledgments
18 | 
19 | * Dan Mirman for GCA techniques
20 | * Dale Barr for empirical logit regression
21 | * Florian Jaeger for mixed-effects models
22 | * Mike Frank and the Wordbank team for vocabulary data in the first two tutorials
23 | 


--------------------------------------------------------------------------------
/data-vocab-females.csv:
--------------------------------------------------------------------------------
 1 | "","age","median"
 2 | "1",16,50.5
 3 | "3",17,75
 4 | "5",18,121
 5 | "7",19,148
 6 | "9",20,221
 7 | "11",21,213
 8 | "13",22,309
 9 | "15",23,379
10 | "17",24,350
11 | "19",25,435
12 | "21",26,471
13 | "23",27,531.5
14 | "25",28,513
15 | "27",29,579
16 | "29",30,575
17 | 


--------------------------------------------------------------------------------
/data-vocab.csv:
--------------------------------------------------------------------------------
 1 | "","age","gender","median","ci.l","ci.h","n"
 2 | "1",16,"F",50.5,8.5,17.5,65
 3 | "2",16,"M",31,16,40,59
 4 | "3",17,"F",75,31.05,37,35
 5 | "4",17,"M",56.5,14.5,25.0874999999999,36
 6 | "5",18,"F",121,50,48.5999999999999,34
 7 | "6",18,"M",51,15,40.5999999999999,47
 8 | "7",19,"F",148,54.5,71,38
 9 | "8",19,"M",139,62,52,35
10 | "9",20,"F",221,76,40,45
11 | "10",20,"M",117,33,73.5,48
12 | "11",21,"F",213,73,114,31
13 | "12",21,"M",182,70,103,39
14 | "13",22,"F",309,82.2125,73.5,40
15 | "14",22,"M",190,53.625,104,32
16 | "15",23,"F",379,71,77,41
17 | "16",23,"M",314,69,32,41
18 | "17",24,"F",350,63,47,59
19 | "18",24,"M",270,52.1375,82,48
20 | "19",25,"F",435,53,44,45
21 | "20",25,"M",363,89,27,31
22 | "21",26,"F",471,101,70,39
23 | "22",26,"M",420,116,48,39
24 | "23",27,"F",531.5,66.5,36.5,38
25 | "24",27,"M",416,85,75,43
26 | "25",28,"F",513,45,30,29
27 | "26",28,"M",398.5,78.0875,77.0125,30
28 | "27",29,"F",579,84,49,28
29 | "28",29,"M",444,80,39,31
30 | "29",30,"F",575,34,42,35
31 | "30",30,"M",540,60,48,31
32 | 


--------------------------------------------------------------------------------
/independent problems/tutorial2-independent.R:
--------------------------------------------------------------------------------
 1 | vocab_inds <- read.csv('data-vocab-individuals.csv')
 2 | vocab_inds$ageC <- scale(vocab_inds$age, center=T,scale=F)
 3 | vocab_inds$birth_orderC <- scale(vocab_inds$birth_order, center=T, scale=F)
 4 | vocab_inds$mom_edC <- scale(vocab_inds$mom_ed, center=T, scale=F)
 5 | vocab_inds$genderContrast <- -.5
 6 | vocab_inds[which(vocab_inds$gender == 'M'), 'genderContrast'] <- .5
 7 | 
 8 | model <- lm(productive ~ ageC + genderContrast + birth_orderC + mom_edC, data = vocab_inds)
 9 | summary(model)
10 | 
11 | library(ggplot2)
12 | 
13 | # visualize gender and age effects
14 | ggplot(vocab_inds, aes(x=age, y=productive, color=gender)) + 
15 |                       geom_point() +
16 |                       stat_smooth(method="lm") +
17 |                       scale_x_continuous(name="Age (Months)") +
18 |                       scale_y_continuous(name="Vocabulary")
19 |                     
20 | # visualize birth order effects
21 | # ugly median split:
22 | vocab_inds$birth_orderSplit <- 'Early'
23 | vocab_inds[which(vocab_inds$birth_order > median(vocab_inds$birth_order, na.rm=T)), 'birth_orderSplit'] <- 'Late'
24 | ggplot(vocab_inds, aes(x=age, y=productive, color=birth_orderSplit)) + 
25 |                                       geom_point() +
26 |                                       stat_smooth(method="lm") +
27 |                                       scale_x_continuous(name="Age (Months)") +
28 |                                       scale_y_continuous(name="Vocabulary")
29 | 
30 | # right now, our model's intercept (292.37) represents the estimated productive vocabulary
31 | # for a genderless baby at 22.91 months born as the 1.59th child born to a mom with 14.98 years
32 | # of education
33 | mean(vocab_inds$age)
34 | mean(vocab_inds$birth_order, na.rm=T)
35 | mean(vocab_inds$mom_ed, na.rm=T)
36 | 
37 | # this is neither the overall mean, nor the mean of any particular cell
38 | mean(vocab_inds$productive, na.rm=T)
39 |   # 292.64
40 | 
41 | # but if we center the genderContrast variable, it should be the overall mean
42 | vocab_inds$genderContrastC <- scale(vocab_inds$genderContrast, center=T, scale=F)
43 | model <- lm(productive ~ ageC + genderContrastC + birth_orderC + mom_edC, data = vocab_inds)
44 | summary(model)
45 | 
46 | # replicates our model
47 | drop1(model,~.,test="F")
48 | 
49 | # visualize our model's predictions for our important factors
50 | vocab_inds$predicted <- predict(model, vocab_inds)
51 | 
52 | vocab_inds$group <- paste(vocab_inds$gender, vocab_inds$birth_orderSplit, sep=' / ')
53 | vocab_inds$group <- factor(vocab_inds$group)
54 | 
55 | ggplot(vocab_inds, aes(x=age, y=predicted, color=group)) + 
56 |                               geom_point() +
57 |                               stat_smooth(method="lm", se=F) +
58 |                               scale_x_continuous(name="Age (Months)") +
59 |                               scale_y_continuous(name="Predicted Vocabulary")
60 | 
61 | # fit a crazy interaction model
62 | crazy_model <- lm(productive ~ (ageC + genderContrastC + birth_orderC + mom_edC)^3, data = vocab_inds)
63 | summary(crazy_model)
64 | 
65 | vocab_inds$predicted <- predict(crazy_model, vocab_inds)
66 | 
67 | ggplot(vocab_inds, aes(x=age, y=predicted, color=group)) + 
68 |   geom_point() +
69 |   stat_smooth(method="lm", se=F) +
70 |   scale_x_continuous(name="Age (Months)") +
71 |   scale_y_continuous(name="Predicted Vocabulary")
72 | 
73 | # is the crazy model worth keeping?
74 | anova(crazy_model, model, test="F")
75 | 


--------------------------------------------------------------------------------
/independent problems/tutorial3-independent.R:
--------------------------------------------------------------------------------
 1 | # tutorial 3 - independent work
 2 | 
 3 | library(ggplot2)
 4 | library(lme4)
 5 | library(plyr)
 6 | library(ez)
 7 | 
 8 | # load our data
 9 | data <- read.csv('data-eyetracking.csv')
10 | 
11 | data$TargetCoded <- -.5
12 | data[which(data$Target == 'Artefact'), 'TargetCoded'] <- .5
13 | 
14 | data$SexCoded <- -.5
15 | data[which(data$Sex == 'F'), 'SexCoded'] <- .5
16 | 
17 | data$TimeS <- data$TimeFromSubphaseOnset / 1000;
18 | 
19 | model <- glmer(Animate ~ (TargetCoded + SexCoded + TrialNumber + Age + TimeS)^2 + (1 + TargetCoded + TimeS + TrialNumber | ParticipantName) + (1 | Trial), family="binomial", data=data)
20 |   # takes forever to fit... I stopped after 10 minutes
21 | 
22 | data$Bin <- data$TimeFromSubphaseOnset %/% 50
23 | 
24 | binned <- ddply(data, .(ParticipantName,Age,TargetCoded,SexCoded,Bin), summarize, PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS))
25 | 
26 | # calculate the empirical logit
27 | binned$elog <- log( (binned$y + .5) / (binned$N - binned$y + .5) )
28 | binned$wts <- 1/(binned$y + .5) + 1/(binned$N - binned$y + .5)
29 | 
30 | model <- lmer(elog ~ (TargetCoded + SexCoded + Age + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned, weights=1/wts)
31 | summary(model)
32 |   # weird, no main effect of TargetCoded anymore...
33 |   # why?
34 |   # because Age isn't centered, and so it's estimating main effects at Age 0 (newborn...)
35 | 
36 | model <- lmer(elog ~ (TargetCoded + SexCoded + scale(Age,center=T,scale=F) + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned, weights=1/wts)
37 | summary(model)
38 |   # fixes it...
39 | 
40 | # let's center Age for good, now
41 | binned$AgeC <- scale(binned$Age, scale=F, center=T)
42 |   
43 | binned$Arcsin <- asin(sqrt(binned$PropAnimal))
44 | 
45 | model <- lmer(Arcsin ~ (TargetCoded + SexCoded + AgeC + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned)
46 | summary(model)
47 | 
48 | drop1(model,~.,test="Chisq")
49 | 
50 | # visualize a model's predictions
51 | # note I use PropAnimal here for interpretable DV's
52 | model <- lmer(PropAnimal ~ (TargetCoded + AgeC + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned)
53 | summary(model)
54 | 
55 | predictions <- data.frame(
56 |                       rep(as.vector(unique(binned$TimeS)),times=4),
57 |                       rep(c(-1,+1),each=(length(unique(binned$TimeS))*2)), # Age, centered (younger/older)
58 |                       rep(c(-.5,.5,-.5,.5),each=length(unique(binned$TimeS))),
59 |                       rep(c('Animal','Artefact','Animal','Artefact'),each=length(unique(binned$TimeS)))
60 |                         )
61 | colnames(predictions) <- c('TimeS','AgeC','TargetCoded','Target')
62 | 
63 | predictions$prediction <- predict(model, re.form=NA, predictions)
64 |   # note: re.form = NA sets all random effects to 0 so that
65 |   # it works on new data!!
66 | 
67 | ggplot(predictions, aes(x=TimeS, y=prediction, color=Target)) +
68 |                                                     geom_point() +
69 |                                                     geom_line() +
70 |                                                     facet_grid(.~AgeC)
71 | 
72 |   # this is how a linear-only model reconstructs our data...
73 |   # could be better...
74 | 
75 | # what happens when we mess with the Age column?
76 | 
77 | binned$AgeC <- factor(binned$AgeC)
78 | 
79 | model <- lmer(Arcsin ~ (TargetCoded + SexCoded + AgeC + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned)
80 | summary(model)
81 | 
82 | binned$AgeC <- as.numeric(as.character(binned$AgeC))
83 | 
84 | model <- lmer(Arcsin ~ (TargetCoded + SexCoded + AgeC + TimeS)^2 + (1 + TargetCoded + TimeS | ParticipantName), data=binned)
85 | summary(model)
86 |   # better...
87 | 
88 | 


--------------------------------------------------------------------------------
/old/README.md:
--------------------------------------------------------------------------------
 1 | tutorial-GCA-R
 2 | ==============
 3 | 
 4 | This tutorial uses simulated data to compare mixed-effects growth curve analyses (GCAs) to
 5 | traditional ANOVA and natural polynomial mixed-effects modelling approaches.
 6 | 
 7 | Everything I know about GCA, I learned from Dan Mirman (http://www.danmirman.org/gca)
 8 | and by playing around in R. Here's a good Mirman paper on the topic:
 9 | 
10 |    Mirman, D., Dixon, J., & Magnuson, J. S. (2008). Statistical and computational models of the  
11 |       visual world paradigm: Growth curves and individual differences. Journal of Memory and
12 |       Language, 59, 474–494.
13 | 
14 | Requires: install.packages('lmer','ggplot2','plyr','ez')
15 | 
16 | @author: Brock Ferguson
17 | @website: brockferguson.com
18 | 
19 | @modified June 29, 2014
20 | 


--------------------------------------------------------------------------------
/old/gca-tutorial.R:
--------------------------------------------------------------------------------
  1 | #
  2 | # Tutorial: Growth Curve Analyses in R
  3 | #
  4 | # This tutorial uses simulated data to compare mixed-effects growth curve analyses (GCAs) to
  5 | # traditional ANOVA and natural polynomial mixed-effects modelling approaches.
  6 | #
  7 | # Everything I know about GCA, I learned from Dan Mirman (http://www.danmirman.org/gca)
  8 | # and by playing around in R. Here's a good Mirman paper on the topic:
  9 | #
 10 | #   Mirman, D., Dixon, J., & Magnuson, J. S. (2008). Statistical and computational models of the  
 11 | #      visual world paradigm: Growth curves and individual differences. Journal of Memory and
 12 | #      Language, 59, 474–494.
 13 | #
 14 | # Requires: install.packages('lmer','ggplot2','plyr','ez')
 15 | #
 16 | # @author: Brock Ferguson
 17 | # @website: brockferguson.com
 18 | #
 19 | # @modified June 29, 2014
 20 | #
 21 | 
 22 | #####################################################
 23 | # (1) Simulate Data
 24 | #
 25 | # Generate a random dataset for an experiment that has 2 conditions (Practice and Control).
 26 | # Each participant is assessed for their "accuracy" at 10 timepoints. In the Practice condiition,
 27 | # they rapidly become more accurate but then taper ~63%. In the Control condition, they get better
 28 | # at a constant rate, and also max out around ~63%. Thus the group's differences do not differ in their
 29 | # linear rate of improvement, maximum improvement, or starting point (~50%), but in their 
 30 | # growth trajectory. This is an ideal problem for Growth Curve Analyses, because we are interested
 31 | # in the "shape" or functional form of their improvement over time, and we have multiple measurements
 32 | # from the same individuals (i.e., it's longitudinal, not cross-sectional).
 33 | #####################################################
 34 | 
 35 | # set parameters for Subjects and Timepoints
 36 | # these are configured at 240/10 by default, and things may look different
 37 | # if you change them
 38 | num_subjects <- 240
 39 | num_timepoints <- 10
 40 | 
 41 | # create dataframe and begin populating it
 42 | data <- data.frame(matrix(nrow=(num_subjects*num_timepoints), ncol=4))
 43 | colnames(data) <- c('Subject','Condition','Time','Accuracy')
 44 | data$Subject <- paste('SUB',rep(1:num_subjects, each=num_timepoints),sep="")
 45 | data$Condition <- rep(c('Control','Practice'), each=((num_subjects*num_timepoints)/2))
 46 | data$Time <- rep(1:num_timepoints,times=num_subjects)
 47 | 
 48 | # generate data for each participant...
 49 | 
 50 | # control condition gets slightly better over time
 51 | data[which(data$Condition == 'Control'), 'Accuracy'] <- seq(.5,.64,by=(.15/num_timepoints))
 52 | 
 53 | # but the practice+sleep condition improves logarthmically
 54 | data[which(data$Condition == 'Practice'), 'Accuracy'] <- .5 + log(1 + ((data[which(data$Condition == 'Practice'), 'Time'] - 1)*.55), 450000)
 55 | 
 56 | # and make sure our data has the proper column formats
 57 | data$Subject <- factor(data$Subject)
 58 | data$Condition <- factor(data$Condition)
 59 | 
 60 | #####################################################
 61 | # (2) Visualization and summaries
 62 | #####################################################
 63 | 
 64 | # first, let's take a look at our "perfect" data (before we add some random noise)
 65 | 
 66 | library(ggplot2)
 67 | 
 68 | # this is what we want our models to eventually discover in the noise...
 69 | ggplot(data=data, aes(x=Time, y=Accuracy, group=Condition, colour=Condition)) + geom_line() + geom_point()
 70 | 
 71 | # add said noise
 72 | data$Accuracy <- data$Accuracy + rnorm(num_subjects*num_timepoints,0,.05)
 73 | data[which(data$Accuracy > 1), 'Accuracy'] <- 1
 74 | data[which(data$Accuracy < 0), 'Accuracy'] <- 0
 75 | 
 76 | # visualize with noise
 77 | ggplot(data=data, aes(x=Time, y=Accuracy, group=Condition, colour=Condition)) + geom_point()
 78 | 
 79 | # take some summary statistics
 80 | 
 81 | library(plyr)
 82 | 
 83 | # just grab total accuracy by condition
 84 | summary <- ddply(data, .(Condition), summarize, MeanAccuracy = mean(Accuracy), sd = sd(Accuracy), N = length(Accuracy))
 85 | summary$se <- summary$sd / sqrt(summary$N)
 86 | 
 87 | # hold up though, those N's are way too big. collapse by subjects first.
 88 | subjects <- ddply(data, .(Subject,Condition), summarize, SubjectAccuracy = mean(Accuracy))
 89 | 
 90 | summary <- ddply(subjects, .(Condition), summarize, MeanAccuracy = mean(SubjectAccuracy), sd = sd(SubjectAccuracy), N = length(SubjectAccuracy))
 91 | summary$se <- summary$sd / sqrt(summary$N)
 92 | 
 93 | # effects are looking strong
 94 | 
 95 | # now breakdown by timepoint...
 96 | summary_time <- ddply(data, .(Condition,Time), summarize, MeanAccuracy = mean(Accuracy), sd = sd(Accuracy), N = length(Accuracy))
 97 | summary_time$se <- summary_time$sd / sqrt(summary_time$N)
 98 | 
 99 | #####################################################
100 | # (3) Standard ANOVA Analysis
101 | #####################################################
102 | 
103 | library(ez)
104 | 
105 | # ANOVA (Type-III) predicting Accuracy from Condition (between) and Time (within, nested under Subject)
106 | model <- ezANOVA(data, type = 3, dv = 'Accuracy', wid = Subject, between = .(Condition), within = .(Time))
107 | model
108 |   # effects:
109 |   #   Condition, p < .001
110 |   #   Time, p < .001
111 |   #   Interaction, ???
112 |   #     most of the time, we won't see an interaction here (depends on your random data)
113 |   #     - if it's reliable, we would conclude that Practice participants improved more quickly over time (wrong)
114 |   #     - if it's insignificant, we would conclude that they improved at the same rate (wrong)
115 |   #     this type of analysis just can't capture the form of the growth, and will lead us in wrong directions.
116 | 
117 | #####################################################
118 | # (4) Hierarchical Linear Regression
119 | #####################################################
120 | 
121 | library(lme4)
122 | 
123 | # let's just replicate our ANOVA analysis first...
124 | model <- lmer(Accuracy ~ Time*Condition + (1 | Subject), data = data)
125 | summary(model)
126 |   # weird! now our main effect is less reliable...
127 |   #   what happened?
128 |   #   ezANOVA centered our predictors automatically, and so the main effect in the ANOVA
129 |   #   was captured at t==0 or, theoretically speaking, timepoint 5.5
130 | 
131 | # center variables and try again...
132 | data$TimeC <- data$Time - mean(data$Time)
133 | 
134 | model <- lmer(Accuracy ~ TimeC*Condition + (1 | Subject), data = data)
135 | summary(model)
136 |   # better, but still suffering from the same problems as the ANOVA because we are only including
137 |   # a term that can inherently only capture linear growth (i.e., linear time)
138 | 
139 | #####################################################
140 | # (5) Growth Curve Analysis
141 | #####################################################
142 | 
143 | # first, let's review what we've been doing in the previous 2 analyses...
144 | # essentially: correlating a perfectly linear variable (TimeC) with our DV (Accuracy)
145 | # and seeing if these correlations differed by an IV (Condition)
146 | # this would isolate differences in linear growth (i.e., slope)
147 | 
148 | # let's re-create that manually... (of course, breaking all DF rules)
149 | practice <- data[which(data$Condition == 'Practice'), ]
150 | control <- data[which(data$Condition == 'Control'), ]
151 | 
152 | cor.test(control$TimeC, control$Accuracy)
153 |   # r ~ .67
154 |   
155 | cor.test(practice$TimeC, practice$Accuracy)
156 |   # r ~ .62
157 | 
158 | # the linear models we used compared these slopes and (probably) told us they were not different
159 | # there, slopes == linear coefficients
160 | # (note: the main effect of condition in prior analyses is because the Time slope in the practice condition
161 | #        is lower than the slope in the control condition... when extrapolating to timepoint=0,
162 | #        this makes the Practice condition slightly higher)
163 | 
164 | # in GCA, we do the same basic thing except we estimate linear and non-linear coefficients
165 | 
166 | # let's continue to do this manually just to begin
167 | # TimeC is linear, let's create a quadratic vector (we know our data is quadratic)
168 | data$TimeQuad <- data$Time^2
169 | data$TimeQuadC <- scale(data$TimeQuad, center = T, scale = F)
170 | practice <- data[which(data$Condition == 'Practice'), ]
171 | control <- data[which(data$Condition == 'Control'), ]
172 | 
173 | # take a look...
174 | data[1:10, 'Time']
175 | data[1:10, 'TimeQuad']
176 | 
177 | cor.test(control$TimeQuadC, control$Accuracy)
178 |   # r ~ .66
179 | 
180 | cor.test(practice$TimeQuadC, practice$Accuracy)
181 |   # r ~ .58
182 | 
183 | # these coefficients are very similar to the linear TimeC
184 | # is the data linear or quadratic? or both? and do they differ by condition?
185 | # in order to answer these questions, we can fit them simultaneously to see which one
186 | # is the best predictor of Accuracy (as we would with any 2 variables of interest)
187 | 
188 | model <- lmer(Accuracy ~ Condition + TimeC + TimeQuadC + TimeC:Condition + TimeQuadC:Condition + (1 | Subject), data = data)
189 | summary(model)
190 |   # we should now see lots of significant effects, giving us some hint of form.
191 |   # however, this look at form -- using natural polynomials -- is not ideal
192 |   # because natural polynomials correlate with one another and, just as with any 
193 |   # 2 variables, when there is significant correlation between two IV's, our ability to
194 |   # attribute variance to them is hindered.
195 | 
196 | # to demonstrate: TimeC and TimeQuadC are strongly correlated.
197 | cor.test(data[1:10, 'TimeC'], data[1:10, 'TimeQuadC'])
198 |   # p ~ .97
199 | 
200 | # so when our linear model is attempting to parcel variance, it has no idea where it goes.
201 | # solution: replace our time codes with *orthogonal* polynomial contrast codes
202 | # these are, by definition, uncorrelated vectors of N length which perfectly fit linear,
203 | # quadratic, cubic, quartic, etc. functions
204 | 
205 | # let's confirm what I said by generating 3: linear, quadratic, and cubic
206 | polycodes <- poly(1:10, 3)
207 | 
208 | # linear code
209 | polycodes[, 1]
210 | 
211 | # quadratic code
212 | polycodes[, 2]
213 | 
214 | # cubic code
215 | polycodes[, 3]
216 | 
217 | # visualize these polycodes...
218 | poly_df <- data.frame(matrix(nrow=30, ncol=3))
219 | poly_df[, 1] <- paste('Poly', rep(1:3, each=10), sep='')
220 | poly_df[, 2] <- rep(1:10, times=3)
221 | poly_df[1:10, 3] <- polycodes[, 1]
222 | poly_df[11:20, 3] <- polycodes[, 2]
223 | poly_df[21:30, 3] <- polycodes[, 3]
224 | colnames(poly_df) <- c('Code','Timepoint','Value')
225 | 
226 | ggplot(data=poly_df, aes(x=Timepoint, y=Value, group=Code, colour=Code)) + geom_line() + geom_point()
227 | 
228 | # they better not be correlated... or we're going to run into the same problem.
229 | cor(polycodes)
230 | round(cor(polycodes), 2)
231 |   # not at all, nice!
232 | 
233 | # the linear code correlates 100% with our previous linear time code
234 | cor.test(polycodes[, 1], data[1:10, 'TimeC'])
235 | 
236 | # the quadratic code correlates only partially with our old code, because this
237 | # is a complete quadratic function and not just our exponential function
238 | cor.test(polycodes[, 2], data[1:10, 'TimeQuadC'])
239 | 
240 | # but because they don't correlate with eachother, we can substitute them into a model
241 | # and see how they predict our DV
242 | # we will also simultaneously compare the estimated coefficients for each Condition (i.e., the
243 | # interactions) to see if they differ
244 | 
245 | # add them into our dataset
246 | data$ot1 <- rep(polycodes[, 1],times=num_subjects)
247 | data$ot2 <- rep(polycodes[, 2],times=num_subjects)
248 | data$ot3 <- rep(polycodes[, 3],times=num_subjects)
249 | 
250 | # now fit a GCA model with these new orthogonal polynomials
251 | model <- lmer(Accuracy ~ ot1*ot2*Condition + (1 | Subject), data = data)
252 | summary(model)
253 | 
254 | # this is good! but we need to add a couple of things:
255 | # to this point, all we have done is vary the intercept by subject.
256 | # when we are looking at a within-subjects variable (time, or our orthogonal timecodes)
257 | # we should allow the model to vary those by subject too
258 | 
259 | # fit another model...
260 | model <- lmer(Accuracy ~ ot1*ot2*Condition + (1 + ot1 + ot2 | Subject), data = data)
261 | summary(model)
262 | 
263 | # in our world of perfect data, it doesn't make much difference, but it's good practice
264 | 
265 | # visualize our raw data with our model fit on top
266 | ggplot(data=data, aes(x=Time, y=Accuracy, group=Condition, colour=Condition)) + geom_point() + stat_summary(aes(y=fitted(model)), fun.y=mean, geom="line")
267 |   # we recovered the original form pretty well!
268 | 
269 | # what happens we add in the cubic parameter?
270 | model <- lmer(Accuracy ~ ot1*ot2*ot3*Condition + (1 + ot1 + ot2 + ot3 | Subject), data = data)
271 | summary(model)
272 | 
273 | # pretty messy...
274 | # we can use model comparison to tell us if this parameter is justified
275 | 
276 | model_null <- lmer(Accuracy ~ ot1*ot2*Condition + (1 + ot1 + ot2 + ot3 | Subject), data = data)
277 | anova(model,model_null)
278 |   # p ~ .78... nope
279 | 
280 | 


--------------------------------------------------------------------------------
/probabilities-refresher.R:
--------------------------------------------------------------------------------
 1 | # Let's say we start with a probability of .8
 2 | prob <- .8
 3 | 
 4 | # we can calculate the odds that we get a "hit"
 5 | # .8 / .2
 6 | odds <- prob / (1-prob)
 7 | 
 8 | # we can then calculate the log-odds, used in logistic regression
 9 | log_odds <- log(odds)
10 | log_odds <- log(prob / (1-prob))
11 | 
12 | # and if we want to convert the log-odds back to probability
13 | prob2 <- exp(log_odds) / (1+exp(log_odds))
14 | 
15 | # empirical logit is just like log-odds except we add a constant
16 | # to keep from getting odds of -Inf
17 | hits <- 8
18 | total <- 10
19 | empirical_logit <- log( (hits + .5) / (total - hits + .5) )
20 | 
21 | # compare probabilities to log-odds
22 | probs <- seq(.1,.9,by=.1)
23 | probs_logodds <- log(probs / (1-probs))
24 | probs_arcsin <- asin(sqrt(probs))
25 | 
26 | # both Arcsin and log-odds do a nice job of extending the tails of the probability
27 | # distribution
28 | plot(probs,probs_logodds)
29 | plot(probs,probs_arcsin)


--------------------------------------------------------------------------------
/tutorial1-introduction.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 |   title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 1: Introduction to R"
  3 |   author: "Brock Ferguson"
  4 |   date: "July 1, 2015"
  5 |   output:
  6 |     html_document:
  7 |       toc: true
  8 |       theme: readable
  9 | ---
 10 | 
 11 | # Dependencies
 12 |   
 13 | Install the packages we'll be using today, and load ggplot -- which we'll use in this first tutorial.
 14 | 
 15 | ```{r quiet=T}
 16 | # commented out because I already have these
 17 | # install.packages('reshape2','dplyr','ggplot2','lme4','lmerTest')
 18 | 
 19 | library(ggplot2)
 20 | ```
 21 | 
 22 | # Working with dataframes
 23 | 
 24 | We are going to use dataset from Wordbank (http://wordbank.stanford.edu/) for these
 25 | early tutorials. This dataset tracks boys' and girls' productive vocabularies
 26 | through toddlerhood. To make things easier, I have aggregated the dataset across
 27 | subjects to yield a median vocabulary size by gender and age, with confidence intervals.
 28 | 
 29 | ```{r}
 30 | vocab <- read.csv('data-vocab.csv')
 31 | ```
 32 | 
 33 | This dataset is loaded as a "dataframe" -- a series of columns.
 34 | You can see this dataframe now in our "Environment" window if you are using RStudio.
 35 | 
 36 | Each column in a dataframe can be of a unique type.
 37 | 
 38 | Let's see what types of columns we have:
 39 | 
 40 | ```{r}
 41 | summary(vocab)
 42 | ```
 43 | 
 44 | We don't need the X column (it's just indicates the row number) so we will get rid of it
 45 | by selecting the columns of the dataframe using a vector representing the range 2 to 7.
 46 | 
 47 | ```{r}
 48 | vocab <- vocab[, c(2:7)]
 49 | summary(vocab)
 50 | ```
 51 | 
 52 | We can select one column and assign it to a new vector variable.
 53 | 
 54 | ```{r}
 55 | medians <- vocab[, 'median']
 56 | ```
 57 | 
 58 | This vector is now listed in the Environment as well -- under "values".
 59 | 
 60 | We can also select rows by number. Because we haven't assigned their values to a new variable, they will just be printed to the screen.
 61 | 
 62 | ```{r}
 63 | vocab[c(1:5), ]
 64 | vocab[c(1,2,3,4,5), ]
 65 | ```
 66 | 
 67 | ... Or combine row and column selections...
 68 | 
 69 | ```{r}
 70 | vocab[5, 'gender']
 71 | vocab[c(1:5), c('age','gender')]
 72 | ```
 73 | 
 74 | Want to search through a dataframe for matching columns? We can do that too.
 75 | 
 76 | ```{r}
 77 | vocab[which(vocab$gender == 'F'), ]
 78 | vocab[which(vocab$gender == 'F' & vocab$age == 17), ]
 79 | vocab[which(vocab$age == 18 | vocab$age == 17), ]
 80 | vocab[which(vocab$age > 16 & vocab$age < 19), ]
 81 | ```
 82 | 
 83 | Using assignment, we can overwrite values in a dataframe.
 84 | For example, let's set the median vocab for 17-month-old females to 1000.
 85 | 
 86 | ```{r}
 87 | vocab[which(vocab$gender == 'F' & vocab$age == 17), 'median'] <- 1000
 88 | vocab[which(vocab$gender == 'F' & vocab$age == 17), ] # now shows 1000
 89 | 
 90 | vocab[which(vocab$gender == 'F' & vocab$age == 17), c('median','ci.h')] <- c(1000,NA)
 91 | 
 92 | # reset it now...
 93 | vocab[which(vocab$gender == 'F' & vocab$age == 17), c('median','ci.h')] <- c(75,37)
 94 | ```
 95 | 
 96 | # Vectors
 97 | 
 98 | R is a vectorized language, meaning that we can perform operations on entire vectors
 99 | in parallel and not have to operate on each cell one-at-a-time.
100 | 
101 | Take our vector of medians, for example:
102 | 
103 | ```{r}
104 | medians
105 | ```
106 | 
107 | Let's perform some mathematical operations on it:
108 | 
109 | ```{r}
110 | medians + 2
111 | medians*4
112 | (medians^2) / 4
113 | log(medians)
114 | exp(medians)
115 | sqrt(medians)
116 | ```
117 | 
118 | Those vectorized functions return new vectors that have been transformed in some way.
119 | 
120 | We can also use summary functions to boil our medians vector down to a single meaningful metric:
121 | 
122 | ```{r}
123 | mean(medians)
124 | var(medians)
125 | sqrt(var(medians))
126 | sd(medians)
127 | length(medians)
128 | 
129 | length(vocab)
130 | nrow(vocab)
131 | ncol(vocab)
132 | ```
133 | 
134 | Or combine summary functions and vector transformations to do useful things, like
135 | calculate z-scores:
136 | 
137 | ```{r}
138 | z_scores <- (medians - mean(medians)) / sd(medians)
139 | z_scores
140 | 
141 | # compare to R's functions for z-score scaling...
142 | z_scores <- scale(medians,center=T,scale=T)
143 | ```
144 | 
145 | Almost all functions can apply to vectors, even non-mathematical functions:
146 | 
147 | For example, `paste()` combines strings together:
148 | 
149 | ```{r}
150 | paste('this','is','a','sentence',sep=' ')
151 | paste('median: ',medians,sep='')
152 | ```
153 | 
154 | And, of course, we can perform operations involving multiple vectors:
155 | 
156 | ```{r}
157 | # vector * integer
158 | medians * rnorm(1, 0, 1)
159 | 
160 | # vector * vector
161 | medians * rnorm(length(medians), 0, 1)
162 | ```
163 | 
164 | # Visualization
165 | 
166 | R includes basic functions for visualizing your data:
167 | 
168 | ```{r}
169 | plot(medians)
170 | hist(medians)
171 | plot(vocab$age, vocab$median)
172 | ```
173 | 
174 | ...but I prefer using ggplot2 which gives us more intuitive options for visualization
175 | and greater control over the appearance:
176 | 
177 | ```{r}
178 | ggplot(vocab, aes(x=age, y=median, color=gender)) +
179 |         geom_point()
180 | 
181 | ggplot(vocab, aes(x=age, y=median, color=gender)) +
182 |         geom_pointrange(aes(min=median-ci.l, max=median+ci.h), position=position_dodge(.4)) +
183 |         labs(x="Age (Months)", y="Productive Vocabulary")
184 | 
185 | ggplot(vocab, aes(x=age, y=median, fill=gender)) +
186 |         geom_bar(stat='identity', position=position_dodge()) +
187 |         geom_errorbar(aes(ymax=median+ci.h, ymin=median-ci.l), position=position_dodge(.9), width=0.25) +
188 |         labs(x="Age (Months)", y="Productive Vocabulary")
189 | ```
190 | 
191 | We'll gradually use more and more advanced aspects of ggplot throughout these tutorials. For now, we are simply plotting single variables from a dataframe (much like we could with Excel). However, by the 4th tutorial, we'll be using ggplot to group, transform, and summarize our (almost) raw data and visualize it in a single shot.
192 | 
193 | # Statistical tests and models
194 | 
195 | Finally, R provides a host of functions to run statistical tests and/or fit models to your data.
196 | 
197 | Let's do a basic correlation just to see how these functions work.
198 | 
199 | First, get just the female data:
200 | 
201 | ```{r}
202 | female_vocab <- vocab[which(vocab$gender == 'F'), c('age','median')]
203 | female_vocab 
204 | ```
205 |                   
206 | ## Does age correlate with productive vocabulary?
207 | 
208 | Plot using basic R plots.
209 | 
210 | ```{r}
211 | plot(female_vocab$age, female_vocab$median)
212 | 
213 | # add a line of best fit
214 | abline(lm(female_vocab$median ~ female_vocab$age))
215 | ```
216 | 
217 | We'll get more into the syntax of `lm()` later, but this function uses a linear model
218 | to predict the 'median' vector from the 'age' vector, then feeds this to `abline()`
219 | to plot this line of best fit.
220 | 
221 | We can do the same thing with ggplot, though it is nice enough to give us the SE around our linear model fit:
222 | 
223 | ```{r}
224 | ggplot(female_vocab, aes(x=age, y=median)) +
225 |                                   geom_point() +
226 |                                   stat_smooth(method="lm")
227 | ```
228 | 
229 | What is the Pearson's r for this correlation?
230 | 
231 | ```{r}
232 | cor(female_vocab$age, female_vocab$median)
233 | ```
234 | 
235 | Is this significant?
236 | 
237 | ```{r}
238 | cor.test(female_vocab$age, female_vocab$median)
239 | ```
240 | 
241 | We can access specific variables in this statistical test's output directly:
242 | 
243 | ```{r}
244 | cor_output <- cor.test(female_vocab$age, female_vocab$median)
245 | 
246 | # useful command to show the available data in any variable
247 | str(cor_output)
248 | 
249 | cor_output$p.value
250 | cor_output$statistic
251 | 
252 | # also:
253 | names(cor_output)
254 | cor_output$conf.int
255 | 
256 | # round the p-value:
257 | cor_output$p.value
258 | round(cor_output$p.value, 3)
259 | 
260 | # how much variance does this account for (e.g., R-squared, R^2)?
261 | cor_output$estimate^2
262 | ```
263 | 
264 | When we run statistical models, the vector we return is not only a collection of variables
265 | (e.g., from which we can select the p-value, estimate, etc.) but also an *object*
266 | on which we can run various *class* methods.
267 | 
268 | ```{r}
269 | class(cor_output)
270 | ```
271 | 
272 | Here we can see we have an "htest" (i.e., "hypothesis test") object.
273 | 
274 | This class is limited to the "print" method.
275 | 
276 | ```{r}
277 | print(cor_output)
278 | ```
279 | 
280 | ... but other classes include other methods like 'plot', 'summary', etc.
281 | 
282 | Let's run another statistical test on some random data. A Welch's t-test:
283 | 
284 | ```{r}
285 | t.test(rnorm(30,5,5), rnorm(30,10,5))
286 | ```
287 | 
288 | We can pass specific options to these tests. One common option for 't.test' is var.equal=T so that we run a standard Student t-test rather than a Welch's t-test.
289 | 
290 | ```{r}
291 | t.test(rnorm(30,5,5), rnorm(30,10,5), var.equal=T)
292 | ```
293 | 
294 | We can also use `paired` to run a dependent, paired t-test, or use `alt` to specify whether we want a
295 | one-sided or two-sided p-value.
296 | 
297 | ```{r}
298 | t.test(rnorm(30,5,5), rnorm(30,10,5), paired=T)
299 | t.test(rnorm(30,5,5), rnorm(30,10,5), alt='less',var.equal=T) # by default: 'two.sided'
300 | ```
301 | 
302 | What class is a `t.test()` object?
303 | 
304 | ```{r}
305 | class(t.test(rnorm(30,5,5), rnorm(30,10,5), var.equal=T))
306 | ```
307 | 
308 | It's the same as `cor.test()`, which is why the output is similar.
309 | 
310 | Finally, we loaded our data from a CSV, now let's save a dataset to a CSV file.
311 | 
312 | ```{r}
313 | write.csv(female_vocab, 'data-vocab-females.csv')
314 | ```
315 | 
316 | Clean up our workspace.
317 | 
318 | ```{r}
319 | ls()
320 | rm(list=ls())
321 | ```


--------------------------------------------------------------------------------
/tutorial2-general-linear-models.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 |   title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 2: General Linear Models"
  3 |   author: "Brock Ferguson"
  4 |   date: "July 1, 2015"
  5 |   output:
  6 |     html_document:
  7 |       toc: true
  8 |       theme: readable
  9 | ---
 10 |   
 11 | Load packages we'll be using:
 12 | 
 13 | ```{r}
 14 | library(ggplot2)
 15 | ```
 16 | 
 17 | Load in our vocabulary data again.
 18 | 
 19 | ```{r}
 20 | vocab <- read.csv('data-vocab.csv')
 21 | ```
 22 | 
 23 | # Introduction to lm()
 24 | 
 25 | Re-run our correlation as a basic linear model:
 26 | 
 27 | ```{r}
 28 | model <- lm(median ~ age, data=subset(vocab, gender == 'F'))
 29 | ```
 30 | 
 31 | What methods are available to us on this object?
 32 | 
 33 | ```{r}
 34 | class(model)
 35 | 
 36 | model
 37 | print(model)
 38 | 
 39 | summary(model)
 40 | 
 41 | plot(model)
 42 | ```
 43 | 
 44 | # lm() and aov() comparison
 45 | 
 46 | Let's do some analyses with our vocab dataset to compare the different linear modeling
 47 | methods made available by R.
 48 | 
 49 | First, let's begin by visualizing the raw data again:
 50 | 
 51 | ```{r}
 52 | ggplot(vocab, aes(x=age, y=median, color=gender)) +
 53 |                     geom_pointrange(aes(min=median-ci.l, max=median+ci.h), position=position_dodge(.4)) +
 54 |                     scale_x_continuous(name="Age (Months)") +
 55 |                     scale_y_continuous(name="Productive Vocabulary")
 56 | ```
 57 | 
 58 | ## Single factor models
 59 | 
 60 | Let's compare `lm()` (Linear Model) and `aov()` (Analysis of Variance) in modeling these data.
 61 | 
 62 | Note: For simplicity, we are going to treat each datapoint here as if it were a different subject, as if they were 1 subject of each gender at each age.
 63 | 
 64 | ```{r}
 65 | summary(lm(median ~ age, data=subset(vocab, gender == 'F')))
 66 | 
 67 | summary(aov(median ~ age, data=subset(vocab, gender == 'F')))
 68 | ```
 69 | 
 70 | What we expect is that a single term linear model should give us the exact same results as a one-way ANOVA. This is actually what we see when compare these models. Although the ANOVA did not return a precise estimate of the slope (like `lm()`), their p-values are identical and the ANOVA's F-value is exactly the linear model's t, squared:
 71 | 
 72 | ```{r}
 73 | sqrt(630.2)
 74 | ```
 75 | 
 76 | ## Two factor models, no interaction
 77 | 
 78 | What happens when we add in a second main effect (but no interaction)?
 79 | 
 80 | ```{r}
 81 | summary(lm(median ~ age + gender, data=vocab))
 82 | summary(aov(median ~ age + gender, data=vocab))
 83 | ```
 84 | 
 85 | Perfect again! Exact same results using both methods which is what we would expect when we have a two-way, non-interaction ANOVA and a two-parameter linear model.
 86 | 
 87 | ## Two factor models, with interactions
 88 | 
 89 | What happens when we allow our two factors to interact in predicting vocabulary?
 90 | 
 91 | ```{r}
 92 | summary(lm(median ~ age*gender, data=vocab))
 93 | summary(aov(median ~ age*gender, data=vocab))
 94 | ```
 95 | 
 96 | Now we see (a) vastly different results between the models and, (b) strange results within the linear model especially. Let's break this down.
 97 | 
 98 | In our linear model, we still have our significant effect of age (which is good) but now no reliable effect of gender (which is bad) and a positive estimate of being male (which is very bad).
 99 | 
100 | In our ANOVA, we see a wildly significant effect of age, and a wildly significant effect of gender.
101 | 
102 | What's going on?
103 | 
104 | ## Why are lm() and aov() different?
105 | 
106 | The answer to this questions lays in (a) the ways the models are fitted, and (b) the way our parameters are coded before running the model. We normally don't have to think about either of these things in SPSS, for example, because they are done for us automatically.
107 | 
108 | ### lm()
109 | 
110 | Let's first look at the linear model which, at first glance, seems the most questionable.
111 | 
112 | Specifically, (1) why did it not return a main effect of gender?
113 |               (2) why is its estimate of the effect of being male positive, when each male "subject" is lower than the female at that same age?
114 | 
115 | Let's re-fit our model and save its results in a variable.
116 | 
117 | ```{r}
118 | model <- lm(median ~ age*gender, data=vocab)
119 | ```
120 | 
121 | We want to know what this model fit looks like. We can do this by retrieving the predictions of our model and storing them in our dataframe.
122 | 
123 | ```{r}
124 | vocab$predictions <- predict(model, vocab)
125 | ```
126 | 
127 | Rather than comparing our predictions and actual values line-by-line, it's best to visualize them on a single plot.
128 | 
129 | Here we will plot the data like we did before, however we are now going to add a new "layer" to the plot which will include our predicted values.
130 | 
131 | ```{r}
132 | ggplot(vocab, aes(x=age, y=median, color=gender)) + 
133 |                     geom_point() + # the raw data, plots the 'y' aesthetic above
134 |                     geom_line(aes(y=predictions)) + # our predictions, plotted as a line
135 |                     scale_x_continuous(name="Age (Months)") +
136 |                     scale_y_continuous(name="Predicted Vocabulary")
137 | ```
138 | 
139 | Looks like a good fit! Indeed, our model fit maps exactly on to the line of best we can generate using ggplot:
140 | 
141 | ```{r}
142 | ggplot(vocab, aes(x=age, y=median, color=gender)) + 
143 |                     geom_point() + # the raw data, plots the 'y' aesthetic above
144 |                     stat_smooth(method="lm", se = FALSE) + # add a line of best fit, without SE
145 |                     scale_x_continuous(name="Age (Months)") +
146 |                     scale_y_continuous(name="Predicted Vocabulary")
147 | ```
148 | 
149 | So what's wrong with our model?
150 | 
151 | Let's think about we want to know with each parameter:
152 | 
153 | For Age, we want to know what the effect of Age is while controlling for Gender.
154 | 
155 | For Gender, we want to know what the effect of Gender is while controlling for Age.
156 | 
157 | When we say "Controlling" in a linear model, what we actually mean is "while holding that parameter constant at zero."
158 | 
159 | If we set Age to 0 in our current model, as the variables are currently coded, we are essentially asking for the effect of Gender on a newborn's vocabulary.
160 | 
161 | We can extrapolate our model's predictions down to a newborn to take the model's perspective on this question...
162 | 
163 | ```{r}
164 | vocab_extrapolations <- data.frame(
165 |                               c(rep(0:30,each=2)),
166 |                               c(rep(c('M','F'),times=31)),
167 |                               c(rep(0,62))
168 |                               )
169 | colnames(vocab_extrapolations) <- c('age','gender','prediction')
170 | 
171 | vocab_extrapolations$prediction <- predict(model, vocab_extrapolations)
172 | 
173 | ggplot(vocab_extrapolations, aes(x=age, y=prediction, color=gender)) + 
174 |                                     geom_line() +
175 |                                     scale_x_continuous(name="Age (Months)") +
176 |                                     scale_y_continuous(name="Predicted Vocabulary")
177 | ```
178 | 
179 | Now everything should begin to make sense. Our non-significant, positive effect of being a Male is the model telling us that it thinks male newborns have a slight advantage on female newborns. This obviously doesn't make sense.
180 | 
181 | This wasn't a problem for us before adding the interaction term because, without the interaction, it's estimate of the effect of Gender is constant and therefore equally reliable at all ages.
182 | 
183 | To fix this problem, let's *center* the age variable so when we tell the model to give us the effect of gender while controlling for Age, we are holding Age constant at the *mean* age for our sample.
184 | 
185 | ```{r}
186 | vocab$ageC <- scale(vocab$age, center=T, scale=F)
187 | 
188 | summary(lm(median ~ ageC*gender, data=vocab))
189 | ```
190 |   
191 | This is better: a significant, negative effect for gender.
192 | 
193 | Now, we have to think about our effect of Age. As things stand, when we hold Gender constant at zero, we are actually fixing the gender parameter at "Female." This "treament coding" of variables is the default for factor columns in R dataframes.
194 | 
195 | We will fix this in the exact same way as before: centering. To center a non-numeric factor variable with only 2 levels, we will re-code it as a numeric variable where our two groups differ by a total of 1 unit. This is called deviation coding:
196 | 
197 | ```{r}
198 | vocab$genderC <- ifelse(vocab$gender == 'F', -.5, .5)
199 | vocab$genderC <- scale(vocab$genderC, center=T, scale=F) # make sure it's centered
200 | 
201 | summary(lm(median ~ ageC*genderC, data=vocab))
202 | ```
203 | 
204 | Now we have a proper model!
205 | 
206 | The moral here is that we need to pay close attention to the way our variables are treated (e.g., centered, coded, standardized) because they change the way the model's effects are estimated and intepreted. It's more of an issue in R than elsewhere because, unlike SPSS and other packages which do all of this automatically in the background, R won't assume you want these transformations done in the background.
207 | 
208 | I see mistakes with this all of the time. Here are my recommendations when fitting any linear model in R:
209 | 
210 |   (1) Deviation code and center all variables, by default.
211 |   
212 |   (2) Ask yourself, do the model's estimates make sense intuitively?
213 |   
214 |   (3) Visualize your raw data and model predictions.
215 |   
216 |   (4) Try to replicate main effect parameter's estimate using arithmetic. For example:
217 |   
218 | ```{r}
219 | # our intercept should be equivalent to our total mean
220 | mean(vocab$median)
221 | 
222 | # our gender effect should equal the difference between our male and female groups
223 | mean(vocab[which(vocab$gender=='F'), 'median']) - mean(vocab[which(vocab$gender=='M'), 'median'])
224 | ```
225 | 
226 | The reason we have to be so careful with `lm()` is that what it's giving us is *simple* effects -- estimates of the effect of a single variable while holding other independent variables constant at one level.
227 | 
228 | What we usually want are *main effects* -- estimates of the effect of a single variable while holding all other independent variables at their average.
229 | 
230 | By centering and using deviation-coding, we can get main effects from `lm()`.
231 | 
232 | ### aov()
233 | 
234 | But what `aov()`? If you remember, it wasn't so bad. It already knew there was a reliable main effect of gender (though we don't know which direction, because we haven't plotted its predictions).
235 | 
236 | Is it better than `lm()`? No!
237 | 
238 | Here's why:
239 | 
240 | ```{r}
241 | # generate three random vectors
242 | random_dv <- rnorm(30,5,5)
243 | random_iv1 <- rnorm(30,5,5)
244 | random_iv2 <- rnorm(30,5,5)
245 | 
246 | # fit a model using these random vectors
247 | summary(aov(random_dv ~ random_iv1 + random_iv2))
248 | 
249 | # now enter them in the opposite order:
250 | summary(aov(random_dv ~ random_iv2 + random_iv1))
251 | ```
252 | 
253 | We got different answers depending on the order we entered the variables. This is because `aov()` uses Type I (sequential) sums of squares. No amount of re-coding or centering can overcome this property and give us the answers we (probably) want.
254 | 
255 | Notice, however, that changing the order of age and gender in our model as no effect:
256 | 
257 | ```{r}
258 | summary(aov(median ~ age + gender, data=vocab))
259 | 
260 | summary(aov(median ~ gender + age, data=vocab))
261 | ```
262 | 
263 | This is because these predictors are perfectly uncorrelated (because we have a balanced design):
264 | 
265 | ```{r}
266 | cor.test(vocab$age, as.numeric(vocab$gender))
267 | ```
268 |   
269 | With even the slightest correlation, our estimates are going to change based on order of entry in the model.
270 | 
271 | ## When should I use aov()?
272 | 
273 | There is only one time you should use `aov()` -- repeated-measures ANOVAs. In tutorial 3, we will begin using mixed-effects models which, by way of random effects, serve much the same function. However, sometimes you just want a good old-fashioned repeated measures ANOVA and, *provided your design is balanced and everything is coded properly*, you will get an appropriate answer from `aov()`.
274 | 
275 | Here's a quick example:
276 | 
277 | ```{r}
278 | # build a quick dataframe
279 | dv <- c(1,3,4,2,2,3,2,5,6,3,4,4,3,5,6)
280 | subject <- factor(rep(paste0('s',1:5),each=3))
281 | myfactor <- factor(rep(paste0('f',1:3),times=5))
282 | mydata <- data.frame(dv, subject, myfactor)
283 | 
284 | model <- aov(dv ~ myfactor + Error(subject/myfactor), data=mydata)
285 | summary(model)
286 | ```
287 | 
288 | ## Other helpful lm() functions
289 | 
290 | Here are some other helpful functions for linear models:
291 | 
292 | ```{r}
293 | model <- lm(median ~ ageC*genderC, data=vocab)
294 | 
295 | # confint(): get confidence intervals for estimates
296 | confint(model)
297 | 
298 | # pairwise.t.test(): run pairwise.t.tests between groups (ignoring other factors)
299 | pairwise.t.test(vocab$median, vocab$genderC)
300 |   # insignificant, because we aren't controlling for Age
301 | 
302 | # drop1(): drop each parameter and retrieve an F, for reporting like an ANOVA
303 | model <- lm(median ~ ageC*genderC, data=vocab)
304 | drop1(model,~.,test="F")
305 | ```
306 | 
307 | Clean up our workspace.
308 | 
309 | ```{r}
310 | ls()
311 | rm(list=ls())
312 | ```


--------------------------------------------------------------------------------
/tutorial3-generalized-linear-models.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 |   title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 3: Generalized Linear Models"
  3 |   author: "Brock Ferguson"
  4 |   date: "July 1, 2015"
  5 |   output:
  6 |     html_document:
  7 |       toc: true
  8 |       theme: readable
  9 | ---
 10 | 
 11 | Load required packages.
 12 | 
 13 | ```{r}
 14 | library(ggplot2)
 15 | library(lme4)
 16 | library(dplyr)
 17 | ```
 18 | 
 19 | Load our eyetracking dataset. These data are from the familiar world trials for 19-month-olds reported in Ferguson, Graf, & Waxman (2014, Cognition).
 20 | 
 21 | ```{r}
 22 | data <- read.csv('data-eyetracking.csv')
 23 | ```
 24 | 
 25 | What kind of columns do we need for these analyses?
 26 | 
 27 | ```{r}
 28 | summary(data)
 29 | ```
 30 | 
 31 | This dataset is only a few steps removed from what comes out of the standard eyetracker.
 32 | 
 33 | I have done some prep, though:
 34 | 
 35 | * removed trackloss
 36 | * converted GazeX and Y columns to AOI 1 or 0 columns
 37 | * merged in Subject information
 38 | 
 39 | # Analysis 1: By-subjects and By-items ANOVAs (i.e., F1 and F2)
 40 | 
 41 | Pros:
 42 | * simple
 43 | * allows for generalization beyond samples participants and sampled items.
 44 | 
 45 | Cons:
 46 | * lose all timing information
 47 | * what do we do with ambiguities between analyses?
 48 | 
 49 | We need to aggregate across trials by target within participants:
 50 | 
 51 | ```{r}
 52 | agg_subjects <- data %>%
 53 |                 group_by(ParticipantName,Sex,Age,Target) %>%
 54 |                 summarise(PropAnimal = mean(Animate)) %>%
 55 |                 ungroup()
 56 | ```
 57 | 
 58 | Visualize our aggregated data:
 59 | 
 60 | ```{r}
 61 | ggplot(agg_subjects, aes(x=Target, y=PropAnimal)) +
 62 |                   geom_point(position=position_jitter(.3))
 63 | ```
 64 | 
 65 | Use our best practices from the `lm()` tutorial to prepare and model these data:
 66 | 
 67 | ```{r}
 68 | agg_subjects$TargetCoded <- ifelse(agg_subjects$Target == 'Artefact', -.5, .5)
 69 | ```
 70 | 
 71 | There's no need to center here because these data are balanced (every subject is present in both conditions). But this is how we would center, anyways:
 72 | 
 73 | ```{r}
 74 | agg_subjects$TargetCoded <- scale(agg_subjects$TargetCoded, center=T, scale=F)
 75 | ```
 76 | 
 77 | Here we use `aov()` because it allows for a repeated-measures Error() term.
 78 | 
 79 | As we learned before, aov() uses Type I sums of squares, but with only one factor (i.e., no correlation issues), it's safe.
 80 | 
 81 | ```{r}
 82 | model <- aov(PropAnimal ~ TargetCoded + Error(ParticipantName/TargetCoded), data = agg_subjects)
 83 | summary(model)
 84 | ```
 85 | 
 86 | Looks good! That's our F2 "subjects" ANOVA.
 87 | 
 88 | Now we can do an F1 "items" ANOVA as well. This just involves slightly changing our `group_by()` call:
 89 | 
 90 | ```{r}
 91 | agg_items <- data %>%
 92 |               group_by(Trial,Target) %>%
 93 |               summarise(PropAnimal = mean(Animate)) %>%
 94 |               ungroup()
 95 |   
 96 | agg_items$TargetCoded <- ifelse(agg_items$Target == 'Artefact', -.5, .5)
 97 | ```
 98 | 
 99 | Visualize effects by items:
100 | 
101 | ```{r}
102 | ggplot(agg_items, aes(x=Target, y=PropAnimal, fill=Trial)) +
103 |                   geom_point(position=position_jitter(.3))
104 | ```
105 | 
106 | Normally, in an F2 analysis, we would include an Error() term because we would have observed each condition within each item and thus have a sense about the size of the condition effect for each item. However, this was a study with infants and we couldn't do that. So we won't include an Error() term here, and just do a between-subjects one-way ANOVA.
107 | 
108 | ```{r}
109 | model <- aov(PropAnimal ~ TargetCoded, data = agg_items)
110 | summary(model)
111 | ```
112 | 
113 | These F1/F2 analyses are both crystal clear (reject the null!). But, in other cases, there can be ambiguit. For example, what if one is significant and the other is marginal?
114 | 
115 | Ideally, we could have one test which allows us to control for random trial AND subject factors simultaneously. Enter `lmer()`...
116 | 
117 | # Analysis 2: Simultaneous Trial and Subject Random Effects
118 | 
119 | Aggregate data by Trials (Items) and Participants (i.e., one datapoint for each trial).
120 | 
121 | ```{r}
122 | agg_sub_items <- data %>%
123 |                  group_by(ParticipantName,Trial,Target) %>%
124 |                  summarise(PropAnimal = mean(Animate)) %>%
125 |                  ungroup()
126 |   
127 | agg_sub_items$TargetCoded <- ifelse(agg_sub_items$Target == 'Artefact', -.5, .5)
128 | agg_sub_items$TargetCoded <- scale(agg_sub_items$TargetCoded, center=T, scale=F)
129 | ```
130 | 
131 | Fit a mixed-effects model allowing the intercept (represented by a "1") to vary by both Participants and Trials. By allowing the intercept to vary by subjects and items, we are allowing the model to estimate (and thus control for) each participants' and trials' mean tendency to look (cause participants to) look at the animal regardless of condition. For example, Billy may just love animals while Sammy may hate animals. Importantly, we want to know whether they word they heard (represented here by TargetCoded) caused them to look more/less to the animal above and beyond these baseline preferences.
132 | 
133 | ```{r}
134 | model <- lmer(PropAnimal ~ TargetCoded + (1 | ParticipantName) + (1 | Trial), data = agg_sub_items)
135 | summary(model)
136 | ```
137 | 
138 | This looks good, and converges with our previous estimate of a significant effect.
139 | 
140 | Note that, at the top of the summary output, we can see that there is very little variance in our random effect estimates. This means that the model is having trouble estimating them and is therefore keeping them all very near zero.
141 | 
142 | This is likely the result of (a) small random differences between subjects and trials in this sample and, (b) a relatively small dataset. We will want to continue keep an eye on the variance of random effects -- when this is essentially 0, we may want to consider using a regular ANOVA.
143 | 
144 | But in the meantime let's dive into this model a bit more.
145 | 
146 | We can see the fixed effects:
147 | 
148 | ```{r}
149 | fixef(model)
150 | ```
151 | 
152 | Here, because we centered the variable, `Intercept` is the overall mean looking to animal. `TargetCoded` represents the difference between looking to the animal between our two conditions. If we subtract 1/2 of it from the intercept, we get our mean looking to the animal when the "Artefact" was named and, if we add 1/2 of it to the intercept, we get our mean looking to the animal when the "Animal" was named.
153 | 
154 | We can also see random effects (some people use these to describe individual differences):
155 | 
156 | ```{r}
157 | ranef(model)
158 | ```
159 | 
160 | Here they are nearly zero, which corresponds to our assessment of their variance earlier.
161 | 
162 | We can also see what the model's predictions were, for each participant/item. Here we will get the mean prediction for each subject for each type of Target trial.
163 | 
164 | ```{r}
165 | agg_sub_items$prediction <- predict(model, agg_sub_items)
166 | 
167 | ggplot(agg_sub_items, aes(x=Target, y=prediction, color=ParticipantName)) +
168 |                           stat_summary(fun.y='mean', geom='point', position=position_jitter(.3))
169 | ```
170 | 
171 | You can see that the model is making different predictions for each subject. By allowing their intercepts to vary, the model is accounting for the fact that some kids just like looking at animals more than others. But it's being very conservative with its estimates because, frankly, we haven't given it much data to go on.
172 | 
173 | Importantly, with these models, we aren't limited to varying just the intercept by specific grouping factors. We can also vary slopes (i.e., fixed effects) by participants as well. By adding "TargetCoded" as a random slope, we allow the model to vary the magnitude of the difference between Target conditions within participants.
174 | 
175 | ```{r}
176 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
177 | summary(model)
178 | ```
179 | 
180 | We now see an additional random effect by the `ParticipantName` group: `TargetCoded`. We also see that it is actually varying between subjects, suggesting the model is become more confident in the differences between our subjects.
181 | 
182 | Look at which participants responded stronger/weaker to the Target manipulation.
183 | 
184 | These random effects are centered with respect to the fixed effect estimated. Therefore, negative effects represented a weaker effect of TargetCoded for this participant. This could be a cool way to look at individual differences -- who is responding most accurately to the words spoken?
185 | 
186 | ```{r}
187 | ranef(model)
188 | 
189 | subject_slopes <- ranef(model)$ParticipantName
190 | subject_slopes$subject <- factor(rownames(subject_slopes))
191 | colnames(subject_slopes) <- c('Baseline','ConditionEffect','Subject')
192 | 
193 | ggplot(subject_slopes, aes(x=Subject, y=ConditionEffect)) +
194 |               geom_point() +
195 |               geom_text(aes(label=Subject),hjust=0,vjust=0,size=3) +
196 |               geom_hline(yint=0, linetype="dashed", alpha=.5) +
197 |               theme(axis.text.x=element_blank())
198 | ```
199 | 
200 | With this random slope included in the model, we see more variation now in our predictions by subjects. It's even picking up on the fact that participants consistently look to the animal and the real variance lies in whether they look away from the animal when the artefact is named.
201 | 
202 | ```{r}
203 | agg_sub_items$prediction <- predict(model, agg_sub_items)
204 | 
205 | ggplot(agg_sub_items, aes(x=Target, y=prediction, color=ParticipantName)) +
206 |                           stat_summary(fun.y='mean', geom='point', position=position_jitter(.3))
207 | ```
208 | 
209 | # Two important questions:
210 | 
211 | Although you've just played around with your first two mixed-effects models, you're likely already asking yourself two pressing questions...
212 | 
213 | ## Question 1: Which random effect structure should we specify?
214 | 
215 | If you are going to use mixed-effects models, you are going to require AT LEAST a random intercept for every natural "grouping" of data. However, beyond random intercepts, what random slopes should you include?
216 | 
217 | ### Guidelines
218 | 
219 | One way you can decide this is by adhering to guidelines. Unfortunately, you'll find guidelines pull you in opposing directions.
220 | 
221 | For example, one school of thought (see Barr et al., 2013) is to "keep it maximal". That is, include every random effect that your experimental design permits (i.e., every factor that appeared across subjects or trials). Another school of thought (see Bates et al., 2015) is to keep it parsimonious. Don't overcomplexify your models with lots of random slopes, as this will make model estimates increasingly hard to reconcile with the data and risk overparamterizing your model:
222 | 
223 | ```{r, error=TRUE}
224 | # Will return error: too many random effects
225 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded + Trial | ParticipantName) + (1 | Trial), data = agg_sub_items)
226 | ```
227 | 
228 | ### Model comparison
229 | 
230 | A second way to decide is to think bottom-up from the data. Compare two models -- one with your random slope and another without your random slope -- and see if your random slope model is actually a better fit.
231 | 
232 | ```{r}
233 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
234 | 
235 | model_null <- lmer(PropAnimal ~ TargetCoded + (1 | ParticipantName) + (1 | Trial), data = agg_sub_items)
236 | 
237 | anova(model,model_null) # -2 log-likelihood ratio test, gives you Chisq(df) = X and a p-value.
238 | ```
239 | 
240 | We'll talk more about model comparison -- a very powerful tool -- in just a minute.
241 | 
242 | ### Variance of random effects
243 | 
244 | Another bottom-up approach is to look at the variance of the random effect estimates (like we did before) rather than thinking dichotomously about the "significance" of the random slope.
245 | 
246 | ```{r}
247 | summary(model)
248 | ```
249 | 
250 | ### Design limitations
251 | 
252 | A final approach -- and my preferred approach -- is to combine the bottom-up and top-down approaches to ask yourself, what random effects can my design *actually* allow me to estimate with reasonable precision? Here, by "precision", I mean precision with respect to the subject's true random effect.
253 | 
254 | For example, if I had only two trials of each type (one in which I named an animal, and the other in which I named an artefact), I can calculate each subject's *exact* random slope for the effect of Target. This is a very precise estimate, but it's artificially precise -- there's no chance I've accurately captured this subject's responsiveness to which target was named. With a handful of trials of each type, I'm less precise in accounting for my exact data but I'm more precise in estimating their true responsiveness.
255 | 
256 | Think about your design and what it can estimate. Include those variables that it estimates well as random slopes. Ignore those variables that you can't estimate well and even consider collapsing across those observations which you aren't distinguishing with a random effect.
257 | 
258 | ## Question 2: How do we get p-values for specific factors?
259 | 
260 | Unlike `lm()` and `aov()`, `lmer()` doesn't generate p-values... It's unclear, in these kinds of models, how to calculate the degrees of freedom -- required by the 't' statistic -- in order to get a p-value.
261 | 
262 | Nevertheless, we have several options for doing so:
263 | 
264 | ### Use model comparison
265 | 
266 | Fit your model with the factor of interest:
267 | 
268 | ```{r}
269 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
270 | ```
271 | 
272 | Fit a second model without this factor:
273 | 
274 | ```{r}
275 | model_null <- lmer(PropAnimal ~ 1 + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
276 | anova(model,model_null)
277 | ```
278 | 
279 | In this test, the degrees of freedom represent the difference between the number of parameters in the models. Note that this will be [NUMBER OF LEVELS - 1] for a removed factor, i.e., it can be >1 for a single factor.
280 | 
281 | For many factors, you can save time by automating the removal of each factor with `drop1()`:
282 | 
283 | ```{r}
284 | drop1(model,~.,test="Chisq")
285 | ```
286 | 
287 | So what's really going on when we compare these models?
288 | 
289 | The test we are actually running is called a **-2 log-likelihood ratio test**. The gist of this approach is that it compares the likelihood of the data under the full model (including our parameter) to the likelihood of the data under our null model (without our parameter). If the difference is great enough relative to the number of parameters difference between the models (the degrees of freedom in the test), then we can say the difference is significant.
290 | 
291 | Let's do one by hand. We begin by re-fitting both our models with the new option `REML=FALSE`. `anova()` and `drop()` do this automatically for us but, because we are going to do this by hand, we need to do it manually.
292 | 
293 | ```{r}
294 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
295 | 
296 | model_null <- lmer(PropAnimal ~ 1 + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
297 | ```
298 | 
299 | Second, let's extract the log likelihood of the data under the full model and null model:
300 | 
301 | ```{r}
302 | log_model <- logLik(model)
303 | log_model
304 | 
305 | log_null <- logLik(model_null)
306 | log_null
307 | ```
308 | 
309 | Higher is better for log-likelihoods. Higher is more likely.
310 | 
311 | Now calculate -2*log() the ratio of these two numbers, after calculating their natural exponents:
312 | 
313 | ```{r}
314 | -2*log(exp(log_null)/exp(log_model))
315 | ```
316 | 
317 | We are now left with a log-likelihood ratio test statistic of `20.77924`. We can now see whether this is significant by checking the Chi-square distribution with a degrees of freedom equivalent to the difference in parameters between our two models. The df's can be retrieved by the `logLik()` function, or calculated as `(factors removed)*(levels per factor) - (factors removed)`.
318 | 
319 | ```{r}
320 | log_model
321 | log_null
322 |   # 1 df
323 | 
324 | pchisq(20.77924, df=1, lower.tail=F)
325 | ```
326 | 
327 | Note that this technique is only appropriate for nested models -- i.e., comparing a null model which is some subset of a larger model. To compare non-nested models, you should use another criterion such as AIC or BIC (also reported in `drop1()`).
328 | 
329 | For example, is what word the subjects heard a better predictor than a random factor I create?
330 | 
331 | ```{r}
332 | agg_sub_items$RandomFactor <- ifelse(rbinom(nrow(agg_sub_items),1,.5) == 1, 'On','Off')
333 | 
334 | model1 <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
335 | 
336 | model2 <- lmer(PropAnimal ~ RandomFactor + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items, REML=F)
337 | 
338 | anova(model1,model2)
339 | ```
340 | 
341 | For BIC and AIC, lower is better (unlike log-likelihoods). They represent a score representing the fit of the model after penalizing the model for the number of additional parameters. BIC carries a stricter penalty. Interpreting the degree of difference in fit involves using agreed-upon standards, for example, that a <2 BIC difference isn't great evidence but a 2-6 point BIC difference is good, 6-10 is very strong, etc.
342 | 
343 | ### Treat the t's as z's
344 | 
345 | This is not as dangerous as it sounds with sufficient sample sizes.
346 | 
347 | ```{r}
348 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
349 | ts <- fixef(model)/sqrt(diag(vcov(model)))
350 | ts
351 | 2*pnorm(abs(ts),lower.tail=FALSE)
352 | ```
353 | 
354 | ### Just use confidence intervals (new statistics!)
355 | 
356 | ```{r}
357 | confint(model)
358 | ```
359 | 
360 | ### Use an available technique to approximate the degrees of freedom
361 | 
362 | Kenward-Roger approximation:
363 | 
364 | ```{r}
365 | library(pbkrtest)
366 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
367 | model_null <- lmer(PropAnimal ~ 1 + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
368 | KRmodcomp(model,model_null)
369 | detach('package:pbkrtest',unload=TRUE)
370 | ```
371 | 
372 | Satterthwaite approximation:
373 | 
374 | ```{r}
375 | library(lmerTest)
376 | model <- lmer(PropAnimal ~ TargetCoded + (1 + TargetCoded | ParticipantName) + (1 | Trial), data = agg_sub_items)
377 | summary(model)
378 | detach('package:lmerTest',unload=TRUE)
379 | ```
380 | 
381 | # Analysis 3: Linear time analyses using logistic regression
382 | 
383 | All of the analyses thus far have immediately disregarded time as a variable of interest. Yet we know that look changes over time and, moreover, that this change over time can offer insight into the cognitive processes underlying participants' behaviour. So how can we look at behaviour over time?
384 | 
385 | One way to do this is to use logistic regression (family:"binomial") on our raw dataset, predicting the binomial value of looking to the Animate or not (which is either a 1 or 0). This model estimates the log-odds (i.e., probability in logit space) of looking at the animal by the fixed and random factors specified, including a factor for `TimeFromSubphaseOnset`.
386 | 
387 | Logistic regression solves a problem we are now going to start taking seriously. Until now, we were predicting bounded, non-linear proportional data using a linear model. With logistic regression, and future linear models in which we employ transformations, we are going to correct for this.
388 | 
389 | When we do logistic regression, we receive our slope estimates in logits. These represent the log-odds of getting a 1 versus a 0 based on some factor. Here's a plot showing how these relate to proportions (how we typically think of probabilities):
390 | 
391 | ```{r}
392 | probabilities <- seq(0.1,0.9,by=.1)
393 | log_odds <- log(probabilities/(1-probabilities))
394 | plot(probabilities, log_odds)
395 | ```
396 | 
397 | You can see that the probability of .8 roughly equals a log-odds or logit of 1.39. You can take the exponent of this number to calculate the odds when the probability is .8...
398 | 
399 | ```{r}
400 | exp(1.39)
401 | ```
402 | 
403 | ... and you get roughly 4:1 odds, as you would expect.
404 | 
405 | If probabilities/proportions are so easily transformed into logits/log-odds, then why bother doing it? Well, because we are doing LINEAR modeling and log-odds are linear but proportions are not. That is, proportions have a restricted range of 0-1 thus causing ceiling and floor effects. Log-odds, on the other hand, do not.
406 | 
407 | To illustrate: A 2-to-1 increase in odds has a constant, linear effect on the log-odds regardless of where you are in logit space.
408 | 
409 | ```{r}
410 | log(8/1) - log(4/1)
411 | 
412 | log(16/1) - log(8/1)
413 | ```
414 | 
415 | In contrast, a .1 increase in probabilities has a non-constant, non-linear effect in probability space:
416 | 
417 | ```{r}
418 | probabilities <- seq(0.1,0.9,by=.1)
419 | odds <- (probabilities/(1-probabilities))
420 | plot(probabilities, odds)
421 | ```
422 | 
423 | This is why we see a bend at the tails when comparing proportions to log-odds -- floor/ceiling effects mean that small differences in proportions correspond to larger differences in actual odds.
424 | 
425 | Okay, that said, let's run a logistic regression model:
426 | 
427 | ```{r}
428 | data$TargetC <- ifelse(data$Target == 'Animal', .5, -.5)
429 | data$TargetC <- scale(data$TargetC, center=T, scale=F)
430 | 
431 | # note: we are not going to Center TimeFromSubphaseOnset for now
432 | 
433 | model <- glmer(Animate ~ TargetC*TimeFromSubphaseOnset + (1 | Trial) + (1 | ParticipantName), data = data, family="binomial")
434 | ```
435 | 
436 | We are getting some warnings, here. It suggests we should rescale our variables (likely because Target range is -.5 to .5 while TimeFromSubPhaseOnset is 0-5500), and reports a convergence error.
437 | 
438 | Let's try fixing the scale issue...
439 | 
440 | ```{r}
441 | data$TimeS <- data$TimeFromSubphaseOnset / 1000
442 | 
443 | model <- glmer(Animate ~ TargetC*TimeS + (1 | Trial) + (1 | ParticipantName), data = data, family="binomial")
444 | summary(model)
445 | ```
446 | 
447 | No errors! We'll hold off on adding random slopes to this model for now (it will be really slow).
448 | 
449 | Let's interpret these effects - what do main effects and interactions mean? Note that we centered and deviation-coded our Target variable but *not* our Time variable.
450 | 
451 | Therefore, the TargetC estimate represents the increase in log-odds of looking to the Target between our Animal and Artefact conditions and, because it's positive, they look more to the animal when it is named than when the artefact is named. Because we did not center TimeS before fitting this model, this effect of TargetC is the model's estimate at the *beginning* of the trial. Although we want a main effect, we know from the raw data that we should not be seeing this effect at the beginning of the trial. Instead, it should emerge through the trial as a Time*Target interaction.
452 | 
453 | The effect of TimeS represents the main effect of looking away from the Animal over time.
454 | 
455 | The Intercept represents the odds of looking at the animal at Time==0 (regardless of condition).
456 | 
457 | Visualize the fitted model:
458 | 
459 | ```{r}
460 | data$prediction <- predict(model, data)
461 | 
462 | # in logit space...
463 | ggplot(data, aes(x=TimeS, y=prediction, color=Target)) +
464 |                             stat_summary(fun.y='mean', geom='line')
465 | 
466 | # as proportions...
467 | data$prediction_prob <- exp(data$prediction) / (1 + exp(data$prediction) )
468 | ggplot(data, aes(x=TimeS, y=prediction_prob, color=Target)) +
469 |                             stat_summary(fun.y='mean', geom='line')
470 | ```
471 | 
472 | At first glance, three things are worth mentioning:
473 | 
474 | (1) The proportion plot is slightly warped. This represents the warping that occurs in the transformation between probabilities and log-odds.
475 | 
476 | (2) The model predictions look noisy. Why? Model predictions should be clean and constant given that we only have two factors and an interaction.
477 | 
478 | The problem is that we generated predictions for our datapoints that included the random effects. So, taking the mean of the predictions included idiosynractic estimates for our participants. To remove these, we can simply pass the `re.form=NA` parameter to the `predict()` function:
479 | 
480 | ```{r}
481 | data$prediction <- predict(model, data, re.form=NA)
482 | 
483 | ggplot(data, aes(x=TimeS, y=prediction, color=Target)) +
484 |                             stat_summary(fun.y='mean', geom='line')
485 | ```
486 | 
487 | (3) These are a horrible fit to our data! Let's convert these predictions to probabilities and then plot them alongside our raw data:
488 | 
489 | ```{r}
490 | ggplot(data, aes(x=TimeS, y=Animate, color=Target)) +
491 |                   stat_summary(fun.y='mean', geom='line') +
492 |                   stat_summary(aes(y=exp(prediction)/(1+exp(prediction))), fun.y='mean', geom='line')
493 | ```
494 | 
495 | We fit two straight lines to two curves which clearly change in a non-linear fashion over time. This not only results in this poor visual fit but also, for example, in seeing a main effect of TargetC at timepoint 0 when we know the groups to be very similar at the start of the trial.
496 | 
497 | Therefore, we need a model that will let us capture this non-linear growth over time, and tell us how our two conditions differ with respect to it (growth curves!).
498 | 
499 | # A note about empirical logits, arc-sine roots, and other transformations
500 | 
501 | Sometimes you just need to aggregate your data in a way where you *must* predict proportions even though your raw measure was binomial (i.e., logistic) in nature. For time analyses of eye-tracking data, this is especially common. If you aggregate across trials/items/subjects in any way, you are going to lose the sample-by-sample binomial responses collected from subjects.
502 | 
503 | When this occurs, you'll want to implement a correction which takes the proportional data as and returns a linear-corrected DV as output.
504 | 
505 | Here we'll walk through a couple of transformations, after binning our data by subjects across time.
506 | 
507 | ```{r}
508 | # bin data into 50ms bins
509 | data$Bin <- data$TimeFromSubphaseOnset %/% 50
510 | # equivalent: data$Bin <- floor(data$TimeFromSubphaseOnset / 50)
511 | 
512 | head(data[, c('TimeFromSubphaseOnset','Bin')])
513 | ```
514 | 
515 | Calculate the proportion of looking to the animal (PropAnimal), as well as the number of samples looking to the Animal (y), and the total number of samples (N), and the start of the Bin in ms (Time).
516 | 
517 | Of course, we'll lose the Trial grouping variable/random effect, because we aggregate across trials.
518 | 
519 | ```{r}
520 | binned <- data %>%
521 |           group_by(ParticipantName,Target,Bin) %>%
522 |           summarise(PropAnimal = mean(Animate), y=sum(Animate), N=length(Animate), TimeS=min(TimeS))
523 | ```
524 | 
525 | One transformation that will work for us is called the *empirical logit* (calculated as the log-odds with a constant value to keep from getting -Infinity):
526 | 
527 | ```{r}
528 | binned$elog <- log( (binned$y + .5) / (binned$N - binned$y + .5) )
529 | ```
530 | 
531 | Fit an unweighted empirical logit model:
532 | 
533 | ```{r}
534 | model <- lmer(elog ~ Target*TimeS + (1 | ParticipantName), data = binned)
535 | summary(model)
536 | ```
537 |   
538 | Notice the empirical logit estimates are very similar to our actual logits from the previous logistic model.
539 | 
540 | We can also account for `Target` and `TimeS` random slopes by Participant. Accounting for Time as a random effect is always recommended -- lots of individual noise here.
541 | 
542 | ```{r}
543 | model <- lmer(elog ~ Target*TimeS + (1 + Target + TimeS | ParticipantName), data = binned)
544 | summary(model)
545 | ```
546 | 
547 | Another transformation we can use, beside the empirical logit, is the Arcsine-root transformation. Just like logits and empirical logits, it "stretches out" the ceiling and floor of the distribution, transforming the natural logistic curve inherent to probabilities to be appropriate for linear models.
548 | 
549 | ```{r}
550 | asin(sqrt(.5))
551 | asin(sqrt(.6))
552 | asin(sqrt(.95))
553 | asin(sqrt(.99))
554 | 
555 | plot(seq(.0,1,by=.1),asin(sqrt(seq(0,1,by=.1))))
556 | 
557 | binned$Arcsin <- asin(sqrt(binned$PropAnimal))
558 | head(binned)
559 | 
560 | model <- lmer(Arcsin ~ Target*TimeS + (1 + Target + TimeS | ParticipantName), data = binned)
561 | summary(model)
562 | ```
563 | 
564 | Clean up our workspace.
565 | 
566 | ```{r}
567 | ls()
568 | rm(list=ls())
569 | ```


--------------------------------------------------------------------------------
/tutorial4-growth-curve-analysis.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 |   title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 4: Growth Curve Analyses"
  3 |   author: "Brock Ferguson"
  4 |   date: "July 1, 2015"
  5 |   output:
  6 |     html_document:
  7 |       toc: true
  8 |       theme: readable
  9 | ---
 10 | 
 11 | Load required packages.
 12 | 
 13 | ```{r}
 14 | library(ggplot2)
 15 | library(lme4)
 16 | library(dplyr)
 17 | ```
 18 | 
 19 | Load our eyetracking data.
 20 | 
 21 | ```{r}
 22 | data <- read.csv('data-eyetracking.csv')
 23 | ```
 24 | 
 25 | Note: In this tutorial, we will do by-subjects time analyses only (aggregating across trials within-subjects) for simplicity but you can do by-items/trials analyses in the same way if you aggregate within trials instead of subjects.
 26 | 
 27 | # Bin data within participants by time
 28 | 
 29 | Let's prep these data for analysis much like we prepped for the empirical logit analysis.
 30 | 
 31 | ```{r}
 32 | # rescale, bin, aggregate, and transform our DV's all in one step
 33 | binned <- data %>%
 34 |         mutate(TimeS = TimeFromSubphaseOnset / 1000,
 35 |                Bin = TimeFromSubphaseOnset %/% 50) %>% # re-scale, bin
 36 |         group_by(ParticipantName,Target,Bin) %>% # aggregate within bins
 37 |         summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
 38 |         mutate(elog = log( (y + .5) / (N - y + .5) ), # empirical logit
 39 |                wts = 1/(y + .5) + 1/(N - y + .5), # optional weights
 40 |                Arcsin = asin(sqrt(PropAnimal))) %>% # arcsin-sqrt 
 41 |         ungroup()        
 42 | ```
 43 | 
 44 | # Fit our old linear model
 45 | 
 46 | Here's a model similar to the linear model we had before :
 47 | 
 48 | ```{r}
 49 | binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5)
 50 | binned$TargetC <- scale(binned$TargetC, center=T, scale=F)
 51 | 
 52 | model <- lmer(elog ~ TargetC*TimeS + (1 + TargetC + TimeS | ParticipantName), data = binned)
 53 | summary(model)
 54 | ```
 55 | 
 56 | Visualize the data and model fit:
 57 | 
 58 | ```{r}
 59 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
 60 |                      stat_summary(fun.y=mean, geom="point") +
 61 |                      stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
 62 | ```
 63 | 
 64 | We already know that this model isn't a great fit to our data. By assuming linear growth, it gives us weird estimates, one major one being that it thinks our two groups differ at timepoint 0 (which we know not to be true -- the differences emerge over time).
 65 | 
 66 | # Natural polynomial growth curve analysis
 67 | 
 68 | Let's do our first stab at examining non-linear growth by creating and entering natural polynomials into the model.
 69 | 
 70 | ```{r}
 71 | binned <- binned %>%
 72 |           mutate(TimeS_2 = TimeS^2,
 73 |                  TimeS_3 = TimeS^3,
 74 |                  TimeS_4 = TimeS^4)
 75 | 
 76 | head(binned)
 77 | 
 78 | ggplot(binned, aes(x=TimeS, y=TimeS)) +
 79 |                geom_point() +
 80 |                geom_point(aes(y=TimeS_2), color='red') +  
 81 |                geom_point(aes(y=TimeS_3), color='blue') +
 82 |                geom_point(aes(y=TimeS_4), color='green')
 83 | ```
 84 | 
 85 | Each natural polynomial accelerates at a different rate. These have an interesting property in that, when combined, they capture the correlates/estimated slopes of a variable at successive "bends" in the data. i.e., The 2nd (quadratic) polynomial will capture the first bend, the 3rd (cubic) polynomial will capture the second bend, and so on.
 86 | 
 87 | In this way, models that include these natural polynomials can model non-linear growth.
 88 | 
 89 | ```{r}
 90 | model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3 + TimeS_4) + (1 + Target + TimeS + TimeS_2 + TimeS_3 + TimeS_4 | ParticipantName), data = binned)
 91 | summary(model)
 92 | ```
 93 |   
 94 | We have some convergence issues caused by overparameterization, so let's scale back on these polynomials to something that seems more reasonable. A good rule of thumb is to count the number of "bends" in the data before breaking it down by condition.
 95 | 
 96 | How many bends do we see?
 97 | 
 98 | ```{r}
 99 | ggplot(binned, aes(x=TimeS, y=elog)) +
100 |                stat_smooth(method="loess")
101 | ```
102 | 
103 | There looks to be two bends in the data which, because of the N-1 relationship between polynomials and bends, means we should include 3 polynomial terms.
104 | 
105 | ```{r}
106 | model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3) + (1 + Target + TimeS + TimeS_2 + TimeS_3 | ParticipantName), data = binned)
107 | summary(model)
108 | ```
109 | 
110 | There is still a convergence error, likely because the scales of our variables are too different, like before. We will ignore this for now, because our final GCA approach will fix this.
111 | 
112 | In the mean time, let's see what this model did for us. Looking at the estimates above, it seems to have gotten rid of our timepoint 0 main effect of Target (yes!), and instead shows some strong interactions over time. This seems promising... but let's visualize.
113 | 
114 | ```{r}
115 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
116 |   stat_summary(fun.y=mean, geom="point") +
117 |   stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
118 | ```
119 | 
120 | Awesome! Much better.
121 | 
122 | Let's store this model because are going to use it for comparison later.
123 | 
124 | ```{r}
125 | natural_model <- model
126 | ```
127 | 
128 | # Orthogonal polynomial growth curve analysis
129 | 
130 | ## Solving the multicollinearity of natural polynomials
131 | 
132 | We just did our first non-linear growth curve analysis, but it was sub-optimal for two reasons:
133 | 
134 | (1) these polynomial terms we generated are highly correlated with one another, and multicollinearity in linear models is always bad
135 | (2) our model had trouble converging because of the different scales of our DV's
136 | 
137 | Thankfully, we have something that will help: *orthogonal polynomials*.
138 | 
139 | Let's first consider problem (1): our natural polynomial are highly correlated with each other. Here you can see this in a standard correlation matrix.
140 | 
141 | ```{r}
142 | cor(binned[, c('TimeS','TimeS_2','TimeS_3','TimeS_4')])
143 | ```
144 | 
145 | This is not a good thing when we are trying to attribute variance to each factor independently. This isn't unique to time-based models -- any linear model suffers from multicollinearity.
146 | 
147 | So, what we can do is actually create replacement timecodes for linear, quadratic, cubic, etc. change over time that we *know* by design will be uncorrelated.
148 | 
149 | `poly()` will generate higher-order polynomials for us, with a vector length equivalent to the length of our original time vector.
150 | 
151 | We'll go up to 6th-order polynomials, but we'll stick to the first 3 for most of our models.
152 | 
153 | ```{r}
154 | orthogonal_polynomials <- poly(sort(as.vector(unique(binned$TimeS))), 6)
155 | head(orthogonal_polynomials)
156 | ```
157 | 
158 | Column 1 grows linearly. Column 2 grows quadratically. Column 3 grows cubicly... etc.
159 | 
160 | Visualize this, and verify that they are indeed uncorrelated.
161 | 
162 | ```{r}
163 | ggplot(data.frame(orthogonal_polynomials), aes(x=X1, y=X1)) +
164 |                geom_point() +
165 |                geom_point(aes(y=X2), color='red') +  
166 |                geom_point(aes(y=X3), color='blue') +
167 |                geom_point(aes(y=X4), color='green') +
168 |                geom_point(aes(y=X5), color='purple') +
169 |                geom_point(aes(y=X6), color='yellow')
170 | 
171 | cor(orthogonal_polynomials[, c(1:6)])
172 | round(cor(orthogonal_polynomials[, c(1:6)]),5)
173 | ```
174 | 
175 | Perfect!
176 | 
177 | I like to merge them into the original dataframe using this technique, which allows for missing data from any given participant.
178 | 
179 | ```{r}
180 | time_codes <- data.frame(
181 |                         sort(as.vector(unique(binned$TimeS))),
182 |                         orthogonal_polynomials[, c(1:6)]
183 |                       )
184 | colnames(time_codes) <- c('TimeS','ot1','ot2','ot3','ot4','ot5','ot6')
185 | 
186 | binned <- merge(binned, time_codes, by='TimeS')
187 | ```
188 | 
189 | ## Orthogonal modeling
190 | 
191 | Now let's model our data exactly like we did before but using these orthogonal polynomials:
192 | 
193 | ```{r}
194 | model <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 | ParticipantName), data = binned)
195 | summary(model)
196 | ```
197 | 
198 | Great fit! No errors.
199 | 
200 | Interestingly, we are back to seeing a main effect of `TargetArtefact`, though... Why? The reason is simple: Our natural polynomials all started at Timepoint 0, meaning that main effects represented differences at the *start* of the time window. In contrast, orthogonal polynomials are, by default, centered at 0, meaning that main effects represent *average* differences (across time) between levels of a factor.
201 | 
202 | Let's visualize our data and model fit:
203 | 
204 | ```{r}
205 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
206 |   stat_summary(fun.y=mean, geom="point") +
207 |   stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line")
208 | ```
209 | 
210 | Compare this model to our natural polynomial model.
211 | 
212 | ```{r}
213 | summary(natural_model)
214 | summary(model)
215 | ```
216 | 
217 | We can use the same methods as before to get confidence intervals, test for Type III significance, etc.
218 | 
219 | ```{r}
220 | # confint(model)
221 |   # this takes a long with a model this complex....
222 | 
223 | drop1(model, ~., test="Chisq")
224 | ```
225 | 
226 | `drop1()` suggests that all of our parameters are reliable predictors.
227 | 
228 | Let's try adding 4th and 5th orthogonal polynomials manually and seeing their effects on this model.
229 | 
230 | ```{r}
231 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned)
232 | summary(model_quartic)
233 | 
234 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
235 |   stat_summary(fun.y=mean, geom="point") +
236 |   stat_summary(aes(y=predict(model,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model
237 |   stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line") # 4th-order model
238 | 
239 | anova(model, model_quartic)
240 | ```
241 | 
242 | Despite the very underwhelming difference in model fits, model comparison says that `ot4` is a significant predictor.
243 | 
244 | One possibility is that this significant difference is because we not only added `ot4` as a fixed effect, we also added it to the random structure. When examining the influence of a fixed effect, it's best to keep your random effect structure constant. The fact that we changed our random structure is why the `anova()` says that our models differ by 8 degrees of freedom when we only wanted to see the influence of the main effect of `ot4` and its interaction with `TargetC` (i.e., two parameters or 2 df difference).
245 | 
246 | ```{r}
247 | model_cubic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F)
248 | 
249 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F)
250 | 
251 | anova(model_cubic, model_quartic)
252 | ```
253 | 
254 | Despite the underwhelming difference in visualized model fits, it still says it's a significantly better fit.
255 | 
256 | What about a 5th polynomial?
257 | 
258 | ```{r}
259 | model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 + ot5 | ParticipantName), data = binned, REML=F)
260 | 
261 | model_quintic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4 + ot5) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 + ot5  | ParticipantName), data = binned, REML=F)
262 | 
263 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
264 |   stat_summary(fun.y=mean, geom="point") +
265 |   stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 4th-order model
266 |   stat_summary(aes(y=predict(model_quintic,binned,re.form=NA)), fun.y=mean, geom="line") # 5th-order model
267 | 
268 | anova(model_quartic, model_quintic)
269 | ```
270 | 
271 | Adding a 5th polynomial did not improve our fit.
272 | 
273 | Here's a final reminder of how bad our linear model was:
274 | 
275 | ```{r}
276 | model_linear <- lmer(elog ~ TargetC*(ot1) + (1 + TargetC + ot1 | ParticipantName), data = binned)
277 | summary(model_linear)
278 | 
279 | ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
280 |   stat_summary(fun.y=mean, geom="point") +
281 |   stat_summary(aes(y=predict(model_quartic,binned,re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model
282 |   stat_summary(aes(y=predict(model_linear,binned,re.form=NA)), fun.y=mean, geom="line") # 2nd-order model
283 | ```
284 | 
285 | ## Growth curve analyses with 3+ levels of a factor
286 | 
287 | I like to design experiments with only 2 levels per factor for simplicity but sometimes we have 3 levels in a factor and, now, main effects do not equal simple effects.
288 | 
289 | To demonstrate, let's add a third "Neutral" Target level that will follow a similar trajectory to the "Animal" level but shifted down towards 50% chance.
290 | 
291 | ```{r}
292 | new_condition <- binned[which(binned$Target == 'Animal'), ]
293 | new_condition$Target <- 'Neutral'
294 | #new_condition$y <- new_condition$y - round(new_condition$N / 3)
295 | new_condition$y <- new_condition$y + round(rnorm(length(new_condition$y),-.5,2))
296 | new_condition$y <- ifelse(new_condition$y > new_condition$N,new_condition$N,new_condition$y)
297 | new_condition[which(new_condition$y < 1), 'y'] <- 1
298 | new_condition$PropAnimal <- new_condition$y / new_condition$N
299 | new_condition$elog <- log( (new_condition$y) / (new_condition$N - new_condition$y + .5) )
300 | new_condition$wts <- 1/(new_condition$y + .5) + 1/(new_condition$N - new_condition$y + .5)
301 | new_condition$Arcsin <- asin(sqrt(new_condition$PropAnimal))
302 | 
303 | binned_3levels <- rbind(binned,new_condition)
304 | binned_3levels$Target <- factor(binned_3levels$Target)
305 | 
306 | ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
307 |   stat_summary(fun.y=mean, geom="point")
308 | ```
309 | 
310 | Fit a model with Target treatment-coded.
311 | 
312 | ```{r}
313 | model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
314 | summary(model)
315 | 
316 | ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
317 |   stat_summary(fun.y=mean, geom="point") +
318 |   stat_summary(aes(y=predict(model,binned_3levels,re.form=NA)), fun.y=mean, geom="line")
319 | ```
320 | 
321 | In order to know how to interpret these simple effects, we need to remember which is our reference level.
322 | 
323 | Get main effects via model comparison...
324 | 
325 | ```{r}
326 | model_null <- lmer(elog ~ Target*(ot1 + ot2) + ot3 + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
327 | summary(model_null)
328 | 
329 | anova(model,model_null)
330 | ```
331 | 
332 | Get simple effects by re-ordering factor levels.
333 | 
334 | ```{r}
335 | levels(binned_3levels$Target)
336 | binned_3levels$Target <- factor(binned_3levels$Target,c('Neutral','Animal','Artefact'))
337 | levels(binned_3levels$Target)
338 | 
339 | model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
340 | summary(model)
341 | ```
342 | 
343 | Clean up our workspace.
344 | 
345 | ```{r}
346 | ls()
347 | rm(list=ls())
348 | ```


--------------------------------------------------------------------------------