Thursday, 29 May 2014

Eco-Stats Lab May 2014: Longitudinal data

In this week’s lab we will learn about analysing longitudinal data, that is, data collected over time for each study unit (e.g. subject or site). 


Standard (generalised) linear models assume all observations are independent, which is typically not reasonable if repeated measures have been taken over time.  Data collected for one site at times close together will be more similar than data for different sites, or for times further apart. This induces correlation in the data, which we must account for to obtain valid inference.

Normal data

#Load data
library(lme4)
library(nlme)
library(MASS)

data(BodyWeight)

#scatterplot by Diet
plot(weight~Time,col=Diet,data=BodyWeight)

#ignore all dependence, bad analysis
Rat.lm0=lm(weight~Time,data=BodyWeight)
Rat.lm=lm(weight~Time+Diet,data=BodyWeight)
summary(Rat.lm)
anova(Rat.lm0,Rat.lm)
plot(Rat.lm,which=1)
plot(Rat.lm$fitted,Rat.lm$residuals,col=BodyWeight$Rat)

Problem:
The effects of ignoring dependence vary with the type of dependence but in many cases you can expect
  • Standard errors are underestimated, giving “false confidence”
  • T-statistics will be overestimated, and regression coefficients that appear significant may not be


Solution 1 – mixed effects model with random intercept and slope

#plot data by Rat
interaction.plot(BodyWeight$Time, BodyWeight$Rat, BodyWeight$weight)


#model with random effects for intercept and slope
Rat.mixed0<- lmer(weight ~ Time + (1 + Time | Rat) , data = BodyWeight)
Rat.mixed<- lmer(weight ~Time +Diet+ (1 + Time | Rat) , data = BodyWeight)
anova(Rat.mixed0,Rat.mixed)

#There seems to be an effect of diet.

summary(Rat.mixed)
plot(Rat.mixed)
plot(fitted(Rat.mixed),residuals(Rat.mixed),col=BodyWeight$Rat)


This analysis assumes that after fitting a line for each subject, the observations for each subject have the same correlation, regardless of how far away they are in time.  This is not generally realistic; observations closer to each other in time might be more correlated than those further apart.

We can’t use lme4 for analysis which incorporates more flexible correlation structure, we will need to use library(nlme). The autocorrelation structure most commonly used for data correlated in time is autoregressive AR(1). It assumes data points close in time are more strongly correlated than those further apart in time.

Solution 2 – mixed effects model with flexible correlation structures

Rat.lme0 <-lme(weight ~  Time + Diet, random = ~ Time  | Rat, data = BodyWeight)
Rat.lme1 <-lme(weight ~  Time + Diet, random = ~ Time  | Rat, corr=corAR1(, form= ~ Time| Rat), data = BodyWeight)

AIC(Rat.lme0,Rat.lme1 )



#We can see that the AR1 correlation structure seems to work better as the AIC is smaller for this model.


Non Normal Data

What about non normal data? Well we can use the same set up as before in lme4, but now using the glmer function. 

data(epil)
interaction.plot(epil$period, epil$subject, epil$y)
interaction.plot(epil$period, epil$subject, log(epil$y+1),col=epil$subject)

#model with random effects for intercept and slope
Epil.mixed0<- glmer(y ~1 + period + (1 + period | subject) , data = epil,family=poisson)
Epil.mixed<- glmer(y ~1 + period +age+ (1 + period | subject) , data = epil,family=poisson)

# whoops, warning about convergence, let's not worry about it, it's not a problem in this case
# read http://stackoverflow.com/a/21370041 and http://stats.stackexchange.com/a/99719 if you have a similar problem


anova(Epil.mixed0,Epil.mixed)
                                                                                 
#There seems to be no effect of age. 

What if we want an AR(1) correlation structure instead? Well, it is more complicated, some options are glmmPQL in the MASS package, and geeglm in the geepack package. glmmPQL uses the lme function above, and has very similar syntax. geeglm does not fit quite the same model as lme4, but can also be used if you would like to fit an AR(1)  structure, or other more flexible correlation structures.