Monday, 27 October 2014

R-lab: Inference with Spatially correlated data with nlme

Hi guys,

Problem: You have data that may be spatially correlated, and you worry this might affect your inference (p-values and confidence intervals for fixed effects). 

 Solution : This Friday, 31st October at 2pm, R-lab on inference with Spatially correlated data with the nlme package.

Download R code and data here.

See you there,

Material adapted from

What we can and can’t do with nlme:
  • Continuous data only, not counts or presence-absence (for discrete data you could have a look at glmmPQL in MASS package, quick example later, but limited)
  • Cannot have more than one measurement at any location ( unless you have a grouping variable, then you can’t have more than one measurement at any location within any group)
  • Not too much data, or it will give an error
  • No spatial prediction (so you can't draw a map which includes the spatial random effect, which is unfortunate)

Spatial Data- its everywhere

Continuous data

We start with continuous data. We will use a dataset of tree bark thickness. 


# install.packages("nlme")

We will need a dummy variable because lme requires a grouping variable, but it will be the same for all observations, hence dummy

spdata <- cbind(spdata, dummy=1)

First we fit a model with no spatial correlation

nonsp.null = lme(thick ~ 1, data = spdata, random = ~ 1 | dummy,                      method='ML')
nonsp.alt=lme(thick ~ soil, data = spdata, random = ~ 1 | dummy,                   method='ML')

We notice a significant effect of soil. Now let's have a look at a variogram of the residuals

# plot(Variogram(nonsp.alt,resType="normalized",form=~lat+long))

Variograms are weird things, they give us similar information as an autocorrelation function, but are upside down and backwards, so don’t use your intuition from acf() functions. Here are how some common correlation structures look on a variogram, but beware, there’s usually a lot of noise. If there is no upwards trend at all, then there may not be any spatial correlation.

 (from Pinheiro and Bates 2004, p. 233)

Next we will fit some correlations functions to see what fits best

spher.null <- lme(thick ~ 1, data = spdata, random = ~ 1 | dummy,
correlation = corSpher(form = ~ lat+long| dummy),method='ML')

We want the variogram to be flat when we include a spatial random effect with the correct structure.

gauss.null <- lme(thick ~ 1, data = spdata, random = ~ 1 | dummy,
correlation = corGaus(form = ~ lat+long| dummy),method='ML')

We could (should) also try corExp,corLin,corRatio correlation structures.


The Gaussian correlation structure seems to perform the best, so we'll use that.

gauss.alt <- lme(thick ~ soil, data = spdata, random = ~ 1 | dummy,
correlation = corGaus(form = ~ lat+long| dummy),method='ML')

Now there is no evidence of effect of soil. What happen?
The value at a site is best predicted by the values at sites nearby, and the value of soil does not add any information on top of that

Adding a Grouping variable

Sometimes we may only want to model spatial correlation for observations within a value of some grouping variable. We will used species as a grouping variable, but this does not take into account correlation between species

group.null <- lme(thick ~ 1, data = spdata, random = ~ 1 | species,
                  correlation = corGaus(form = ~ lat+long| species),                   method='ML')
group.alt <- lme(thick ~ soil, data = spdata, 
                 random = ~ 1 | species,
                 correlation = corGaus(form = ~ lat+long| species),                  method='ML')
anova(group.null,group.alt )

Counts and presence absence data

This data is of the number of red maples in a quadrat.

trees <- cbind(trees, dummy=1)
nonsp.alt=glmmPQL(red_maple~elev, data = trees, 
                  random = ~ 1 | dummy,family=poisson)

Spatial correlation!

gaus.alt=glmmPQL(red_maple~elev, data = trees, 
                 random = ~ 1 | dummy,
                 correlation = corGaus(form = ~ x+y| dummy),


spher.alt=glmmPQL(red_maple~elev, data = trees, 
                  random = ~ 1 | dummy,                  
                  correlation = corSpher(form = ~ x+y| dummy),


Both seem to make the variogram very flat! Yay! but we can't test with anova(), so visual check is the best bet :(,  We should try other correlation functions too, see what's best. 



What about when we have factors, well we can do a joint wald test 

spher.alt.soil= glmmPQL(red_maple~soil, data = trees, 
                random = ~ 1 | dummy,
                correlation = corSpher(form = ~ x+y| dummy),


#we want to test terms 2 and 3 together, so
wald.test(b = t(coef(spher.alt.soil)), 
          Sigma = vcov(spher.alt.soil),Terms=2:3)

Want to do more?

This is as far as nlme will take you, I believe.  The packages I have found that are able to do more complex spatial modelling are all Bayesian (so no p-values) but they can do things like spatial prediction (drawing a spatial map),  and allow more than one observation per location, which nlme does not seem to do. They are a lot more difficult to fit and use, but people seem happy with them. Some options are:

geoR and geoRglm





  1. Some would argue that you should try to avoid hypothesis tests of assumptions (because for example the data weren't collected to test the a priori hypothesis that data were independent, spherical, etc). Instead you could use a model selection approach, pick the autocorrelation model that best matches the data, eg by AIC. AICs are given in the anova output, but you can get them without the other stuff using something like:
    > AIC(nonsp.null,spher.null,gauss.null)
    df AIC
    nonsp.null 3 346.98685
    spher.null 4 147.89675
    gauss.null 4 89.55566

  2. Francis makes the good point that the spatial structure has been compared on the null models (thick~1) rather than the alternative models (thick~soil), and does this matter. It might, and apparently different people have different ideas on this, but since the goal is to account for residual spatial autocorrelation after accounting for predictors like soil there is an argument for choosing autocorrelation structure using something like
    AIC(nonsp.alt, spher.alt, gauss.alt)
    where these models each have thick~soil in their formula.

  3. There was also discussion about when including spatial correlation is and is not necessary. If you chose sites and then apply a treatment to them randomly, and you test for that treatment, and your model does not contain any predictors which are spatially correlated (like temperature, rainfall etc) then David argues it's not needed, otherwise if you have data with spatial co-ordinates I say plot a Variogram and see if you need to include a spatial term.