Monday, 16 November 2020

ecoCopula: an R package for fast ordination and graphical modelling

I have been a bit tardy putting up October's Eco-Stats Lab, which was on the ecoCopula package (now on CRAN). ecoCopula does fast model-based ordination, and lets you create 'graphs' of species associations (interactions) from co-occurrence data. See the vignette for a walk-through.

Thursday, 12 November 2020

plotenvelope - making sense of diagnostic plots using simulation envelopes

This month's Eco-Stats Lab is on the plotenvelope function in the ecostats package. This function can be used (in place of plot) to construct residual plots of most common fitted model objects in R, including global simulation envelopes around fitted values (or around smoothers) on the plot. These slides explain what this all means.

Sunday, 27 September 2020

 

Introduction to Docker

It’s become increasingly important to be able to experiment, reproduce workflows and results, share code, and take advantage of large-scale data analysis.


Docker is a lightweight container technology (think of it as a lightweight virtual machine) that provides an isolated, self-contained, versioned and shareable Linux-based environment, and it is now the tool of choice in industry. With the growth of cloud computing, companies regularly spin up clusters containing hundreds or even thousands of Docker instances - and with that there’s been a concurrent growth in management and orchestration tools in the Docker ecosystem.

Ok - that’s great - but why should statisticians care?

Because this is a great way to easily scale and share your research while it’s in progress. You can package up work-in-progress R code and data without having to build R packages first.


It also allows you to experiment and test without concerns - for example R 4.0 is now available. Given the choice between testing your work against new versions of R and upgrading your base R installation, an easy solution is to use a Docker container configured with R 4.0 and leave your host machine alone.


Here are some motivating examples:


  • Test your package against new and previous versions of R packages and environments without having to upgrade all your existing packages. This allows you to detect breaking changes in your package before release


  • Try out a completely new technology without polluting your computer - e.g. teach yourself Julia and Jupyter notebooks in an isolated environment


  • Scale parallel code without change - parallel code that runs on your local host can take advantage of massive cloud compute power without any extra coding


  • Ship & Share - docker promotes reproducibility in research. No matter what needs to be installed in the environment to reproduce your results, you can build a complete machine image packaging up code and data and simply share the image. This is the virtual equivalent of “here - just use my computer. It works on that…”


Docker doesn’t replace the value of building and sharing R packages themselves. The point is simply that you do not have to wait until all your code is packaged any longer and can experiment and collaborate freely while research is in progress.



Basic Concepts


DockerHub - (https://hub.docker.com/) this is the equivalent of GitHub but much coarser. It’s the central repository for versioned Docker images - you can pull images from public repositories for your own use or pull/push images to your own private repositories. 


To use DockerHub repositories of your own you need to sign up for a free account, which automatically gives you access to your own private repositories. There are also enterprise tiers with additional benefits.


Docker Image - this is a blueprint for the machine itself. You can choose from thousands of pre-configured images available at DockerHub, use them as is or as a baseline to customise your own. An image is like a class in programming - it’s used to create instances of the running machine - known as containers.


An image contains all the code, packages and programming environment needed to support the work you want to do. It also has a version associated with it - called a tag.


Dockerfile - used to build an image. This is where you can specify additional packages, copy source code or add additional Linux command line tools to be included in the image. You only need to work with this if you want to customise an image for your own use.


Docker Container - this is a running instance of a machine created from an image. A container is like an object instance created from a programming class. The Docker service on your local machine will instantiate the container for you - you can then connect to it and get busy.


General Notes


While Docker is a mature technology it does have some drawbacks. Docker images are generally a few gigabytes in size and so can quickly take up a lot of space on a standard laptop. 

Consequently, pushing and pulling images to and from DockerHub can be time consuming unless you have a fast internet link. This isn’t a big problem as most of the time you’re probably only working with one container image but it’s worth keeping in mind.


Secondly, interfacing with Docker can be daunting. Bear in mind that it was designed by hardcore sys-admin types so there are options (i.e. command line switches) for almost anything you can think of.


However, there are generally only a few commands to know that will cover almost all use cases, which we’ll summarise next.

Docker Command Cheatsheet

These commands are run in a terminal window on your Windows/Mac/Linux host machine:


Login to DockerHub from your host machine:


docker login


This will prompt you for the username and password you used to sign up with DockerHub. 


Pull an Image to your local machine:


docker pull rocker/geospatial


This pulls the latest version of the rocker/geospatial image. You can also specify a particular version you want to use by including a tag:


docker pull rocker/geospatial:4.0.2


List images:


docker images


Displays all images you have available on your host.


Launch a container (simple version):


docker run -it chenobyte/geospatial:1.4 /bin/bash


This creates a container using the chenobyte/geospatial:1.4 local image, connects to it and runs a bash shell inside the container. The -it switches ensure the container stays interactive. You are then inside the container as root.


Once inside, you can kill or exit the container either by Ctrl+D or by typing exit. This drops you back out to the host terminal. You can also exit back to the host without killing the container by pressing Ctrl+P followed by Ctrl+Q. This is known as detaching.


List containers:


docker ps -a


Displays all containers including their status. Note that docker automatically assigns names to each instance - although you can override this (with the --name option of docker run) if you really want to.



Attaching to a Running Container:


Each container has a container identifier (the first column of the docker ps output). You can attach (jump back into) a running container using the container id:


docker attach <container_id>



Commit a Running Container:


This is useful when you’ve launched a container, made some changes while inside it (say installed some utilities or software) and now want to save the complete container state as a new image that you can reuse later.


The general format is:


docker commit <container_id> name:tag


As an example, if I’ve launched a container using image chenobyte/geospatial:1.4 and then made some changes I want to keep (e.g. installed git), I can detach and save the current container state as a new image. If the container id is 1234 then I can make a new version:


docker commit 1234 chenobyte/geospatial:1.5


I now have a new image version which is just like any other image and can be used to launch new containers too.


Stopping a Running Container:


You stop a running container from the host using the container id:


docker stop <container_id>

Launch a container (complicated version):


docker run -it \
-p 8888:8888 \
-v /home/ec2-user/data:/data \
-e NB_UID=$(id -u) \
-e NB_GID=$(id -g) \
-e GRANT_SUDO=yes \
-e GEN_CERT=yes \
--user root \
chenobyte/geospatial:1.4 \
/bin/bash


This example looks daunting but is actually the most useful common case. The main issues are around making sure you have permissions inside the container to write to host directories etc. You don’t need to understand all the -e options - you can just re-use as is.


The -p switch exposes port 8888 inside the container as port 8888 on the host (the format is host_port:container_port). This is useful when the container is running a server (RStudio, Jupyter etc).


The -v option is probably the most useful. This is a volume mapping which says that the directory /home/ec2-user/data on the host should be made available inside the container at /data. In this way, data files or code on the host can be accessed from within the container itself. You can have as many mappings as you want here.


There are many more options available for running containers, including specifying available CPUs and RAM, container lifecycle etc. For more information check out the references in the resources section below.
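Since the long launch command is meant to be re-used as is, one possible way to avoid retyping it is a small wrapper script. The sketch below makes a few assumptions: the variable names IMAGE, HOST_DATA, PORT and DRY_RUN are made up for illustration, the defaults simply repeat the values used above, and by default the script only prints the command rather than running it.

```shell
#!/bin/sh
# Illustrative wrapper around the long "docker run" command above.
# IMAGE, HOST_DATA, PORT and DRY_RUN are hypothetical names chosen for this sketch.
IMAGE="${IMAGE:-chenobyte/geospatial:1.4}"
HOST_DATA="${HOST_DATA:-/home/ec2-user/data}"
PORT="${PORT:-8888}"

# Build the argument list once so the same options are used every time.
set -- run -it \
  -p "${PORT}:${PORT}" \
  -v "${HOST_DATA}:/data" \
  -e "NB_UID=$(id -u)" \
  -e "NB_GID=$(id -g)" \
  -e GRANT_SUDO=yes \
  -e GEN_CERT=yes \
  --user root \
  "${IMAGE}" /bin/bash

cmd="docker $*"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: only show what would be run, so Docker is not needed to try the script.
  echo "$cmd"
else
  docker "$@"
fi
```

Setting DRY_RUN=0 launches the container for real, and overriding HOST_DATA or IMAGE on the command line changes the mapped directory or the image without editing the script.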

Building Your Own Image

Docker images are built using a plain text file known as the Dockerfile. There is a wide range of options available (see resources below) but the main thing for us is to see how we can use an existing base image, and specify some additional R packages to be installed.


The first line in a Dockerfile specifies the base image you want to use. For example you might want to use a base image that contains a Shiny server. In that case (once you’ve found the image on DockerHub) you specify it in the Dockerfile:


FROM rocker/shiny:4.0.0


Note - while you don’t have to specify the tag (version) it’s best practice to do so - if you don’t it will use the latest version - which can be updated without you knowing!
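A quick way to catch untagged references before they end up in a Dockerfile or a docker pull command is a small shell check. This is only a sketch: the function name has_explicit_tag is made up for illustration and is not a Docker feature. It simply looks for a tag (or digest) after the final slash of an image reference.

```shell
# has_explicit_tag is a hypothetical helper, not part of Docker.
# It succeeds when an image reference carries a tag (name:tag)
# or a digest (name@sha256:...), and fails otherwise.
has_explicit_tag() {
  name="${1##*/}"   # keep only the part after the last "/", so a registry port is ignored
  case "$name" in
    *:*) return 0 ;;
    *)   return 1 ;;
  esac
}

for image in rocker/shiny:4.0.0 rocker/shiny; do
  if has_explicit_tag "$image"; then
    echo "$image: pinned"
  else
    echo "$image: no tag, Docker would fall back to latest"
  fi
done
```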


You can also specify code to be copied into the image, set the initial working directory, environment variables and execute commands. In the example below we use the geo-spatial image as the base and install additional R packages.


Note that the process shown below in the RUN command is specific to these R images - these ensure that R packages and dependencies are always installed from a fixed snapshot of CRAN at a specific date.
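A minimal sketch of such a Dockerfile, written out from the shell so it can be copied and pasted, might look like the following. Several things here are assumptions: the directory name geospatial-image is hypothetical, the two R packages are just examples, and the RUN line relies on the install2.r helper script that the rocker images provide.

```shell
# Write an illustrative Dockerfile into a new directory.
# The directory name and the two R packages are assumptions for this sketch.
mkdir -p geospatial-image
cat > geospatial-image/Dockerfile <<'EOF'
# Start from a pinned (tagged) base image
FROM rocker/geospatial:4.0.2

# install2.r is provided by the rocker images; --error makes the build
# fail if a package cannot be installed
RUN install2.r --error ecoCopula ecostats
EOF

# Show the file that was written
cat geospatial-image/Dockerfile
```

The build command would then be run from inside the geospatial-image directory, where the Dockerfile lives.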



Once you’ve defined your Dockerfile you create the image and give it a tag. This needs to be run from the same directory as the Dockerfile (note the trailing full stop!). Here we’re saying that the new version of the image is to be tagged as version 1.5:


docker build --tag chenobyte/geospatial:1.5 .


Now your image is built you can create containers with it and push the image back to your repository in DockerHub. This is the bit that can take a while as it’s usually a multi-gigabyte upload:


docker push chenobyte/geospatial:1.5


And now you can share your work with colleagues as an entire self-contained unit.

Summary

This is just a quick introduction but while Docker has a steep initial learning curve, it’s one that’s worth investing the time in. You don’t need to become an expert to take advantage of massive scale compute power either.


Imagine you had a high performance 10-node cluster that you wanted to run a simulation on. You would have to make sure every node had the same R environment, packages, versions and data. By creating a Docker image it doesn’t matter if it’s a 10-node or 100-node cluster. You only need to specify a run time environment once. 

In our recent work we used Docker containers across a 4 x 64 CPU node cluster - this allowed us to fit 17 complete large scale spatial models in parallel while using the best allocation of available cluster resources.


This was a great realisation of the power of Docker considering all of the model fitting code was developed on a crotchety 4-core, five-year-old Mac laptop!


There is far more below the surface that could be discussed about Docker but most of it is not really relevant to statisticians. 


The main points presented here are hopefully useful for eco stats work and general research.


Resources


Docker Hub - https://hub.docker.com


Docker File Reference - https://docs.docker.com/engine/reference/builder


Docker Reference - https://docs.docker.com/engine/reference


RStudio and Docker - How To: Run RStudio products in Docker containers – RStudio Support







Friday, 29 March 2019

Paper of the year 2018

The paper-of-the-year competition sees Eco-Stats members nominate their favourite article hoping to win "free coffee for a year". This year saw a relaxation of some of the previous requirements - no longer did the paper need to be about ecology or statistics; or even be published; in fact one entry was even from the end of 2017. The result was a wide field - from LEGO investors; to fire wielding sh*t hawks; to the earth getting into a bit of a (blueberry) jam. After much debate the results are in:

The winning paper was:


Anders Sandberg (2018) Blueberry Earth.

arXiv:1807.10553 [physics.pop-ph]

This paper was nominated by Gordana Popovic because:

It shows very unpretentiously what research involves. You have a question, you find all the research in the area, you mush the two together, you get an answer.


Honourable mentions go to:


Bonta, M et al. (2017) Intentional Fire-Spreading by “Firehawk” Raptors in Northern Australia. Journal of Ethnobiology

This paper was nominated by Ben Maslen because:

I like this paper as it outlines a very intriguing and at first glance outlandish ecological behaviour that combines empirical evidence with indigenous ecological knowledge. Birds spreading fires, who knew!


Dobrynskaya, Victoria and Kishilova, Julia (2018) LEGO - The Toy of Smart Investors.

This paper was nominated by Michelle (Shi Jie) Lim because:

I like this paper because Lego toys do not belong to the luxury segment and are affordable to most retail investors. Although the returns may not be significant in reality, the study shows that people are willing to pay a premium for Lego sets. Any Lego toy owner would probably find this paper relatable.


The other nominations in no particular order are:


Zhu and Bradic (2018) Linear Hypothesis Testing in Dense High-Dimensional Linear Models. Journal of the American Statistical Association.

This paper was nominated by David Warton because:

...developing an original new machinery for inference and applying it to a tricky problem, that of simultaneous inference for lots of parameters.


Miller and Sanjurjo (2018) Surprised by the Hot Hand Fallacy? A Truth in the Law of Small Numbers. Econometrica.

This paper was nominated by Robert Nguyen because:

I think this is interesting because it tackles something that generally people believe (there is a hot hand in sport) but as of yet there is no evidence it exists or is there?


Fraser et al. (2018) Questionable research practices in ecology and evolution. PLoS ONE.

This paper was nominated by Mitchell Lyons because:

I like these types of papers, and there’s been a few lately, including some recent press on some editorials in Nature (Google those if you like). It helps me to gain context on why people (we, and me now) in ecology and evolution think about significance the way they do.


Yixin Wang & David M. Blei (2018): Frequentist Consistency of Variational Bayes. Journal of the American Statistical Association.

This paper was nominated by Elliot Dovers because:

I like this paper because it provides a firmer theoretical footing for a technique (more generally than has been previously for variational approximations) that has perhaps been used willy-nilly for a long time in computer science (I'm not innocent here either). I also like that the authors give time to addressing (and linking) the technique with respect to both frequentist and Bayesian approaches. The result "bridges the gap in asymptotic theory between the frequentist variational approximation, in particular the variational frequentist estimate (VFE), and variational Bayes"... good to see authors with a less binary take on what it means to be a statistician.

Wednesday, 20 March 2019

Template Model Builder Tutorial

Many of the Eco-Stats group are using Template Model Builder (TMB) - a very flexible package in R for fitting all sorts of latent variable models quickly. For R users without any C++ coding experience, getting familiar with the package might be a little daunting so we've put together a gentle introduction with some simple examples. Follow the link below and get going with TMB:

TMB Introduction Tutorial

Note: before installing TMB (by your usual means of installing an R package) compiling C++ code will require a working development environment. In Windows you can just install the latest version of Rtools - follow the install guide here. If installing on Mac OS or Linux - following the devtools install guide will do the trick - check it out here.

Thursday, 8 March 2018

Paper of the year 2017

The competition for paper of the year 2017 was heated, with the ecostatistician proposing the winning paper scoring the coveted "free coffee for a year" prize. The nominations were diverse, all the way from pure ecology to very fancy stats. After much debate, the winner was:

Hallmann CA, Sorg M, Jongejans E, Siepel H, Hofland N, et al. (2017) More than 75 percent decline over 27 years in total flying insect biomass in protected areas. PLOS ONE 12(10)

This paper was nominated by John Wilshire, who summarises it as follows:

Flying insects play a very important role in ecosystems, both as pollinators and as food sources for other animals. This paper shows that their populations have massively declined over a relatively short period of time  (at least in protected areas in Germany). I like this paper as it presents the results of a long term study, and it is a pretty scary example of the impacts we are having on ecosystems. Plus it is open access and has data and code available, and the statistical analysis is presented in a clear and easy to follow manner.

Other nominees were (in no particular order):


Thursday, 13 April 2017

Special Feature in Methods in Ecology and Evolution on Eco-Stats '15

There is a Special Feature in the April 2017 issue of Methods in Ecology and Evolution reporting outcomes from the Eco-Stats '15 conference. A blog post about it is here:

https://methodsblog.wordpress.com/2017/04/12/generating-new-ideas/