Eco-Stats Research Blog
The life and times of the UNSW Ecological Statistics ("Eco-Stats") research group.
Monday, 16 November 2020
ecoCopula: an R package for fast ordination and graphical modelling
I have been a bit tardy putting up October's Eco-Stats Lab, which was on the ecoCopula package (now on CRAN). ecoCopula does fast model-based ordination, and lets you create 'graphs' of species associations (interactions) from co-occurrence data. See the vignette for a walk-through.
Thursday, 12 November 2020
plotenvelope - making sense of diagnostic plots using simulation envelopes
Sunday, 27 September 2020
Introduction to Docker
It has become increasingly important to be able to experiment, reproduce workflows and results, share code, and take advantage of large-scale data analysis.
Docker is a lightweight container technology - think of it as a lightweight virtual machine - that provides an isolated, self-contained, versioned and shareable Linux-based environment, and it is now the tool of choice in industry. With the growth of cloud computing, companies regularly spin up clusters containing hundreds or even thousands of Docker instances - and with that there has been a concurrent growth in management and orchestration tools in the Docker ecosystem.
Ok - that’s great - but why should statisticians care?
Because this is a great way to easily scale and share your research while it’s in progress. You can package up work-in-progress R code and data without having to build R packages first.
It also allows you to experiment and test without concerns. For example, R 4.0 is now available; rather than upgrading your base R installation just to test your work against it, an easy solution is to use a Docker container configured with R 4.0 and leave your host machine alone.
Here are some motivating examples:
Test your package against new and previous versions of R packages and environments without having to upgrade all your existing packages. This allows you to detect breaking changes in your package before release
Jump-start exploratory work with pre-configured images. For example the https://hub.docker.com/r/rocker/geospatial image comes with the full suite of R spatial packages as well as tidyverse out of the box
Try out a completely new technology without polluting your computer - e.g. teach yourself Julia and Jupyter notebooks in an isolated environment
Scale parallel code without change - parallel code that runs on your local host can take advantage of massive cloud compute power without any extra coding
Ship & share - Docker promotes reproducibility in research. No matter what needs to be installed in the environment to reproduce your results, you can build a complete machine image packaging up code and data, and simply share the image. This is the virtual equivalent of “here - just use my computer. It works on that…”
Docker doesn’t replace the value of building and sharing R packages themselves. The point is simply that you no longer have to wait until all your code is packaged - you can experiment and collaborate freely while research is in progress.
Basic Concepts
DockerHub (https://hub.docker.com/) - this is the equivalent of GitHub, but at a coarser granularity: instead of code, it is the central repository for versioned Docker images. You can pull images from public repositories for your own use, or pull/push images to your own private repositories.
To use Docker you need to sign up for a free account, which automatically gives you access to your own private repositories. There are also enterprise tiers with additional benefits.
Docker Image - this is a blueprint for the machine itself. You can choose from thousands of pre-configured images available at DockerHub, use them as is or as a baseline to customise your own. An image is like a class in programming - it’s used to create instances of the running machine - known as containers.
An image contains all the code, packages and programming environment needed to support the work you want to do. It also has a version associated with it - called a tag.
Dockerfile - used to build an image. This is where you can specify additional packages, copy source code or add additional Linux command-line tools to be included in the image. You only need to work with this if you want to customise an image for your own use.
Docker Container - this is a running instance of a machine created from an image. A container is like an object instance created from a programming class. The Docker service on your local machine will instantiate the container for you - you can then connect to it and get busy.
General Notes
While Docker is a mature technology it does have some drawbacks. Docker images are generally a few gigabytes in size and so can quickly take up a lot of space on a standard laptop.
Consequently, pushing and pulling images to and from DockerHub can be time consuming unless you have a fast internet link. This isn’t a big problem as most of the time you’re probably only working with one container image but it’s worth keeping in mind.
Secondly, interfacing with Docker can be daunting. Bear in mind that it was designed by hardcore sys-admin types, so there are options (i.e. command-line switches) for almost anything you can think of.
However, there are generally only a few commands that cover almost all use cases, and we'll summarise those next.
Docker Command Cheatsheet
These commands are run in a terminal window on your Windows/Mac/Linux host machine:
Login to DockerHub from your host machine:
docker login
This will prompt you for the username and password you used to sign up with DockerHub.
Pull an Image to your local machine:
docker pull rocker/geospatial
This pulls the latest version of the rocker/geospatial image. You can also specify a particular version you want to use by including a tag:
docker pull rocker/geospatial:4.0.2
List images:
docker images
Displays all the images available on your host. The output looks something like the listing below:
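The repository names, ids, dates and sizes here are illustrative only:
REPOSITORY             TAG     IMAGE ID       CREATED        SIZE
chenobyte/geospatial   1.4     0a1b2c3d4e5f   2 weeks ago    4.1GB
rocker/geospatial      4.0.2   9f8e7d6c5b4a   3 months ago   4.0GB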
Launch a container (simple version):
docker run -it chenobyte/geospatial:1.4 /bin/bash
This creates a container using the chenobyte/geospatial:1.4 local image, connects to it and runs a bash shell inside the container. The -it switches ensure the container stays interactive, and you land inside the container as root.
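Your prompt will change to something like the line below (the hex string is the short container id, which will differ on your machine):
root@f3a1b2c4d5e6:/#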
Once inside, you can kill the container either by Ctrl+D or by typing exit. This drops you back out to the host terminal. You can also exit back to the host without killing the container by typing Ctrl+P then Ctrl+Q. This is known as detaching.
List containers:
docker ps -a
Displays all containers, including their status. Note that Docker automatically assigns a name to each instance, although you can override this if you really want to. The output looks something like the listing below:
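Some columns are trimmed here, and the id, image and auto-generated name are illustrative only:
CONTAINER ID   IMAGE                      COMMAND       STATUS          NAMES
f3a1b2c4d5e6   chenobyte/geospatial:1.4   "/bin/bash"   Up 10 minutes   quirky_curie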
Attaching to a Running Container:
Each container has a container identifier (the CONTAINER ID column in the listing above). You can attach to (jump back into) a running container using the container id:
docker attach <container_id>
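For example, using the illustrative id from the listing above:
docker attach f3a1b2c4d5e6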
Commit a Running Container:
This is useful when you’ve launched a container, made some changes while inside it (say installed some utilities or software) and now want to save the complete container state as a new image that you can reuse later.
The general format is:
docker commit <container_id> name:tag
As an example, suppose I’ve launched a container using image chenobyte/geospatial:1.4 and made some changes I want to keep (e.g. installed git). I can detach and save the current container state as a new image. If the container id is 1234 then I can make a new version:
docker commit 1234 chenobyte/geospatial:1.5
I now have a new image version which is just like any other image and can be used to launch new containers too.
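Running docker images again should now list the new tag alongside the old one (ids, dates and sizes again illustrative):
REPOSITORY             TAG   IMAGE ID       CREATED          SIZE
chenobyte/geospatial   1.5   1b2c3d4e5f6a   10 seconds ago   4.1GB
chenobyte/geospatial   1.4   0a1b2c3d4e5f   2 weeks ago      4.1GB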
Stopping a Running Container:
You stop a running container from the host using the container id:
docker stop <container_id>
Launch a container (complicated version):
docker run -it \
-p 8888:8888 \
-v /home/ec2-user/data:/data \
-e NB_UID=$(id -u) \
-e NB_GID=$(id -g) \
-e GRANT_SUDO=yes \
-e GEN_CERT=yes \
--user root \
chenobyte/geospatial:1.4 \
/bin/bash
This example looks daunting but is actually the most useful common case. The main issues are around making sure you have permissions inside the container to write to host directories and so on. You don’t need to understand all the -e options - you can just re-use them as is.
The -p switch exposes port 8888 inside the container as port 8888 on the host. This is useful when the container is running a server (RStudio, Jupyter etc.) - see the example below.
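For instance, the rocker images run an RStudio server inside the container, and mapping its port lets you use RStudio from a browser on the host. A minimal sketch - the port number and PASSWORD variable follow the rocker image conventions:
docker run -p 8787:8787 -e PASSWORD=mysecret rocker/geospatial:4.0.2
Then browse to http://localhost:8787 on the host and log in as user rstudio.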
The -v option is probably the most useful. This is a volume mapping, which says that the directory /home/ec2-user/data on the host should be made available inside the container at /data. In this way, data files or code on the host can be accessed from within the container itself. You can have as many mappings as you want here - see the sketch below.
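For example, to map separate data and code directories in one go (the host paths here are just placeholders):
docker run -it \
-v /home/me/data:/data \
-v /home/me/code:/code \
chenobyte/geospatial:1.4 \
/bin/bash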
There are many more options available for running containers, including specifying available CPUs and RAM, container lifecycle and so on. For more information check out the references in the resources section below.
Building Your Own Image
Docker images are built using a plain text file known as a Dockerfile. There is a wide range of options available (see resources below), but the main thing for us is to see how we can use an existing base image and specify some additional R packages to be installed.
The first line in a Dockerfile specifies the base image you want to use. For example, you might want to use a base image that contains a Shiny server. In that case (once you’ve found the image on DockerHub) you specify it in the Dockerfile:
FROM rocker/shiny:4.0.0
Note - while you don’t have to specify the tag (version), it’s best practice to do so. If you don’t, Docker will use the latest version - which can be updated without you knowing!
You can also specify code to be copied into the image, set the initial working directory, set environment variables and execute commands. In the example below we use the geospatial image as the base and install additional R packages.
Note that the process shown below in the RUN command is specific to these R images - it ensures that R packages and their dependencies are always installed from a fixed snapshot of CRAN at a specific date.
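A minimal sketch of such a Dockerfile - the package names (mvabund, ecoCopula) and the analysis.R file are just placeholders for whatever your own work needs, and install2.r is a helper script shipped with the rocker images that installs packages from their pinned CRAN snapshot:
FROM rocker/geospatial:4.0.2
RUN install2.r --error mvabund ecoCopula
COPY analysis.R /home/rstudio/analysis.R
WORKDIR /home/rstudio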
Once you’ve defined your Dockerfile you create the image and give it a tag. This needs to be run from the same directory as the Dockerfile (note the trailing full stop!). Here we’re saying that the new version of the image is to be tagged as version 1.5:
docker build --tag chenobyte/geospatial:1.5 .
Now your image is built, you can create containers with it and push the image back to your repository in DockerHub. This is the bit that can take a while, as it’s usually a multi-gigabyte upload:
docker push chenobyte/geospatial:1.5
And now you can share your work with colleagues as an entire self-contained unit.
Summary
This is just a quick introduction, but while Docker has a steep initial learning curve, it’s one that’s worth investing the time in. You don’t need to become an expert to take advantage of massive-scale compute power either.
Imagine you had a high-performance 10-node cluster that you wanted to run a simulation on. You would have to make sure every node had the same R environment, packages, versions and data. With a Docker image it doesn’t matter whether it’s a 10-node or 100-node cluster: you only need to specify the runtime environment once.
In our recent work we used Docker containers across a cluster of four 64-CPU nodes - this allowed us to fit 17 complete large-scale spatial models in parallel while making the best allocation of available cluster resources. This was a great realisation of the power of Docker, considering all of the model-fitting code was developed on a crotchety four-core, five-year-old Mac laptop!
There is far more below the surface that could be discussed about Docker but most of it is not really relevant to statisticians.
The main points presented here are hopefully useful for eco stats work and general research.
Resources
Docker Hub - https://hub.docker.com
Docker File Reference - https://docs.docker.com/engine/reference/builder
Docker Reference - https://docs.docker.com/engine/reference
RStudio and Docker - How To: Run RStudio products in Docker containers (RStudio Support)
Friday, 29 March 2019
Paper of the year 2018
The winning paper was:
Anders Sandberg (2018) Blueberry Earth. arXiv:1807.10553 [physics.pop-ph]
This paper was nominated by Gordana Popovic because:
It shows very unpretentiously what research involves. You have a question, you find all the research in the area, you mush the two together, you get an answer.
Honourable mentions go to:
Bonta, M et al. (2017) Intentional Fire-Spreading by “Firehawk” Raptors in Northern Australia. Journal of Ethnobiology
This paper was nominated by Ben Maslen because:
Dobrynskaya, Victoria and Kishilova, Julia (2018) LEGO - The Toy of Smart Investors.
This paper was nominated by Michelle (Shi Jie) Lim because:
I like this paper because Lego toys do not belong to the luxury segment and are affordable to most retail investors. Although the returns may not be significant in reality, the study shows that people are willing to pay a premium for Lego sets. Any Lego toy owner would probably find this paper relatable.
The other nominations in no particular order are:
Zhu and Bradic (2018) Linear Hypothesis Testing in Dense High-Dimensional Linear Models. Journal of the American Statistical Association.
This paper was nominated by David Warton because:
...developing an original new machinery for inference and applying it to a tricky problem, that of simultaneous inference for lots of parameters.
Miller and Sanjurjo (2018) Surprised by the Hot Hand Fallacy? A Truth in the Law of Small Numbers. Econometrica.
This paper was nominated by Robert Nguyen because:
I think this is interesting because it tackles something that generally people believe (there is a hot hand in sport) but as of yet there is no evidence it exists or is there?
Fraser et al. (2018) Questionable research practices in ecology and evolution. PLoS ONE.
This paper was nominated by Mitchell Lyons because:
I like these types of papers, and there's been a few lately, including some recent press on some editorials in Nature (Google those if you like). It helps me to gain context on why people (we, and me now) in ecology and evolution think about significance the way they do.
Yixin Wang & David M. Blei (2018) Frequentist Consistency of Variational Bayes. Journal of the American Statistical Association.
This paper was nominated by Elliot Dovers because:
I like this paper because it provides a firmer theoretical footing for a technique (more generally than has previously been done for variational approximations) that has perhaps been used willy-nilly for a long time in computer science (I'm not innocent here either). I also like that the authors give time to addressing (and linking) the technique with respect to both frequentist and Bayesian approaches. The result "bridges the gap in asymptotic theory between the frequentist variational approximation, in particular the variational frequentist estimate (VFE), and variational Bayes"... good to see authors with a less binary take on what it means to be a statistician.
Wednesday, 20 March 2019
Template Model Builder Tutorial
TMB Introduction Tutorial
Note: before installing TMB (by your usual means of installing an R package), you will need a working development environment to compile C++ code. On Windows you can just install the latest version of Rtools - follow the install guide here. On Mac OS or Linux, following the devtools install guide will do the trick - check it out here.
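Once that's in place, a quick sanity check run from a terminal is to install TMB and compile one of its bundled examples - the "simple" example name comes from TMB's own documentation, so adjust if your version differs:
R -q -e 'install.packages("TMB")'
R -q -e 'TMB::runExample("simple")'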
Thursday, 8 March 2018
Paper of the year 2017
Hallmann CA, Sorg M, Jongejans E, Siepel H, Hofland N, et al. (2017) More than 75 percent decline over 27 years in total flying insect biomass in protected areas. PLOS ONE 12(10)
This paper was nominated by John Wilshire, who summarises it as follows:
Flying insects play a very important role in ecosystems, both as pollinators and as food sources for other animals. This paper shows that their populations have massively declined over a relatively short period of time (at least in protected areas in Germany). I like this paper as it presents the results of a long term study, and it is a pretty scary example of the impacts we are having on ecosystems. Plus it is open access and has data and code available, and the statistical analysis is presented in a clear and easy to follow manner.
Thursday, 13 April 2017
Special Feature in Methods in Ecology and Evolution on Eco-Stats '15
https://methodsblog.wordpress.com/2017/04/12/generating-new-ideas/