Containerisation for research software

Containerisation is the process of packaging software together with its dependencies (e.g. tools, libraries, configurations) into a container.

Using containerisation improves the repeatability of scientific research by ensuring software environments remain consistent across different systems over time.

Why do we need containerisation?

When running bioinformatics software, our computer uses certain components of its operating system, alongside the necessary tools and environments.

As both bioinformatics and operating system software change over time, it becomes difficult to repeat an analysis using the same versions and dependencies.

One method is to use Conda (see Using Conda for a Python Project) to create environments on your computer where you can install specific dependencies.
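
For example, we could create an environment with pinned versions of the tools we need (the environment name and versions below are just placeholders):

    # Create an isolated environment with specific, pinned tool versions
    # (environment name and versions are placeholders)
    conda create -n qc-env -c conda-forge -c bioconda fastqc=0.12.1 samtools=1.17

    # Activate it before running the analysis
    conda activate qc-env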

In many cases using a Conda environment is sufficient to repeat a workflow, but time becomes an important factor here.

In Conda environments, the versions of dependencies and packages can change over time, making workflows less reproducible.

On top of this, when working with multiple Conda environments, we can end up with multiple copies of the same dependencies.

This is where the concept of containerisation comes in.

What is a container?

Containers are lightweight packages created from images.

The image defines everything the container needs to run, including all of the necessary code, packages and settings.
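
As a rough sketch, a minimal Dockerfile (the recipe Docker builds an image from) could look something like this, where the base image, tools and script names are only illustrative:

    # Dockerfile - start from a small base operating system layer
    FROM ubuntu:22.04

    # Install the tools the analysis needs (illustrative package)
    RUN apt-get update && apt-get install -y samtools && rm -rf /var/lib/apt/lists/*

    # Copy our own analysis script into the image
    COPY run_analysis.sh /usr/local/bin/run_analysis.sh

    # Define the default command the container will run
    CMD ["bash", "/usr/local/bin/run_analysis.sh"]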

Imagine containers as a software pick-n-mix where you choose only the dependencies that you really need.

Importantly, with containers we do not need to keep multiple copies of the same dependency when using slightly different workflows.

This is due to how dependencies are reused by containerisation tools.

In Docker, images are created in layers, where each build instruction stacks a new layer on top of the previous one.

These layers can be reused across images, saving space and time!
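
For example, if we build two versions of the hypothetical image above and only the final COPY instruction has changed, Docker rebuilds and stores just that last layer while the base and install layers are shared:

    # Build the image from the Dockerfile in the current directory
    docker build -t my-analysis:1.0 .

    # After editing only the copied script (the final layer), rebuilding
    # reuses the cached base and install layers, so only the last layer
    # is rebuilt and stored again
    docker build -t my-analysis:1.1 .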

For Singularity, if a dependency is already available in an existing image we do not need to rebuild it; in short, we can reuse parts of pre-built images.
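
A minimal sketch of this idea is a Singularity definition file that starts from an existing Docker image and only adds what is missing (the base image and package here are just examples):

    # example.def - reuse a pre-built Docker image as the starting point
    Bootstrap: docker
    From: ubuntu:22.04

    %post
        # only add what the pre-built image does not already provide
        apt-get update && apt-get install -y samtools

Building it with singularity build example.sif example.def then gives us a container image without rebuilding anything that already exists upstream.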

Containerisation also directly enables reproducible research: our input data can be copied into the image so that, if we want to, we can repeat the exact same analysis, although for large or sensitive datasets mounting the data at runtime is preferred.
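
As a quick, illustrative sketch of the two approaches (all paths and image names here are placeholders):

    # Option 1: bake the data into the image at build time,
    # e.g. with a line like this in the Dockerfile:
    #   COPY data/reads.fastq /data/reads.fastq

    # Option 2: keep large or sensitive data outside the image
    # and mount it into the container at runtime instead
    docker run -v /path/on/host/data:/data my-analysis:1.0

    # The Singularity equivalent uses --bind
    singularity exec --bind /path/on/host/data:/data example.sif ls /data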

In the next few posts we will go through examples of using Docker and Singularity!