Coalescent trees and bugs in a box

Imagine tracing a tree diagram backwards in time, from the tips all the way back to its root.

At each node the lineages coalesce, or merge, into one.

As we follow the nodes back in time, all lineages eventually combine into a single ancestral lineage, the most recent common ancestor (MRCA).

In 1982 Kingman showed that this process of merging lineages backwards in time can be described mathematically as a stochastic process termed the n-coalescent.

Importantly, this process considers the coalescence of gene copies, not just individuals.

Kingman, Hudson, Tajima and many others have contributed to and expanded coalescent theory, applying it to the study of genealogies and population history.

Today, coalescent models and trees are widely used in population genetics and phylogenetics to infer population demographic changes, reconstruct evolutionary histories and study genetic diversity across populations.

In this post we will keep things simple and focus on coalescent theory for haploids, where each individual carries only one gene copy.

Bugs in a box

Joe Felsenstein introduced the analogy of ‘bugs in a box’ to help visualise the n-coalescent.

Imagine we are tracking a number of bugs (k) in a box.

The hyperactive bugs move about randomly. Occasionally two bugs collide, at which point one instantly eats the other.

Over time, the number of bugs decreases from k to k-1, k-2, and so on, until only a single bug remains.

In this analogy, collisions represent coalescent events, and the probability of a collision is determined by the density of bugs.

The density of bugs depends on:

1) The number of pairs of bugs we are following, k(k-1)/2.

2) The size of the box (N_e).

For haploids, the size of the box (N_e) represents the total number of gene copies in the population.
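The two ingredients of bug density can be put into numbers with a minimal sketch. It assumes the density is simply the number of pairs divided by the box size, which is the form the total coalescence rate takes later in this post. The function names `num_pairs` and `collision_rate` and the box size of 1000 are illustrative choices, not part of Felsenstein's analogy.

```python
from itertools import combinations


def num_pairs(k: int) -> int:
    """Number of distinct pairs among k bugs (lineages): k(k-1)/2."""
    return k * (k - 1) // 2


def collision_rate(k: int, n_e: float) -> float:
    """Pairs divided by box size, i.e. k(k-1)/(2*N_e).

    Assumption: every pair is equally likely to collide, at rate 1/N_e.
    """
    return num_pairs(k) / n_e


# Cross-check the formula by listing every pair explicitly.
bugs = ["a", "b", "c", "d", "e"]
assert len(list(combinations(bugs, 2))) == num_pairs(len(bugs))

# As bugs are eaten, the number of pairs (and so the rate) drops quickly.
for k in range(5, 1, -1):
    print(f"k={k}: pairs={num_pairs(k)}, rate={collision_rate(k, n_e=1000):.4f}")
```

Going from 5 bugs to 2 cuts the number of pairs from 10 to 1, so collisions become rarer as the box empties.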

Coalescent trees

The n-coalescent can be represented as a genealogical coalescent tree.

In a coalescent tree, nodes represent coalescent events (the points where lineages merge), and branch lengths correspond to the waiting times between these events.

The waiting time (T) is determined by the number of lineages (k) and the effective population size (N_e), where N_e is the number of gene copies contributing to the next generation.

Coalescent trees encode information about population history, or ‘population demographics’, through characteristic tree shapes.

The n-coalescent captures ancestry in a probabilistic way; think back to our ‘bugs in a box’ example.

When there are more lineages (k), waiting times are shorter because the probability of a coalescent event is higher.

If N_e is large, the probability of a coalescent event is lower, resulting in longer waiting times.
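Both effects can be seen in a small simulation. This is a sketch under one stated assumption: each waiting time T_k is drawn from an exponential distribution with rate k(k-1)/(2N_e), the standard continuous-time form of Kingman's coalescent for haploids. The function names, the seed, the number of replicates and the N_e values are arbitrary illustrative choices.

```python
import random


def simulate_waiting_times(n: int, n_e: float, rng: random.Random) -> dict[int, float]:
    """Draw one waiting time T_k for each k = n, n-1, ..., 2.

    Assumption: T_k ~ Exponential(rate = k(k-1) / (2 * N_e)).
    """
    times = {}
    for k in range(n, 1, -1):
        rate = k * (k - 1) / (2 * n_e)
        times[k] = rng.expovariate(rate)
    return times


def mean_waiting_times(n: int, n_e: float, n_reps: int, seed: int) -> dict[int, float]:
    """Average each T_k over n_reps independently simulated genealogies."""
    rng = random.Random(seed)
    totals = {k: 0.0 for k in range(n, 1, -1)}
    for _ in range(n_reps):
        for k, t in simulate_waiting_times(n, n_e, rng).items():
            totals[k] += t
    return {k: total / n_reps for k, total in totals.items()}


# Compare a small box with a box ten times larger.
for n_e in (1000, 10000):
    means = mean_waiting_times(n=5, n_e=n_e, n_reps=20000, seed=42)
    for k, mean_t in means.items():
        print(f"N_e={n_e}, k={k}: mean waiting time ~ {mean_t:.0f} generations")
```

The averages should shrink as k grows and scale up roughly tenfold with the larger N_e, while individual draws from `simulate_waiting_times` vary widely from run to run.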

Expected waiting times

If we know the number of lineages (k) and the effective population size (N_e), we can calculate the expected waiting time E[T_k] for each coalescent event.

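The expected waiting time E[T_k] differs from the waiting time T mentioned earlier: T is a random quantity, so the branch lengths of any single tree fluctuate, whereas E[T_k] is the average waiting time while k lineages remain.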

Using the expected value gives us a typical branch length that scales with N_e.

We can calculate the expected waiting time using the following equation:

E[T_k] = 2N_e / (k(k-1))

In each coalescent event exactly two lineages merge. With k lineages there are k(k-1)/2 possible pairs.

Each pair coalesces at a rate of 1/N_e per generation.

As any pair can coalesce, the total rate of coalescence is the sum over all pairs.

The expected waiting time until the next coalescent event is the inverse of this total rate, where the total rate can be calculated as the number of possible pairs multiplied by the rate per pair:

k(k-1)/2 × 1/N_e = k(k-1)/(2N_e)

Once inverted, this gives the expected waiting time for each coalescent event: E[T_k] = 1/(k(k-1)/(2N_e)) = 2N_e / (k(k-1)).

These expected waiting times will become our branch lengths in the coalescent tree, ultimately influencing tree shape.
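As a worked example of the formula, the sketch below lists the expected waiting time for each interval of a tree with five sampled lineages. The sample size of 5, N_e = 1000 and the function names are illustrative assumptions; times are in generations.

```python
def expected_waiting_time(k: int, n_e: float) -> float:
    """E[T_k] = 2 * N_e / (k * (k - 1)), in generations."""
    if k < 2:
        raise ValueError("need at least two lineages for a coalescent event")
    return 2 * n_e / (k * (k - 1))


def expected_intervals(n: int, n_e: float) -> list[tuple[int, float]]:
    """Expected waiting time for every interval, from k = n down to k = 2."""
    return [(k, expected_waiting_time(k, n_e)) for k in range(n, 1, -1)]


for k, t in expected_intervals(n=5, n_e=1000):
    print(f"{k} -> {k - 1} lineages: {t:.1f} generations")
# 5 -> 4 lineages: 100.0 generations
# 4 -> 3 lineages: 166.7 generations
# 3 -> 2 lineages: 333.3 generations
# 2 -> 1 lineages: 1000.0 generations
```

The final interval, while only two lineages remain, is expected to be the longest, which is why coalescent trees typically have short branches near the tips and long branches near the root.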

Why the coalescent is useful

By analysing the branch lengths (waiting times) and the number and distribution of nodes in a coalescent tree, researchers can estimate effective population sizes, detect bottlenecks or expansions and model changes in population structure over time.

What makes the ‘backwards in time’ coalescent approach so powerful is that it doesn’t require sampling a large number of gene copies.

Going back to our ‘bugs in a box’ analogy, tracking even a small number of bugs (k) can reveal the coalescent history of the entire box: a small sample of lineages can capture the probabilistic patterns of ancestry in a population.
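One way to see why small samples go a long way is to add up the expected waiting times from the formula in this post, which gives the expected time back to the MRCA of a sample. The sketch below does this for several sample sizes. The function name `expected_tmrca`, the sample sizes and N_e = 1000 are illustrative assumptions, and the sketch covers expected tree depth only, not every aspect of population history.

```python
def expected_tmrca(n: int, n_e: float) -> float:
    """Sum of E[T_k] = 2*N_e / (k*(k-1)) for k = 2..n, in generations."""
    if n < 2:
        raise ValueError("need at least two sampled lineages")
    return sum(2 * n_e / (k * (k - 1)) for k in range(2, n + 1))


n_e = 1000
for n in (2, 5, 10, 100):
    depth = expected_tmrca(n, n_e)
    # The sum telescopes to 2*N_e*(1 - 1/n), so it can never exceed 2*N_e.
    print(f"n={n}: expected depth ~ {depth:.0f} generations "
          f"({depth / (2 * n_e):.0%} of 2*N_e)")
```

A sample of just 10 lineages already reaches about 90% of the maximum expected depth of 2N_e, so adding more samples mostly adds short branches near the tips.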

Insights from coalescent theory form the foundation of many modern population genetics and phylogenetics analyses, from estimating mutation rates to reconstructing population histories.