Foundations of phylogenetics

For many years it was believed that patterns of evolution are tree-like and sometimes they are!

We now know that evolutionary relatedness can be more complexoften resembling networks. Despite this additional complexity, tree based methods remain powerful tools for studying evolutionary relationships.

This is due to the fact that biological information, such as genetic data, has an underlying hierarchical structure.

To produce a tree graph from hierarchical information, the tree must fit the data according to specific geometric constraints.

What is tree-like?

Tree-like data structure is a geometric property of distance matrices.

While the triangle inequality is a necessary condition for a tree metric, additional constraints are required for distances to be considered truly tree-like.

The two key constraints are the three-point and four-point condition.

But why do these constraints exist?

In a tree there is one unique path between any two nodes – and the path must go through their shared ancestor – whereas with a triangular structure there can be multiple paths between nodes.

Ultrametricity

Figure 1. Representing the three-point condition as (I) an isosceles triangle (II) a bifurcating tree.

Ultrametricity is a geometric property of distance matrices where distances conform to the ultrametric inequality known as the three-point condition.

The three-point condition requires that given three pairs of distances, two must be equal and both must be greater than the third distance.

It’s easier to think about these distance relationships in terms of an isosceles triangle (see fig 1) where we have two longer sides (of equal length) and one shorter side.

A tree is called ultrametric if it accurately reflects an ultrametric distance matrix.

Ultrametric trees assume a molecular clock – where all tips in the tree are the same distance from the root and there is a constant rate of evolution.

Algorithms such as UPGMA (Unweighted Pair Group Method with Arithmetic Mean) are used to build ultrametric trees, these are typically used in time-calibrated phylogenies using molecular clock models.

Additivity

Additivity is another geometric property of distance matrices, where distances conform to the four-point condition.

The four-point condition follows on from the three point condition, this time concerning sums of distances (see fig 2). It follows that, for any set of four distances, they must conform to the four-point rule.

Of the three sums below, two are equal and not less than the third:

  1. d(x,y) + d(z,t)
  2. d(x,z) + d(y,t)
  3. d(x,t) + d(y,z)

In other words among three sums of two pairs of distances there is no unique maximum.

A tree is called additive if it accurately reflects an additive distance matrix. Additive trees are less constrained than ultrametric trees, they allow variable edge lengths and do not assume a constant rate of evolution.

It has been proven that when a matrix is completely additive the NJ algorithm builds only one possible tree. It is important to note that even this single tree does not necessarily represent the correct evolutionary history.

This will ultimately depend on how well the signal in the data reflects evolutionary history.

Figure 2. The four point rule

But what if our data isn’t additive?

Interestingly the NJ algorithm can accurately represent a distance matrix, even when the data does not conform strictly to the four-point rule.

There is a limit to this flexibility however, quantified via Atteson’s theorum – where NJ will reconstruct the correct tree if the matrix deviates by no more than half the length of the shortest internal edge.

When a matrix is not additive or nearly-additive there is more uncertainty in evolutionary relationships in the tree.

Collectively these different evolutionary hypotheses are termed the tree space.

This might sound mysterious but the tree space is essentially all of the possible tree topologies that could be constructed from our data.

Variation in topology in tree space – caused by conflicting phylogenetic signal – is one reason why we use resampling methods such as bootstrapping, to assess tree robustness.

As there are potentially multiple possible topologies, with different relationships in the tree, we cannot know which tree – if any – represents the true relationships in our data.

This means the suitability of heritable data used in phylogenetic analysis must be carefully considered to ensure that the data contains enough useful information (eg non-conflicting, tree-like signal) that can be used to build both robust (with few alternative hypotheses) and accurate trees (reflecting evolutionary history).

What is the Neighbour-Joining (NJ) algorithm?

The NJ algorithm is an iterative hierarchical clustering method which loops through values in an initial distance matrix and builds a new matrix of tree distances.

Note that a distance matrix is a square matrix containing pairwise dissimilarities between data points.

The algorithm can be thought of as a dimensional reduction method, capable of reducing the information within a distance matrix to a lower dimensional space.

It’s efficient and can be solved in polynomial time where, for a set of N data points, the algorithm performs N3 operations.

The NJ algorithm begins with a distance matrix where each pairwise distance is considered as a node on a graph.

Each iteration produces an adjusted distance matrix that is reduced by one less column as two neighbours are joined into a single node.

Figure 3. The iterative process of the NJ algorithm

In addition to calculating the total pairwise distances between data points a Q value is calculated for each pair.

The Q value measures the difference between the total edge length before and after merging two taxa.

The algorithm aims to minimise the total edge length and so, in each iteration, the pair with the smallest Q value is identified and merged into a single node.

The pairwise distances from this new node to every other node are then calculated, the distance matrix is updated and the process repeats until there are only two nodes left.

To successfully use the NJ algorithm for evolutionary analyses we need to understand; 1) how well the tree distances fit the underlying distance matrix and 2) if the distance matrix contains enough information to trace evolutionary history.

We can identify if the NJ algorithm is suitable for a distance matrix by plotting the correlation of cophenetic distance (sum of the branch lengths along the path connecting them) to the original distance matrix.

When a tree is additive the cophenetic distance between each pair of taxa is equal to the original distance.

Additivity is a key property for many tree building approaches. Starting trees for methods such as maximum likelihood and bayesian inference are often generated using algorithms like Neighbour-Joining, which approximate additive distance matrices.

Understanding how geometric structures within our data, such as additivity, shapes phylogenetic analysis helps us make more informed choices about data, evolutionary models and interpretation of evolutionary relationships.