The Neighbour-Joining algorithm: Ties and Matrix Order

Following on from the previous post Foundations of phylogenetics we’ll next explore how ties and matrix order can influence the output of the Neighbour Joining(NJ) algorithm.

Quick recap: How NJ works

Here’s a recap of the iterative NJ process below:

Figure 1. The iterative process of the NJ algorithm

Determinism and topologies

The NJ algorithm is deterministic, meaning if we run NJ on the same distance matrix multiple times we will always get the same tree – even if we encounter ties.

But what do we mean by ties?

A tie is when the NJ algorithm encounters more than one smallest Q value during its iteration through the distance matrix.

Why does this affect the final NJ tree?

When there is more than one minimum Q value, the algorithm must choose only one of them.

How does it choose?

Most implementations of the NJ algorithm scan across the matrix in a fixed order eg. left to right.

The result of this is that, depending on the order of values in your matrix, the chosen smallest Q-value will be the value that is encountered first.

An Exercise

This got me thinking, for a small enough matrix ,where we can generate all possible re-orderings of rows and columns, would we see multiple topologies?

NJ theory tells us that:

1) If a matrix is additive, we will only ever get one tree.

2) If the matrix is not completely additive we will get multiple topologies, depending on which pair of nodes with the smallest Q value is chosen.

Let’s test this theory out on a small 4×4 matrix example.

Generating permutations

With a symmetrical distance matrix the number of unique row/column orders is :

4! = 24

I borrowed a nice R function from here which generates all possible permutations of n objects.

permutations <- function(n){
    if(n==1){
        return(matrix(1))
    } else {
        sp <- permutations(n-1)
        p <- nrow(sp)
        A <- matrix(nrow=n*p,ncol=n)
        for(i in 1:n){
            A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
        }
        return(A)
    }
}

Now to change numeric output to letters for nicer labels:

all.perms<-matrix(letters[permutations(4)],ncol=4)
all.perms

This will give us all 24 permutations of our taxa (a, b, c and d)


[,1] [,2] [,3] [,4]
[1,] "a" "b" "c" "d"
[2,] "a" "b" "d" "c"
[3,] "a" "c" "b" "d"
[4,] "a" "c" "d" "b"
[5,] "a" "d" "b" "c"
[6,] "a" "d" "c" "b"
[7,] "b" "a" "c" "d"
[8,] "b" "a" "d" "c"
[9,] "b" "c" "a" "d"
[10,] "b" "c" "d" "a"
[11,] "b" "d" "a" "c"
[12,] "b" "d" "c" "a"
[13,] "c" "a" "b" "d"
[14,] "c" "a" "d" "b"
[15,] "c" "b" "a" "d"
[16,] "c" "b" "d" "a"
[17,] "c" "d" "a" "b"
[18,] "c" "d" "b" "a"
[19,] "d" "a" "b" "c"
[20,] "d" "a" "c" "b"
[21,] "d" "b" "a" "c"
[22,] "d" "b" "c" "a"
[23,] "d" "c" "a" "b"
[24,] "d" "c" "b" "a"

Reordering the matrix and building NJ trees

To reorder the matrix based on our permutations I wrote the following function:

perm.trees<-function(perms, matrix.in){
  
  tre.list<-NULL
  
  for(i in 1:nrow(perms)) {
    perm.order <- perms[i,]
    
    # reorder rows and cols
    mat<-matrix.in[perm.order, perm.order]
    
    # build tree
    nj.tree<-nj(mat)
    
    # add tree to tree list
    tre.list[[length(tre.list) + 1]] <- nj.tree
  }
  
  return(as.multiPhylo(tre.list))
}

Next we can go ahead and define our matrices.

An additive matrix

The additive matrix defined in this example was taken from a really nice post on NJ by Amelia Harrision.

After defining the matrix the code below uses our perm.trees function to generate an NJ tree for each reordered distance matrix.

# define matrix
mat.add <- matrix(c(
  0, 4, 5, 10,
  4, 0, 7, 12,
  5, 7, 0, 9,
  10, 12, 9, 0
), nrow = 4, byrow = TRUE)

# set row and colnames
rownames(mat.add) <- colnames(mat.add) <- c("a", "b", "c", "d")

# generate permutations
add.perm.trees <- perm.trees(all.perms, mat.add)

We then use ggplot to visualise each of the 24 trees.

# plot all NJ trees
ggtree(add.perm.trees, layout="rectangular") + facet_wrap(~.id, scales="fixed") + geom_tiplab()

Looking at the trees below, we see there is one topology i.e the relationships between taxa do not change after reordering the matrix.

For example, we see that a and b always cluster together and away from c and d:

A nearly additive matrix

Let’s try this again but with a nearly-additive matrix:

# define matrix
near.add <- matrix(c(
  0, 2, 2, 3,
  2, 0, 3, 2,
  2, 3, 0, 2,
  3, 2, 2, 0
), nrow = 4, byrow = TRUE)

# set row and colnames
rownames(near.add) <- colnames(near.add) <- c("a", "b", "c", "d")

# generate permutations
near.add.perm.trees<-perm.trees(all.perms, near.add)

# plot all NJ trees
ggtree(near.add.perm.trees, layout="rectangular") + facet_wrap(~.id, scales="fixed") + geom_tiplab()

The result is a bit different this time – now we have multiple topologies, for example in Tree#2 a and b cluster together yet in Tree#3 a and c cluster together.

This illustrates that when our input distance matrix isn’t completely additive we can get different topologies.

We’ll explore more NJ quirks in later posts!