Clustering to Reveal Hidden Traits: Using Neighbour-Joining Trees

Before we dive into the theory behind the Neighbour-Joining algorithm, I wanted to work through a simple conceptual exercise to highlight the usefulness of tree algorithms such as NJ in revealing hidden traits in a dataset.

Imagine we have 6 objects of different shapes: circles, triangles and squares.

Our goal is to group them by colour – but imagine that we can’t observe the colours directly. They do exist but they remain hidden from us.

Based on prior research we suspect that the traits shape, pattern and weight might be linked to colour.

To explore this hypothesis, we can use a clustering algorithm to give us a visual representation of the differences between our objects for the traits we believe are associated with the colour property.

An Example in R – Can NJ clustering recover the underlying colour groups?

Time to try out NJ clustering. For this example I’ve created a dummy dataset where some associations exist between colour and our observed data.

Let’s see if we can recover the colour groups using the NJ algorithm.

We’ll use the ape library in R for NJ tree building.

Create the data frame

First we create the data frame to hold our measurements and observations.

Note that we include the hidden colours in this dataset – but don’t worry! This won’t be used by the clustering algorithm and is only there to test our hypothesis later on.

library(ape)

# colour is kept only to check our hypothesis later - it is never
# passed to the clustering algorithm
df <- data.frame(
  object = paste0("obj", 1:6),
  shape = c("triangle", "triangle", "circle", "square", "circle", "square"),
  pattern = c("stars", "stars", "dots", "dots", "dots", "dots"),
  weight = c(50, 50, 70, 70, 70, 70),
  colour = c("red", "red", "green", "green", "green", "green")
)

print(df)
  object    shape pattern weight colour
1   obj1 triangle   stars     50    red
2   obj2 triangle   stars     50    red
3   obj3   circle    dots     70  green
4   obj4   square    dots     70  green
5   obj5   circle    dots     70  green
6   obj6   square    dots     70  green

Simple numerical encoding

As we have categorical data as well as numeric data (and we want to calculate distances between objects), we encode the categorical values as numbers, creating a new data frame for our encoded data:

# factor() assigns levels alphabetically, so circle = 1, square = 2, triangle = 3
df$shape   <- as.integer(factor(df$shape))
# and dots = 1, stars = 2
df$pattern <- as.integer(factor(df$pattern))
df.encoded <- df[, c("shape", "pattern", "weight")]
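
Printing df.encoded at this point shows the integer codes in place of the categorical values:

print(df.encoded)

  shape pattern weight
1     3       2     50
2     3       2     50
3     1       1     70
4     2       1     70
5     1       1     70
6     2       1     70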

Scale the data

Remember that distances can be skewed when traits vary in scale. In this example we carry out min-max normalisation to account for the difference in scale between weight and our encoded traits, shape and pattern.
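
Min-max normalisation rescales each trait to the range [0, 1]:

x' = (x - min(x)) / (max(x) - min(x))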


df.scaled <- as.data.frame(lapply(df.encoded, function(x) {
  (x - min(x)) / (max(x) - min(x))
}))

We then label the rows with our object names to use as tip labels on the tree.

rownames(df.scaled) <- df$object

Your scaled data frame with row names should look like this:

print(df.scaled)

     shape pattern weight
obj1   1.0       1      0
obj2   1.0       1      0
obj3   0.0       0      1
obj4   0.5       0      1
obj5   0.0       0      1
obj6   0.5       0      1

Build the distance matrix and NJ tree

Next we build the Euclidean distance matrix using R’s dist() function. Each of our objects has three traits, so it is a point in 3D space, and the Euclidean distance between two objects is calculated as:

d = √((x2 - x1)² + (y2 - y1)² + (z2 - z1)²)
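
For example, using the scaled values above, the distance between obj1 (1.0, 1, 0) and obj3 (0.0, 0, 1) is:

d = √((1 - 0)² + (1 - 0)² + (0 - 1)²) = √3 ≈ 1.73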


dist.mat <- dist(df.scaled, method = "euclidean")
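
As a quick sanity check we can print the matrix. With the scaled values above, the pairwise distances work out as follows (note the zero distances between identical rows, such as obj1 and obj2):

print(round(dist.mat, 2))

     obj1 obj2 obj3 obj4 obj5
obj2 0.00
obj3 1.73 1.73
obj4 1.50 1.50 0.50
obj5 1.73 1.73 0.00 0.50
obj6 1.50 1.50 0.50 0.00 0.50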

Finally we use the nj() function to build an NJ tree from our distance matrix and plot the result.

nj.tree <- nj(dist.mat)

plot(nj.tree,
     # match colours to the tree's own tip order rather than
     # assuming it equals the row order of df
     tip.color = df$colour[match(nj.tree$tip.label, df$object)],
     cex = 1.2)
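
Since NJ produces an unrooted tree, it can also be worth viewing it without an implied root. ape’s plot.phylo() supports this through its type argument:

plot(nj.tree,
     type = "unrooted",
     tip.color = df$colour[match(nj.tree$tip.label, df$object)],
     cex = 1.2)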

Interpreting the tree

Below is our final NJ tree! I’ve annotated the object shapes and their (hidden) colours.

We can see that in our dummy example the NJ tree has clustered our shapes by their hidden colour.

Without access to our colour labels, we could present a hypothesis from this tree: obj1 and obj2 likely share the same colour, and it differs from the colour of the other objects.

We could also infer that obj3, 4, 5 and 6 belong to a second colour group – though there is some variation in distance within this group, likely a result of using shape as a trait in our distance matrix, as both circles and squares are green.
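
We can put a number on that variation. One option, sketched below using ape’s cophenetic() method (which returns pairwise tip-to-tip distances along the tree’s branches), is to compare the average distance within the suspected green group to the average distance between the two groups. The group assignments here come from our hidden labels and are for illustration only:

tree.dist <- cophenetic(nj.tree)  # tip-to-tip distances along the branches

green <- c("obj3", "obj4", "obj5", "obj6")  # suspected green group
red   <- c("obj1", "obj2")                  # suspected red group

gg <- tree.dist[green, green]
mean(gg[upper.tri(gg)])      # average distance within the green group
mean(tree.dist[red, green])  # average distance between the groups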

Visualising these distances gives us a powerful method for exploring structure within our data – whether we are testing hypotheses or searching for unknown patterns.

This is just a simple example, but I hope it helps to introduce the wider concept of using clustering algorithms before we move on to the next post and dive into how the NJ algorithm works.