RSECon24 at Newcastle University
Recently I attended RSECon24, the annual research software engineering conference hosted by the Society of Research Software Engineering, which ran from the 3rd to the 5th of September.
As with the Exascale workshop, I was able to obtain funding from the Society of Research Software Engineering to attend this event, for which I am very grateful. Thank you Soc RSE!
As Research Software Engineering is a relatively new profession, it was reassuring to see considerable thought being given to the future of RSE as a career, and significant effort had clearly gone into making the conference as inclusive as possible.
It was nice to catch up with familiar faces from the Exascale workshop and make some new connections. The conference also gave me the experience of chairing my first session.
I met RSEs from a myriad of different backgrounds, including academics, informaticians and software engineers, and many had interesting stories about how they had got into the field.
Most people I met shared a common goal, namely the desire to produce quality, reproducible software and to keep one foot in a research environment whilst doing so. Overall there was a collaborative spirit, aka good vibes!
Exascale workshops at RSECon24
I attended two Exascale workshops at RSECon24. The AIRR workshop introduced us to Dawn, an exascale-class supercomputer co-designed by Dell, Intel and the University of Cambridge.
There is some fascinating work going on that makes use of this great resource, such as the digital twins idea; you can read more about Dawn here.
For this post I will keep it brief and summarise the Isambard-AI workshop.
Introducing Isambard-AI
I first heard about the Isambard-AI cluster at the Exascale workshop back in July, and since then I have been eager to try it out.
Luckily for me, RSECon24 hosted an ‘Introducing Isambard-AI’ workshop, run by the University of Bristol and NVIDIA, where I got temporary access to the cluster!
The workshop aimed to introduce us to accessing the cluster and to illustrate the limits of Isambard-AI’s Grace Hopper Superchips.
The Grace Hopper Superchips (GH200) are combined CPU (Central Processing Unit) + GPU (Graphics Processing Unit) chips with high memory bandwidth. Memory bandwidth is the rate at which data can be transferred between the CPU/GPU and the memory (RAM).
In the GH200 the CPU and GPU are physically connected by an NVLink-C2C interconnect so that they can share memory access. See the diagram below:

Each Grace CPU has 72 high-performance cores with 512 GB/s of memory bandwidth, and the Hopper GPU has 18,432 general-purpose CUDA cores for parallel processing.
The Hopper GPU also has 576 Tensor Cores, which are designed to accelerate matrix multiplications (handy for tasks such as LLMs, which rely heavily on these calculations).
The NVLink-C2C interconnect provides 900 GB/s of bandwidth between the Grace CPU and the Hopper GPU, allowing the CPU and GPU to access up to 512 GB of shared memory.
In the workshop we were given access to a single GH200 chip, setting the number of CPU threads to 72 to match the 72 Grace cores.
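I did not note down the exact commands we used, but as a rough sketch, pinning the thread count from a notebook could look something like this (the approach and numbers are illustrative, not the workshop's actual setup):

```python
import os

# Ask OpenMP-based maths libraries to use all 72 Grace CPU cores
# (set before importing libraries that read this variable).
os.environ["OMP_NUM_THREADS"] = "72"

import torch

# Tell PyTorch the same thing for its CPU thread pool.
torch.set_num_threads(72)
print(f"PyTorch will use {torch.get_num_threads()} CPU threads")
```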
First we ran an interactive chatbot, ‘IsamBot’, which is based on the Phi-3-mini model, an LLM with 3.8 billion parameters.
The performance of the language model was measured in tokens generated per second.
A token here means a word or part of a word, so we are measuring how many of these words or parts of words the model produces each second.
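The measurement itself is straightforward. Here is a minimal sketch of what it might look like using the Hugging Face transformers library; the checkpoint name is my assumption and IsamBot's actual model and setup may differ:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; IsamBot's exact model/configuration may differ.
model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Tell me about Isambard Kingdom Brunel."
inputs = tokenizer(prompt, return_tensors="pt")

# Time the generation and count only the newly produced tokens.
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s = {new_tokens / elapsed:.1f} tokens/s")
```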
We also monitored the CPU performance via the command line tool top. As we were working from a notebook we could also use NVDashboard to monitor the GPU.
NVDashboard is a JupyterLab extension where you can select tools such as ‘GPU Memory’ and ‘GPU Utilisation’. This was new to me and I found it a really nice visual way of monitoring the GPU.
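If you prefer to query the same numbers programmatically rather than through the dashboard, NVIDIA's NVML bindings (the nvidia-ml-py package, imported as pynvml) are one option; a rough sketch:

```python
import pynvml  # from the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first (and here, only) GPU

util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"GPU utilisation: {util.gpu}%")
print(f"GPU memory used: {mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB")

pynvml.nvmlShutdown()
```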
Next we mapped the model to the GPU, which improved the performance in tokens per second.
Mapping to the GPU stops us running the model on the CPU, which is far less efficient for our language model.
Remember, the Hopper GPU is designed specifically to handle operations such as matrix multiplications, so we would much rather run our model on the GPU than the CPU.
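In a PyTorch/transformers setting, this mapping can be as simple as moving the model weights and inputs onto the GPU device. A minimal sketch, reusing the same hypothetical Phi-3-mini setup as above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"  # hypothetical, as before
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the weights in half precision and place them on the Hopper GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")

prompt = "Tell me about Isambard Kingdom Brunel."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # inputs on the same device
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```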
We then tried a larger language model, Llama 3 with 70 billion parameters, but uh oh, we ran out of memory!!
In this case, even with 512 GB of shared memory, running a model with this many billions of parameters exceeded the available memory of our single chip.
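A rough back-of-the-envelope calculation (my own, not from the workshop) shows why big models get expensive quickly: the weights alone for a 70-billion-parameter model take roughly 140 GB in 16-bit precision, or around 280 GB in 32-bit, before counting activations, the KV cache or any loading overhead, so a single chip's memory can be exhausted surprisingly fast.

```python
# Back-of-the-envelope estimate of the memory needed just for the weights (illustrative only).
params = 70e9  # Llama 3 70B

for bits in (32, 16):
    weights_gb = params * (bits / 8) / 1e9
    print(f"{bits}-bit weights: ~{weights_gb:.0f} GB")
# ~280 GB in 32-bit, ~140 GB in 16-bit, before activations, KV cache or loading overheads.
```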
Allocating different parts of the model to different GPUs would likely remedy this; however, it is a reminder that even though Isambard-AI is extremely powerful there are still limitations, e.g. for LLMs with many billions of parameters.
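One common way to do that split, if several GPUs are visible, is to let the Hugging Face accelerate machinery shard the weights automatically. A hedged sketch (the model name is illustrative and access to it is gated):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-70B-Instruct"  # illustrative; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" (via the accelerate library) spreads layers across all
# visible GPUs, spilling to CPU memory if the GPUs alone are not enough.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
print(model.hf_device_map)  # shows which device each block of layers landed on
```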
Something to think about for when Isambard-AI becomes accessible for your own projects in the near future!
You can find more details about the Isambard-AI tutorial here, and about Dawn here.