Locating Epitopes Using Python

What are Epitopes?

Epitopes are regions of a protein that are recognised by antibodies or immune cells, and they can be represented by their amino acid sequence.

For example, B cells bind to epitope regions via their receptors, and this binding triggers an immune response. Below is an example of a B cell epitope sequence:

>Example_epitope
MDLIQSSFH

Identifying epitopes in a protein sequence is a crucial step in the development of epitope-based vaccines, which target specific epitopes of pathogens that can trigger a protective immune response.

How can we predict epitopes?

Several bioinformatics programs can predict B cell or T cell epitope regions in protein sequences, such as BepiPred, IgPred, TepiTool and DiscoTope.

Epitope prediction programs use properties of the input amino acid sequence, such as hydrophobicity and antigenicity score, to predict epitope regions.

Some of these tools, such as BepiPred-2.0, are also trained on datasets of known epitope and non-epitope sequences determined from crystal structures.
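
As a rough illustration of what a sequence-derived property looks like, the sketch below computes a sliding-window hydropathy profile using the Kyte-Doolittle scale. This is not how IgPred or BepiPred score a sequence; the function name hydropathy_profile and the window size are assumptions made purely for this example.

```python
from typing import List

# Kyte-Doolittle hydropathy values; higher means more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence: str, window: int = 7) -> List[float]:
    """Mean hydropathy of every full-length window along the sequence.

    Hypothetical helper for illustration only; the window size is an assumption.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Profile of the example epitope, one value per 5-residue window.
profile = hydropathy_profile("MDLIQSSFH", window=5)
print([round(score, 2) for score in profile])
```

Low (hydrophilic) stretches in a profile like this tend to be surface exposed, which is one reason such properties are used as input features by prediction tools.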

How to locate epitopes?

A while back I created a small Python program, find_seq_overlap.py, to locate the amino acid positions of overlapping epitopes, predicted using IgPred, in a protein sequence.

I wanted to do this so that I could plot epitopes onto protein structures. Because of the way a protein folds in 3D, the residues that make up a single 3D epitope may come from different regions of the 1D sequence (i.e. they may not be contiguous).

These epitopes are referred to as discontinuous epitopes.

Imagine bending a single piece of wire back on itself: two parts of the wire can end up touching, but when the wire is straightened out again the points of contact are in two different locations.

To illustrate, below is a plot in which I have located some B cell epitope regions, predicted using IgPred, on a transmembrane protein region.

The Y axis shows the antigenicity score; the peaks correspond to the higher antigenicity of the outer loop regions.

You can see two epitope regions that sit close together when mapped onto the 3D structure but are separated by ~15 amino acids in the 1D sequence.

To plot this data I identified the position of the epitopes in my input sequence using find_seq_overlap.py, which takes a protein sequence in FASTA format and an epitope file in multi-FASTA format.

An Example

Next I will run through how I used find_seq_overlap.py to locate the overlapping epitopes in the illustration above.

Below is an example of a protein sequence for an outer loop region taken from a transmembrane protein:

>Pep1
VYKDRIDSEIDTLMTGALDNPNKEITEFMDLIQSSFHCCGAKGPGDYKVDPPASCKGEQVVYDE

For this sequence we also have a file of IgG epitope regions predicted using IgPred:

>1
MDLIQSSFH
>2
MDLIQSSFH
>3
QSSFHCCGA
>4
PASCK
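
Note that MDLIQSSFH appears twice in this file, so a script that searches for every entry will look it up twice. As a minimal sketch (not part of find_seq_overlap.py), the epitope file could be parsed and de-duplicated before searching. The helper names parse_fasta and unique_epitopes are hypothetical and introduced only for this example.

```python
from typing import Dict, List

def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its sequence, joining wrapped lines."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records

def unique_epitopes(records: Dict[str, str]) -> List[str]:
    """Return the sequences in file order with duplicates removed."""
    return list(dict.fromkeys(records.values()))

epitope_text = """>1
MDLIQSSFH
>2
MDLIQSSFH
>3
QSSFHCCGA
>4
PASCK
"""

epitopes = unique_epitopes(parse_fasta(epitope_text))
print(epitopes)  # ['MDLIQSSFH', 'QSSFHCCGA', 'PASCK']
```

Parsing by header rather than by alternating lines also copes with sequences wrapped over several lines and with trailing blank lines.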

Below is the full code. In this updated version I have added some type annotations and an if __name__ == "__main__": statement.

As we are using Python, and Python is great with strings, we can think of our epitope input file simply as a bunch of substrings.

In this program I made use of Python's powerful re module, using re.finditer() with a lookahead pattern to find all overlapping string matches:

#!/usr/bin/env python

import sys
import re
from typing import List

def read_file(file_path: str) -> str:
    with open(file_path, "r") as file:
        return file.read()

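def find_epitopes_in_sequence(sequence: str, epitopes: List[str]) -> str:
    # Drop the FASTA header (first line) and join the remaining lines,
    # so a sequence wrapped over several lines is searched in full.
    a = "".join(sequence.splitlines()[1:])
    pos = ""

    for i in epitopes:
        # The lookahead (?=(...)) matches without consuming characters, so
        # overlapping hits are all reported; re.escape() treats the epitope literally.
        matches = re.finditer(r'(?=(%s))' % re.escape(i), a)
        # start(1) is the 0-based start index; end(1) is the exclusive end index.
        pos += str([('%01d %01d %s' % (m.start(1), m.end(1), m.group(1))) for m in matches]) + "\n"

    return pos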

def process_epitopes_file(epi: str) -> List[str]:
    with open(epi, "r") as epitopes:
        lines = epitopes.readlines()
        e = "".join(lines).split("\n")[1::2]
        return e

def main(seq: str, epi: str, out: str) -> None:
    sequence = read_file(seq)
    with open(out, "w") as positions:
        epitopes = process_epitopes_file(epi)
        pos = find_epitopes_in_sequence(sequence, epitopes)

        k = pos.replace("[]", "").replace("[", "").replace("]", "").replace("'", "")
        positions.write("".join(k))

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python find_seq_overlap.py sequence_file epitope_file output_file")
        sys.exit(1)
    seq = sys.argv[1]
    epi = sys.argv[2]
    out = sys.argv[3]
    main(seq, epi, out)
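
To see why the pattern is wrapped in a lookahead, here is a small standalone comparison on a toy string (the toy values are made up for illustration). A plain pattern consumes each match, so hits that overlap a previous hit are skipped, whereas the lookahead version reports every start position.

```python
import re

sequence = "AAAA"
epitope = "AA"

# Plain search: each match consumes its characters, so overlaps are missed.
plain = [m.start() for m in re.finditer(re.escape(epitope), sequence)]

# Lookahead search: the match itself is zero-width, the capture group holds the hit.
lookahead = [m.start(1) for m in re.finditer(r"(?=(%s))" % re.escape(epitope), sequence)]

print(plain)      # [0, 2]
print(lookahead)  # [0, 1, 2]
```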

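Running the script produces a text file with the start and end index of each epitope in the protein sequence. The indices are 0-based and the end index is exclusive, as returned by Python's match objects. The first two epitopes overlap and together span 28 to 41, while the third spans 51 to 56, so here we have a discontinuous epitope.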

28 37 MDLIQSSFH
32 41 QSSFHCCGA
51 56 PASCK
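
The two ranges can also be derived programmatically by merging overlapping hits. The sketch below is an optional extra step, not something find_seq_overlap.py does, and merge_ranges is a hypothetical helper name.

```python
from typing import List, Tuple

def merge_ranges(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge (start, end) ranges that overlap or touch; end is exclusive."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps (or directly follows) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

hits = [(28, 37), (32, 41), (51, 56)]
print(merge_ranges(hits))  # [(28, 41), (51, 56)]
```

Each merged range is one contiguous stretch of sequence, which is convenient when selecting residues to highlight on a structure.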

The code and example files can be found here.

I hope you find this script useful 🙂