Translating open reading frames (orfs)

What is an open reading frame?

The open reading frame, or orf for short, is the region of DNA between a translation initiation site (start codon) and a translation termination site (stop codon) that is ultimately translated by the ribosome.

In eukaryotes, the orf is the coding sequence that remains after the intronic sequence is spliced out.

In prokaryotes without introns¹ the orf is the coding region that is transcribed directly into mRNA.

¹ Most prokaryotes we know of don’t have introns, but some have introns in their tRNA and rRNA.

What is the longest open reading frame?

Unless we are directly determining the protein sequence using mass spectrometry, the usual method of determining a protein sequence from genetic data is to predict it by decoding the DNA sequence into an amino-acid sequence.

When predicting a protein we often want to find the longest open reading frame.

This corresponds to the longest, uninterrupted sequence of amino acids that we can get from a given sequence.

Why is this important?

Proteins are 3D conformations of folded amino acid sequences – so the longer the sequence the more likely it is to be able to fold into a functional domain.

So, generally – the longer the amino-acid sequence the more likely it is to code for a protein².

² There are exceptions – it’s biology! – e.g. small peptides can be functional too – but for now let’s keep it simple.

Quick recap

Template Strand vs. Coding Strand:
- The template strand, also known as the antisense or noncoding strand, serves as the template for RNA synthesis.
- The non-template strand is referred to as the coding or sense strand because its sequence is the same as the RNA transcript (except thymine is replaced by uracil in RNA).
Transcription and Translation:
- RNA polymerase transcribes the DNA template strand into RNA. This RNA molecule, called messenger RNA (mRNA), is then translated by ribosomes into a sequence of amino acids, which eventually folds into a functional protein.
- The ribosome reads the mRNA in a 5′ to 3′ direction, and the translation process occurs in a fixed direction along the mRNA molecule.
Direction of Translation:
- Ribosomes translate mRNA in a 5′ to 3′ direction, starting from the start codon (usually AUG) and proceeding towards the stop codon.
Reading Frames and Open Reading Frames (ORFs):
- The genetic code is read in frames: translation can start from the first, second, or third nucleotide of a sequence, and each choice groups the nucleotides into a different set of codons (a codon is a triplet of nucleotides that codes for an amino acid).
- An open reading frame (ORF) is a sequence of DNA or RNA that could potentially be translated into a protein. It begins with a start codon (AUG) and ends with a stop codon (UAA, UAG, or UGA).
Start Codon and Translation Initiation:
- The start codon (AUG) signals the beginning of translation. Ribosomes recognise the start codon and initiate translation at this point.
- Translation initiation factors help position the ribosome on the mRNA, ensuring that it starts translating from the correct start codon.
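
The relationship between the two strands and the transcript in this recap can be sketched in a few lines of Python. This is a minimal illustration, not a model of transcription: the sequence and the function names (`reverse_complement`, `transcribe`) are made up for the example, and both strands are written 5′ to 3′.

```python
# Map each DNA base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe(coding_strand: str) -> str:
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")

coding_strand = "ATGGTACTAGAT"  # hypothetical coding (sense) strand
template_strand = reverse_complement(coding_strand)
mrna = transcribe(coding_strand)

print(template_strand)  # ATCTAGTACCAT
print(mrna)             # AUGGUACUAGAU
```

Taking the reverse complement twice returns the original strand, which is why it does not matter which of the two strands we are handed, as long as we check both.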

How do we find the longest orf?

When presented with a DNA sequence we do not know if we have the sequence of the template strand (antisense) or the non-template strand (sense).

The ribosome always reads 5′ to 3′, but the coding sequence could lie on either strand, so we have to consider that our longest open reading frame may be in the forward OR reverse orientation.

Furthermore, the translated sequence will vary depending on where the ribosome starts translating from.

So, to improve our chances of finding a protein-coding region in a DNA sequence we create all possible translations of the sequence.

Introducing the codon table!

This is the standard codon table; we will cover other codon tables later on.

Looking at the table you can already see that the code is redundant: more than one codon can code for the same amino acid.

For example, 6 codons encode Serine (UCU, UCC, UCA, UCG, AGU and AGC).

We can also see that sometimes a single nucleotide difference is all that distinguishes one amino acid from another.

For example, if we substituted the middle C for an A in Proline (CCU) we would get Histidine (CAU).
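
Both observations can be checked programmatically. The sketch below builds the standard genetic code as a Python dictionary (the names `CODON_TABLE` and `codons_for` are chosen for this example only) and then groups the codons by amino acid, using the one-letter amino acid codes with `*` for a stop codon.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in RNA letters, built in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode ("*" marks a stop codon).
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

print(codons_for["S"])  # Serine: ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC']
print(codons_for["M"])  # Methionine: ['AUG']

# A single-nucleotide change in the middle of a Proline codon gives Histidine.
print(CODON_TABLE["CCU"], "->", CODON_TABLE["CAU"])  # P -> H
```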

Importantly, we can use this codon table to decode a DNA sequence into a sequence of amino acids.
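
As a rough sketch of what decoding looks like in practice, the function below walks along a DNA string three nucleotides at a time and looks each codon up in the standard table. The function name `translate` and the test sequence are assumptions made for this example. Here a trailing partial codon is simply ignored, and a codon containing an unrecognised base is reported as `X`.

```python
from itertools import product

# Standard genetic code in DNA letters (T instead of U), T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Decode a DNA string codon by codon; a trailing partial codon is ignored."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        protein.append(CODON_TABLE.get(codon, "X"))  # X = codon with an unknown base
    return "".join(protein)

print(translate("ATGGCCTAA"))  # MA*
```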

So we have our table to decode the DNA. However, the longest orf could be read starting from nucleotide position 1, 2 or 3, in either the forward or reverse orientation.

This might sound like a bit of a nightmare but luckily, as each amino acid is encoded by a triplet of nucleotides, there are only 6 possible frames to consider!
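
To make the six frames concrete, here is a minimal sketch that splits a sequence into codons for three offsets on the given strand and three on its reverse complement. The helper names and the frame labels (`+1` to `-3`) are assumptions for this example. Tools may number the reverse frames differently, so treat the labels as local to this sketch.

```python
# Map each DNA base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def codons(dna, offset):
    """Split dna into complete triplets, starting at the given offset (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]

def six_frames(dna):
    """Return the codons of all six reading frames, keyed by a frame label."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = codons(seq, offset)
    return frames

# Hypothetical 10-nucleotide sequence, used only to show the splitting.
for label, triplets in six_frames("ATGAAATAGC").items():
    print(label, " ".join(triplets))
```

Starting at a fourth position would only repeat the first frame minus its first codon, which is why three offsets per strand are enough.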

It’s time to make our best estimate of the most likely translation by identifying the longest open reading frame (orf).

We can’t watch a ribosome doing its job (yet?) but there are tools that can help us predict the most likely amino acid sequence.

Example: Finding the longest ORF

Consider the single short DNA sequence below.

The first codon of each frame, as it would be read by the ribosome, is marked in square brackets. For frames 4 to 6 the marked triplet is read from the opposite strand (as its reverse complement), and the frame numbering follows the transeq output shown further down.

Note, sequence data is usually found as DNA not RNA sequence in sequence databases – even though in reality the ribosome will read this sequence in its RNA form (so imagine U in the place of T).

(1) [TAT]GGTACTAGATCTATTATAT

(2) T[ATG]GTACTAGATCTATTATAT

(3) TA[TGG]TACTAGATCTATTATAT

(4) TATGGTACTAGATCTATT[ATA]T

(5) TATGGTACTAGATCTATTA[TAT]

(6) TATGGTACTAGATCTAT[TAT]AT

Let’s plug this sequence into the EMBOSS tool transeq to get an amino acid sequence translation for all 6 frames.

If you want to try out this example you can use the online tool available here.

>EMBOSS_001_1
YGTRSIIX
>EMBOSS_001_2
MVLDLLY
>EMBOSS_001_3
WY*IYYX
>EMBOSS_001_4
YNRSSTI
>EMBOSS_001_5
I**I*YHX
>EMBOSS_001_6
IIDLVPX

This is a short example but we can see already that there are in-frame stop codons which shorten the length of the protein product in frames 3 and 5.

Remember, in this simplified example the longer the amino acid sequence the more likely it is to be the correct translation.

We can also see that frames 1, 4 and 6 do not start with Methionine (M or Met), the amino acid encoded by the start codon.

The presence of a start codon gives us another hint that we are looking at the most likely translation (when looking at coding DNA).

In this case frame 2 represents the most likely translation.
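
As a quick check of this reasoning before the fuller Python treatment, the sketch below translates the sequence that produces the transeq output above in all six frames and keeps the longest stretch that starts with M and is not interrupted by a stop. It is a simplified illustration under stated assumptions: the function names are made up, input is assumed to be uppercase A, C, G and T only, partial codons are dropped rather than shown as X, and frames are labelled by strand and offset rather than with transeq's 1 to 6 numbering.

```python
from itertools import product

# Standard genetic code in DNA letters, T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Yield (label, protein) for three offsets on each strand."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("forward", dna), ("reverse", reverse)):
        for offset in range(3):
            yield f"{strand} offset {offset}", translate(seq[offset:])

def longest_orf(dna):
    """Return (label, peptide) for the longest stretch that starts with M
    and runs to the next stop codon or the end of the sequence."""
    best = ("", "")
    for label, protein in six_frame_translations(dna):
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (label, segment[start:])
    return best

print(longest_orf("TATGGTACTAGATCTATTATAT"))  # ('forward offset 1', 'MVLDLLY')
```

Forward offset 1 means translation starts at the second nucleotide, which corresponds to frame 2 in the transeq output.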

Next, we will go over how to find the longest orf in a sequence using Python!