Finding the longest orf and in-frame stop codons using Python

This post will go through how to find the longest open reading frame (orf) from all 6 translated frames and how to identify in-frame stop codons using Python.

Let’s start with a real example!

Example: Translating a schistosome protein

Below are the protein translations for all 6 frames of a protein taken from a schistosome genome.

>STUR.scaffold.51844.2315_42739.2754_1
MALGCGYKCLQCLLIIFNCGAFNACLLTICGLALIVVGALGLHSVVNHWKDIEPPLQSLI
IFIIVLGCFLFVLGGLGMFGACTKNVCLLTMYCILLSVLIILEIAAGIFAIIEKPKVKNH
VTYALKEFIEEYREEKHVSRVLAEIQQKLECCGAESPKDYGDSIPDSCYRDGKQFTEGCI
KKVSEISKKHLNAIIVSVFLFALVQCTCLIFAVCVLLAIKRGNDE*SYITKNIFCENNYE
Q
>STUR.scaffold.51844.2315_42739.2754_2
WLLVVVTSACSVY*LFSTVEPSMLVY*PYVVLLSLWLVRLDYILLSITGKTLNLRFNHSS
SSLLSSDVSCSF*EVWECLVHARRMSVC*LCTVFFCQF**S*K*LQEYLR**KNLRSKIT
SHMH*KNS*KNTGKKSMLAEFLLKSNKNWNVVALSLRRIMEIRYRTRATETENNSQRDVS
KKSVK*AKST*MLS*SAYFYLHWYNVRA*YSQYVFYWP*SVVMMNRVT*QRIYFVKIITN
X
>STUR.scaffold.51844.2315_42739.2754_3
GSWLWLQVLAVFINYFQLWSLQCLFIDHMWSCSHCGWCAWTTFCCQSLERH*TSASITHH
LHYCPRMFLVRFRRFGNVWCMHEECLFVDYVLYSFVSFDNLRNSCRNICDNRKT*GQKSR
HICIKRIHRRIQGRKAC*QSSC*NPTKIGMLWR*VSEGLWRFDTGLVLPRRKTIHRGMYQ
KSQ*NKQKALKCYHSQRISICIGTMYVLNIRSMCSIGHKAW***IELHNKEYIL*K*LRT
X
>STUR.scaffold.51844.2315_42739.2754_4
LFVIIFTKYILCYVTLFIITTLYGQ*NTYCEY*ARTLYQCK*KYADYDSI*VLFAYFTDF
FDTSLCELFSVSVARVRYRISIILRRLSATTFQFLLDFSKNSANMLFFPVFFYEFF*CIC
DVIFDLRFFYYRKYSCSYF*DYQN*QKNTVHSQQTDILRACTKHSQTS*NEQETSEDNNE
DDE*LKRRFNVFPVIDNRM*SKRTNHNESKTTYGQ*TSIEGSTVENN**TLQALVTTTKS
H
>STUR.scaffold.51844.2315_42739.2754_5
VRNYFHKIYSLLCNSIHHYHALWPIEHILRILSTYIVPMQIEIR*L**HLSAFCLFH*LF
*YIPL*IVFRLGSTSPVSNLHNPSETQRHNIPIFVGFQQELC*HAFLPCILL*ILLMHM*
RDF*P*VFLLSQIFLQLFLRLSKLTKEYST*STNRHSSCMHQTFPNLLKRTRNIRGQ**R
**VIEAEVQCLSSD*QQNVVQAHQPQ*EQDHIWSINKH*RLHS*K*LINTASTCNHNQEP
X
>STUR.scaffold.51844.2315_42739.2754_6
CS*LFSQNIFFVM*LYSSLPRFMANRTHTANIKHVHCTNANRNTLTMIAFKCFLLISLTF
LIHPSVNCFPSR*HESGIESP*SFGDSAPQHSNFCWISARTLLTCFSSLYSSMNSFNAYV
T*FLTLGFSIIANIPAAISKIIKTDKRIQYIVNKQTFFVHAPNIPKPPKTNKKHPRTIMK
MMSD*SGGSMSFQ*LTTECSPSAPTTMRARPHMVNKQALKAPQLKIINKHCKHL*PQPRA
X

Exploring by eye

In amino acid sequences the stop codon is usually encoded by a * symbol.

Looking at the sequences we can already see that frame 1 has the longest contiguous string of amino acid code (hint: look for the sequence with the fewest * symbols).

It also, even more convincingly, begins with the M amino acid – to recap on start codons and translation see the last post on Translating open reading frames (orfs).

But wait!

Isn’t that a stop codon * symbol inside the translated sequence of frame 1? Usually these go at the end of the sequence, right?

Seleno-cysteine and in-frame stop codons

Stop codons are encoded by UAA or UGA however there is another less well known amino acid which is also encoded by UGA called Selenocysteine.

Selenocysteine has the same structure as cysteine but with an atom of Selenium taking in the place of Sulfur.

This amino acid, termed the ‘21st amino acid‘ does not have it’s own letter code and instead is translated to a * symbol by existing translation programs.

Some species have proteins with selenocysteine and when they are translated as stop codons they can be missed or deleted from the sequence.

I wanted to identify these in-frame stop codons and find the longest open reading frames but I didn’t want to manually sort through hundreds of proteins by eye.

I couldn’t find an existing tool that would do this for me without removing the in frame stop codons so I made a script longest_orf.py.

The code

Below is the code I wrote a while back for the script longest_orf.py.

A version with type annotations and example sequences can be found on GitHub here.

import sys

''' splits each translated frame by stop codons and counts the resulting substrings '''
def get_n_substrings(split_frames):
    counts = []
    for seq in split_frames:
        each_frame = seq.split("\n")
        wo_header = "".join(each_frame[1:]).rstrip()
        substrings = wo_header.split("*")
        counts.append(len(substrings))
    return counts

''' finds the translation with the minimum number of substrings '''
def longest_orf_final(substring_counts, split_frames):
    min_count = substring_counts.index(min(substring_counts))
    Longest_orf = ">" + split_frames[min_count]
    return Longest_orf

''' check the number of input variables '''
def main():
    if len(sys.argv) != 3:
        print("Usage: python longest_orf.py <transeq_output> <seq_id> > <out.filename>")
        sys.exit(1)

''' Input translated frames and sequence id '''
transeq_output = sys.argv[1]
seq_id = sys.argv[2]

try:
    frames = open(transeq_output, "r").readlines()
except IOError:
    print(f"Error: Cannot open file {transeq_output}")
    sys.exit(1)

''' split to remove headers, get the longest ORF '''
f1 = "".join(frames).split(">")[1:]
result = longest_orf_final(get_n_substrings(f1), f1)
Final = "".join(result.split("\n")[1:])

''' If there is an in-frame stop codon, print the position in a new file '''
if Final.endswith("*"):
    print(">" + seq_id + "\n" + Final.replace("*",""))
elif "*" in Final:
    with open(seq_id + "_" + "in_frame_stop_codon.txt", "w") as x:
        seq_nohead = result.split("\n")[1:]
        stop_pos = str(seq_nohead).index("*")
        x.write("In frame stop codon at:" + " " + str(stop_pos))
    print(">" + seq_id + "\n" + Final)
else:
    print(">" + seq_id + "\n" + Final)
    

if __name__ == "__main__":
    main()

How does it work?

The script takes a multi-fasta file and a sequence id as input.

The sequence id in this case is the name of your protein and will be used as the sequence header and to name the output txt file.

The function get_n_substrings splits each translation by the stop codon characters (*) and returns the number of sub strings.

This information is fed to longest_orf_final which retrieves the sequence with the smallest number of splits i.e the longest contiguous sequence.

Note that to save the sequence of your longest orf you can redirect the output to file like this:

python3 longest_orf.py test_frames.faa test > seq_id.faa

Finally, if our longest orf translation has an in-frame stop codon the position of this codon in the sequence is printed to a text file.

Some thoughts

I found that counting the number of sub strings was more reliable than counting the number of * symbols in a given sequence when considering the presence of selenocysteine.

eg: Imagine this scenario.

Below we have two sequences both with 2 * symbols.

>seq_1

MAL*GCGYKC*

>seq_2

MVL*V*VVT

We see that seq_1 is more likely to be a real translation with the stop codon at the end of the translation, with the first * we encounter potentially a selenocysteine. The number of substrings here is 2.

Looking at seq_2 we have two * characters but the number of substrings here is 3. So in this case the number of substrings is more informative than the number of * characters.

This method also, so far, does not require measuring the length of the substrings themselves so it’s fast.

There is another scenario:

>seq_1

MAL*GCGYKC*

>seq_2

MVL**VVVT

Which would give us both two * characters and two substrings, however, interestingly I have not yet come across an issue with this as the translations with two ** in a row usually also have a large number of substrings.

I have not yet come across a sequence that was incorrectly selected as the longest orf by this code but it may still happen, please, tell me if it does!