Python String Operations: Examples with DNA

What is a string?

Python is particularly good for working with strings. In this post we introduce some of Python’s powerful built-in features and functions for string operations.

But first, what is a string? A string is a sequence of characters used to represent human-readable data in the form of text. In Python, strings can be written and created in several ways:

  1. Single-Quoted String:
   single_quoted_string = 'This is a single-quoted string.'
  2. Double-Quoted String:
   double_quoted_string = "This is a double-quoted string."
  3. Triple-Quoted String (Multiline String):
   multiline_string = '''This is a
   multiline string.'''
  4. Raw String:
   raw_string = r'This is a raw string. It ignores escape characters like \n.'
  5. Formatted String (f-string):
   name = 'Isaac'
   age = 23
   formatted_string = f'My name is {name} and I am {age} years old.'
  6. Unicode String:
   unicode_string = 'Python is awesome for strings! \u2728'
  7. Byte String:
   byte_string = b'This is a byte string.'
  8. Encoded String:
   utf_string = 'This is a UTF-8 encoded string.'
   encoded_string = utf_string.encode('utf-8')
  9. Decoded String:
   decoded_string = encoded_string.decode('utf-8')
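
To see how these forms relate to each other, here is a minimal sketch (the variable names are only illustrative). It checks that the quoting style does not change the value, that a raw string keeps the backslash as an ordinary character, and that encoding produces a bytes object which decodes back to the original text.

```python
# Quote style does not matter: both literals create the same str value.
single = 'DNA'
double = "DNA"
same_value = single == double

# A raw string keeps the backslash and the "n" as two separate characters.
normal = 'A\nT'   # 3 characters: A, newline, T
raw = r'A\nT'     # 4 characters: A, backslash, n, T
lengths = (len(normal), len(raw))

# Encoding turns text (str) into bytes; decoding reverses the step.
text = 'ATCG \u2728'
encoded = text.encode('utf-8')
decoded = encoded.decode('utf-8')

print(same_value, lengths)
print(type(text).__name__, type(encoded).__name__, decoded == text)
```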

The Language of DNA

The DNA molecule has two strands, and each strand is composed of a sequence of nucleotides. Each nucleotide consists of a phosphate group, a sugar molecule (deoxyribose), and one of four nitrogenous bases: adenine, thymine, cytosine and guanine.

To make nucleotide sequences human-readable, geneticists represent the nitrogenous bases with the letters “A”, “T”, “C” and “G”.

Interpreting the patterns of these letters in genetic sequences can reveal important information about how DNA evolves and functions.

Because DNA sequences written this way are strings, we can use Python to explore, transform and edit them. Let’s look at an example:
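
Before transforming a sequence, it can help to explore it. The following is a minimal sketch that uses only built-in string features (len(), count(), find() and the in operator); the motifs "CG" and "GAT" are arbitrary choices for illustration.

```python
dna_sequence = "ATCGATCGTA"

# Length of the sequence in bases
length = len(dna_sequence)

# How often each base occurs
base_counts = {base: dna_sequence.count(base) for base in "ATCG"}

# GC content: the fraction of bases that are G or C
gc_content = (base_counts["G"] + base_counts["C"]) / length

# Index of the first occurrence of a motif (find() returns -1 if it is absent)
first_cg = dna_sequence.find("CG")

# Membership test for a motif
has_motif = "GAT" in dna_sequence

print(length, base_counts, gc_content, first_cg, has_motif)
```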

Transforming DNA sequences

In the DNA molecule, the two strands are held together by hydrogen bonds between specific pairs of bases: A bonds with T and C bonds with G.

When DNA is replicated, each strand serves as a template for a new strand with the complementary sequence.

Here, we mimic this step by building a complementary DNA strand with Python’s replace() string method.

# DNA Example
dna_sequence = "ATCGATCGTA"
print("DNA Sequence:", dna_sequence)

# String Operations
complementary_dna = dna_sequence.replace("A", "t").replace("T", "a").replace("C", "g").replace("G", "c").upper()
print("Complementary DNA:", complementary_dna)

To avoid replacing a base that has already been swapped (for example, turning a newly written T back into an A), we use lowercase letters for the replacement characters and convert everything back to uppercase at the end with the upper() method.

The result is the complement of the original: each base is replaced by its pairing partner, position by position. For ATCGATCGTA this gives TAGCTAGCAT.
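
An alternative to chaining replace() calls is a translation table, which swaps every base in a single pass and so does not need the lowercase trick. This is a sketch of that approach using the built-in str.maketrans() and translate(), not the method used above. It also reverses the complement with slicing, since the complementary strand runs in the opposite direction and is often written as the reverse complement.

```python
dna_sequence = "ATCGATCGTA"

# Map each base to its pairing partner: A<->T and C<->G
complement_table = str.maketrans("ATCG", "TAGC")

# translate() applies the whole mapping in one pass
complementary_dna = dna_sequence.translate(complement_table)

# A slice with step -1 reverses the string
reverse_complement = complementary_dna[::-1]

print("Complementary DNA:", complementary_dna)
print("Reverse complement:", reverse_complement)
```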

Splicing a sequence

As part of gene expression, pre-messenger RNA (pre-mRNA) is edited to produce the mature mRNA molecule via RNA splicing.

This involves the removal of introns, the non-coding regions, from the pre-mRNA.

If we are interested in studying the DNA sequence that codes for the protein of a gene, we will need to mimic the RNA splicing process.

We can do this by extracting the exons, or protein-coding regions, of the DNA.

In this example we know where each exon starts and ends. Using Python’s built-in slice syntax and the + operator, we can extract the DNA that corresponds to the exons and concatenate the pieces. Note that Python indices start at 0 and that the stop index is not included in the slice:

# Original DNA Sequence
dna_sequence = "ATCGATCGTATTGATCATTAAGTGTAGATA"

# Define start and stop indices 
ex1_start = 4
ex1_stop = 10

ex2_start = 16
ex2_stop = 20

# Slice out the exon sequences and concatenate them
exon_sequences = dna_sequence[ex1_start:ex1_stop] + dna_sequence[ex2_start:ex2_stop]

# Print the result
print(exon_sequences)  # ATCGTAATTA
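
With more than two exons, separate start and stop variables become unwieldy. One possible generalisation, sketched here with a hypothetical exon_coordinates list of (start, stop) pairs, is to slice each exon in a loop and join the pieces:

```python
dna_sequence = "ATCGATCGTATTGATCATTAAGTGTAGATA"

# Hypothetical exon coordinates as (start, stop) pairs.
# Indices are 0-based and the stop index is excluded, as in the example above.
exon_coordinates = [(4, 10), (16, 20)]

# Slice out each exon, then join the pieces into one string
exons = [dna_sequence[start:stop] for start, stop in exon_coordinates]
exon_sequences = "".join(exons)

print(exons)
print(exon_sequences)
```

Using "".join() instead of repeated + keeps the code the same length however many exons are listed.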

In the next few posts I will introduce some ready-made Python scripts for handling DNA sequences!