What is a FASTA file? - Sequence Gazing

FASTA files

DNA, RNA and amino acid sequences are often stored in the FASTA file format. So what is a FASTA file?

The FASTA format consists of a single ‘header’ line used as a descriptor, followed by the sequence data as characters. The FASTA format is also called the Pearson format, after William Pearson, an author of FASTA, an alignment software package.

The header of a FASTA file usually contains a unique identifier or ‘accession number’ that links the sequence to a database such as GenBank, along with other descriptive information.

The first character of the header in a FASTA file is always the greater-than symbol (“>”). The example below shows a simple made-up FASTA file:
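As a minimal sketch, a made-up record can be written as a Python string and split into its header and sequence. The record contents and the function name `read_single_fasta` are invented for illustration and do not come from any real database or library.

```python
# A made-up FASTA record: one header line starting with ">",
# followed by sequence data wrapped over two lines.
EXAMPLE_FASTA = """>seq1 made-up example sequence
ATGCGTACGTTAGCCTAGGA
TTCGATCGGCTAAGCTTACG
"""


def read_single_fasta(text):
    """Return (header, sequence) for a FASTA string holding one record."""
    # Strip whitespace and ignore blank lines.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")
    header = lines[0][1:]          # drop the leading ">"
    sequence = "".join(lines[1:])  # join the wrapped sequence lines
    return header, sequence


header, sequence = read_single_fasta(EXAMPLE_FASTA)
print(header)
print(sequence)
```

The header comes back without the “>” symbol, and the two sequence lines are joined into one continuous string.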

Note that the header information must be on a single line, but the sequence data can be spread over multiple lines.

There should also be no spaces between the characters of sequence data on each line (i.e. each line is one continuous string).

It is recommended to keep each line of a FASTA file to 80 characters or fewer. This was originally to make the sequences more readable to the human eye.
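The line-length recommendation can be sketched in a few lines of Python. This is only an illustration: the helper names `wrap_sequence` and `format_fasta` are hypothetical, and the default width of 80 simply follows the recommendation above.

```python
def wrap_sequence(sequence, width=80):
    """Split a sequence string into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def format_fasta(header, sequence, width=80):
    """Build FASTA text: a single header line, then wrapped sequence lines."""
    lines = [">" + header] + wrap_sequence(sequence, width)
    return "\n".join(lines) + "\n"


# A made-up 200-character sequence is wrapped into lines of 80, 80 and 40.
record = format_fasta("seq1 made-up example", "ACGT" * 50)
print(record)
```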

These specifications may not seem important, but many bioinformatics tools rely on files following the FASTA format in order to read them correctly.

Sometimes we need to adjust the headers of FASTA files when they contain ‘illegal characters’, such as the “*” symbol, that a program may not be designed to handle. For this reason I have written a few scripts that can edit these headers; you can find them in the Sequence-tools repo.
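To give a rough idea of what such a header edit can look like, here is a small sketch. It is not one of the scripts from the Sequence-tools repo: the function name `clean_header` and the set of allowed characters are assumptions, and which characters count as ‘illegal’ depends on the program you are using.

```python
import re


def clean_header(header, replacement="_"):
    """Replace every character outside an assumed 'safe' set.

    The safe set used here (letters, digits, underscore, dot, hyphen
    and space) is an assumption chosen for illustration only.
    """
    return re.sub(r"[^A-Za-z0-9_.\- ]", replacement, header)


# A made-up header (without the leading ">") containing "*" and "|".
print(clean_header("seq1*partial|copy"))
```

Here the “*” and “|” characters are each replaced with an underscore, while the rest of the header is left untouched.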

Multi-FASTA files

FASTA files can also contain more than one sequence entry. These are called multi-FASTA files, and each header must start on a new line, for example:
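As a sketch of how a multi-FASTA file can be read, the code below starts a new record every time it meets a line beginning with “>”. The two records and the function name `parse_fasta` are made up for illustration.

```python
# Two made-up records in one multi-FASTA string.
MULTI_FASTA = """>seq1 first made-up sequence
ATGCGTACGTTAGC
CTAGGATTCG
>seq2 second made-up sequence
GGCTAAGCTTACGA
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from multi-FASTA text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    # Store the final record once the input ends.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


for header, sequence in parse_fasta(MULTI_FASTA):
    print(header, len(sequence))
```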

Real examples

The most common file extension for the FASTA format is .fasta, often shortened to .fa. There are also more specific extensions, which are useful because they tell us what kind of data the FASTA file contains without having to open it. These include .fna and .faa.
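The extension convention can be captured in a small lookup table. This is only a sketch: the dictionary, the function name `describe_fasta_file` and the example filenames are hypothetical, the .fna and .faa meanings are the ones described in this section, and an extension is a hint rather than a guarantee of what a file contains.

```python
from pathlib import Path

# Assumed mapping from extension to the kind of sequence data expected.
EXTENSION_CONTENT = {
    ".fasta": "any sequence type",
    ".fa": "any sequence type",
    ".fna": "nucleotides",
    ".faa": "amino acids",
}


def describe_fasta_file(filename):
    """Guess the kind of sequence data from the file extension alone."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_CONTENT.get(suffix, "unknown")


# Hypothetical filenames, used only to show the lookup.
print(describe_fasta_file("example_gene.fna"))
print(describe_fasta_file("example_protein.faa"))
```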

The .fna extension specifies a file of nucleotides. For example, below is the DNA sequence of a mitochondrial gene from a parasitic worm species, Schistosoma turkestanicum.

The .faa extension specifies a file of amino acid sequences. For example, below is the translation of our nucleotide FASTA file as a sequence of amino acids:

The complete files of these examples can be found in the sequence gazing repo here.

Next we will explore the Linux command line and how to edit FASTA files in the terminal!