What is a substitution?
A substitution is a type of mutation in which a single nucleotide base (A, C, G, or T) is replaced by another.
We can observe substitutions in sequence data by aligning corresponding DNA sequences (from different species or individuals) and identifying the positions where the bases differ.
An example:
In the two sequences below, the A (adenine) at position 4 has been replaced by a C (cytosine).
>Sequence1
ATGATGACGTA
>Sequence2
ATGCTGACGTA
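The comparison above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two sequences are already aligned, have the same length, and contain no gap characters; the function name `find_substitutions` is made up for this example.

```python
def find_substitutions(seq1, seq2):
    """Return (position, base_in_seq1, base_in_seq2) for every mismatch.

    Positions are 1-based, matching how the example above counts sites.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq1, seq2), start=1)
        if a != b
    ]


sequence1 = "ATGATGACGTA"
sequence2 = "ATGCTGACGTA"
print(find_substitutions(sequence1, sequence2))  # [(4, 'A', 'C')]
```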
The chemistry of DNA bases
There are two types of substitutions: transitions and transversions.
Transitions are substitutions that change (A<->G) or (C<->T). They exchange bases with similar ring structures (a purine for a purine, or a pyrimidine for a pyrimidine), which makes them more likely to happen during DNA replication, so they occur more frequently than transversions.
Transversions are substitutions that change (G<->T), (C<->G), (C<->A), or (T<->A).
There are more types of transversions than transitions, but they happen less frequently, because exchanging a double-ring structure (a purine) for a single-ring structure (a pyrimidine), or vice versa, is less likely.
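As a rough sketch, the rule can be written as a check on whether both bases belong to the same chemical class. The function name `classify_substitution` is hypothetical and only the four standard DNA bases are handled.

```python
from itertools import combinations

PURINES = {"A", "G"}      # double-ring bases
PYRIMIDINES = {"C", "T"}  # single-ring bases


def classify_substitution(old, new):
    """Label a change between two different bases as a transition or transversion."""
    old, new = old.upper(), new.upper()
    if not {old, new} <= PURINES | PYRIMIDINES:
        raise ValueError("expected bases from A, C, G, T")
    if old == new:
        raise ValueError("the two bases are identical, so nothing was substituted")
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Enumerate the six unordered base pairs: two are transitions, four are transversions.
for a, b in combinations("ACGT", 2):
    print(f"{a}<->{b}: {classify_substitution(a, b)}")
```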

A DNA substitution within a codon (remember that each amino acid is coded for by three nucleotides) can change the amino acid sequence, depending on the codon and on which of its three positions is affected.
Transitions are less likely than transversions to result in amino acid substitutions, and are more likely to be identified as single nucleotide polymorphisms (SNPs).
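One way to check this claim is to enumerate every possible single-base change in every amino-acid-coding codon and ask how often the amino acid stays the same. The sketch below assumes the standard genetic code, written in the compact TCAG ordering, and counts a change to a stop codon as not synonymous; it weights every possible change equally, which real mutation does not do. The function name `synonymous_fractions` is made up for this example.

```python
from itertools import product

# Standard genetic code in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def synonymous_fractions():
    """Fraction of single-base changes that keep the amino acid, by substitution type."""
    counts = {"transition": [0, 0], "transversion": [0, 0]}  # [synonymous, total]
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue  # only start from codons that encode an amino acid
        for position in range(3):
            for new_base in BASES:
                if new_base == codon[position]:
                    continue
                mutant = codon[:position] + new_base + codon[position + 1:]
                is_transition = TRANSITION_PARTNER[codon[position]] == new_base
                kind = "transition" if is_transition else "transversion"
                counts[kind][1] += 1
                if CODON_TABLE[mutant] == amino_acid:
                    counts[kind][0] += 1
    return {kind: syn / total for kind, (syn, total) in counts.items()}


for kind, fraction in synonymous_fractions().items():
    print(f"{kind}: {fraction:.2f} of possible changes are synonymous")
```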
What are substitution models?
Substitution models are mathematical models that describe the rates and patterns of nucleotide or amino acid substitutions that occur over time.
The first substitution model was created by Margaret Dayhoff and colleagues for amino acid sequences.
The parameters of these models include the rates of different types of substitutions (e.g. transitions vs transversions), as well as the equilibrium frequencies of the different nucleotides or amino acids.
Ultimately, substitution models specify how likely one base is to change into another over time.
Why do we need substitution models?
When we sequence a gene or genome we are taking a snapshot of DNA information in time.
As a result, and without the benefit of a time machine, we cannot observe all of the historic changes in sequence data.
Hidden states in sequence data
If we want to use sequence data to understand the evolution of organisms we need to estimate how the sequence data might have changed in the past based on what we observe in the present.
The annotated example below shows two lineages diverging from a common ancestor:

Figure 2 illustrates that different substitutions have occurred in both lineages since they diverged from their common ancestor.
Importantly, what we see in the present day does not represent the full evolutionary history, i.e. some of these substitutions are hidden from us.
The intermediate sequences in this figure can be thought of as hidden states in a substitution model: they result from substitutions that happened in the past, which we can’t observe and can only infer.
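A small simulation can make the idea of hidden substitutions concrete. The sketch below is an illustration under simple assumptions, not a model fitted to any data: every site evolves independently, substitutions arrive at rate 1 per unit of branch length, and each substitution picks one of the other three bases with equal probability. The function name `simulate_site` and the branch length of 1.0 are choices made for this example.

```python
import random

BASES = "ACGT"


def simulate_site(start, branch_length, rng):
    """Return every state one site passes through along a branch.

    The first entry is the ancestral base and the last is the base we
    would observe today; anything in between is hidden from us.
    """
    history = [start]
    elapsed = 0.0
    while True:
        elapsed += rng.expovariate(1.0)  # waiting time until the next substitution
        if elapsed > branch_length:
            return history
        current = history[-1]
        history.append(rng.choice([b for b in BASES if b != current]))


rng = random.Random(1)  # fixed seed so the run is repeatable
ancestor = "ATGATGACGTA"
histories = [simulate_site(base, 1.0, rng) for base in ancestor]
descendant = "".join(history[-1] for history in histories)

actual_substitutions = sum(len(history) - 1 for history in histories)
observed_differences = sum(a != d for a, d in zip(ancestor, descendant))

print("ancestor:  ", ancestor)
print("descendant:", descendant)
print("substitutions that happened:", actual_substitutions)
print("differences we can observe: ", observed_differences)
```

Whenever a site changes more than once, or changes back to its original base, the number of observable differences falls below the number of substitutions that actually happened.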
Assumptions of sequence evolution
If we were to simply count the raw differences between sequences (e.g. the Hamming distance), we would implicitly be making the following assumptions:
1) All substitutions are equally likely.
2) All sites evolve at the same rate.
3) The frequency of nucleotide bases is equal.
4) Each observed difference represents exactly one historical substitution.
We know from studying sequence data and the chemistry of DNA that these assumptions are not always true.
As a result, we are likely to misestimate (and typically underestimate) the number of substitutions per site, a value that phylogenetic trees display as their branch lengths.
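To see the size of the effect, the raw proportion of differing sites (often called the p-distance) can be compared with a distance corrected under the Jukes-Cantor model, one of the simplest substitution models. The correction used here is the standard Jukes-Cantor formula d = -3/4 ln(1 - 4p/3); it relaxes only assumption 4 and still treats all substitutions, sites and base frequencies as equal. The function names are made up for this sketch.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jc69_distance(p):
    """Jukes-Cantor estimate of substitutions per site from a p-distance."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ATGATGACGTA", "ATGCTGACGTA")
print(f"p-distance:      {p:.4f}")
print(f"JC69 distance:   {jc69_distance(p):.4f}")

# The gap between the two grows as sequences become more divergent.
for p in (0.05, 0.2, 0.4, 0.6):
    print(f"p = {p:.2f} -> corrected = {jc69_distance(p):.3f}")
```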
How can we model hidden evolution?
When we do not know which intermediate changes happened in the past, substitution models give us a way to calculate the probability of the possible changes using continuous-time Markov models.
Substitution models treat sequence evolution as a continuous-time Markov process along each branch of a tree, where the hidden states are inferred probabilistically.
In a continuous-time Markov model, states are discrete but time is continuous.
For DNA, each nucleotide is a state and substitutions are the events between states.
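The rates of change between states are collected in a matrix called the transition rate matrix, usually written Q (here "transition" means any change of state, not only A<->G or C<->T). Each off-diagonal entry is the instantaneous rate at which the system moves from one state to another, and the probabilities of change over a given amount of time are calculated from these rates.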
At any given time one nucleotide can change to another, according to the rates in the transition rate matrix.
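For a rate matrix Q, the probabilities of moving between states over a branch of length t are given by the matrix exponential P(t) = exp(Qt). The sketch below builds a Jukes-Cantor rate matrix and approximates the exponential with a truncated Taylor series in plain Python. The function names, the state order A, C, G, T and the scaling (each off-diagonal rate set to 1/3, so t is measured in expected substitutions per site) are assumptions made for this example, and the series is only suitable for short branches; in practice a library routine such as `scipy.linalg.expm` would be the usual choice.

```python
import math

STATES = "ACGT"


def jc69_rate_matrix(alpha=1 / 3):
    """Jukes-Cantor Q: every change has rate alpha; each row sums to zero."""
    return [
        [-3 * alpha if i == j else alpha for j in range(4)]
        for i in range(4)
    ]


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transition_probabilities(q, t, terms=40):
    """Approximate P(t) = exp(Q t) with the first `terms` Taylor series terms."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (Qt)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [
            [result[i][j] + term[i][j] for j in range(n)] for i in range(n)
        ]
    return result


p = transition_probabilities(jc69_rate_matrix(), t=0.5)
for state, row in zip(STATES, p):
    print(state, " ".join(f"{x:.4f}" for x in row))
```

Each row of the result is a probability distribution over the base observed at the end of the branch, given the base at its start.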
Different models, for example Jukes-Cantor (JC), Kimura (K80) and GTR (general time reversible), define the structure of the rate matrix according to constraints such as base frequencies (which are considered equal in the Jukes-Cantor model) and the ratio of transitions to transversions.
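To show how a model constrains the rate matrix, here is a sketch of a K80 matrix. It assumes one common parameterisation, in which `kappa` is the rate of a transition relative to the rate of a single transversion, base frequencies are equal, and the matrix is scaled so that the total rate of leaving any base is 1. The function name and the state order A, C, G, T are choices made for this example.

```python
STATES = "ACGT"
TRANSITION_PAIRS = {frozenset("AG"), frozenset("CT")}


def k80_rate_matrix(kappa):
    """Build a K80 rate matrix with rows and columns ordered A, C, G, T."""
    beta = 1.0 / (kappa + 2.0)  # rate of each individual transversion
    q = [[0.0] * 4 for _ in STATES]
    for i, x in enumerate(STATES):
        for j, y in enumerate(STATES):
            if i != j:
                is_transition = frozenset((x, y)) in TRANSITION_PAIRS
                q[i][j] = kappa * beta if is_transition else beta
        q[i][i] = -sum(q[i])  # diagonal makes the row sum to zero
    return q


for kappa in (1.0, 2.0):
    print(f"kappa = {kappa}")
    for state, row in zip(STATES, k80_rate_matrix(kappa)):
        print(" ", state, " ".join(f"{x:7.4f}" for x in row))
```

With `kappa = 1.0` every off-diagonal rate is the same, which is the Jukes-Cantor model as a special case; larger values make transitions faster than transversions.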
Below is a summary of various substitution models and their differences, taken from Posada and Crandall, 2001.

By using substitution models, we can describe the possible changes in a sequence and gain a more realistic estimate of the number of substitutions that occurred between lineages, and hence of how closely related they are.
We’ll focus more on substitution models and what’s going on under the hood in future posts!