A k-mer is a substring of length k taken from a biological sequence, where k can be any positive integer up to the length of the sequence.
If you are familiar with sliding windows, it’s a similar principle! Here, k is the window size and the step is 1.
An example:
Take a look at the short DNA sequence below. Imagine moving along it, extracting a substring at each step and shifting one base forward each time.
ATGCTGA
To illustrate, let’s generate all possible 2-mers (overlapping substrings of length 2) from this sequence:
ATGCTGA -> AT
ATGCTGA -> TG
ATGCTGA -> GC
ATGCTGA -> CT
ATGCTGA -> TG
ATGCTGA -> GA
Now we’ll generate all possible 3-mers (overlapping substrings of length 3):
ATGCTGA -> ATG
ATGCTGA -> TGC
ATGCTGA -> GCT
ATGCTGA -> CTG
ATGCTGA -> TGA
Next, let’s generate all possible 4-mers (overlapping substrings of length 4):
ATGCTGA -> ATGC
ATGCTGA -> TGCT
ATGCTGA -> GCTG
ATGCTGA -> CTGA
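The sliding-window extraction shown above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `kmers` is made up for this example, and it assumes the sequence is a plain string.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, shifting one base at a time."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A sequence of length n has n - k + 1 windows of size k
    # (the range is empty when k is longer than the sequence).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "ATGCTGA"
for k in (2, 3, 4):
    print(k, kmers(seq, k))
```

Running this on ATGCTGA gives the same 2-mers, 3-mers and 4-mers listed above, in the same left-to-right order.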
K-mer patterns
There are two key patterns to notice here:
1) The longer the k-mer, the fewer substrings we can extract from the sequence.
2) The longer the k-mer, the more likely it is to be unique.
Short k-mers, like our 2-mers, repeat more often. We see this in our example, where the 2-mer ‘TG’ appears twice.
Increasing the length of our k-mers increases uniqueness (see Table 1).
Our 3-mers are all unique, though some share similarities (for example, ‘TGC’ and ‘TGA’ both start with TG).
When we increase the length further and generate all 4-mers, we find each of them is unique.
K-mer length | Number of k-mers | Unique k-mers | Similarities |
---|---|---|---|
2 | 6 | 5 | repetition (TG occurs twice) |
3 | 5 | 5 | partial similarities (e.g., TGC & TGA share the prefix TG) |
4 | 4 | 4 | all unique |
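The counts in the table can be checked with a short Python sketch. The helper name `kmer_counts` is hypothetical and chosen only for this example; it uses `collections.Counter` from the standard library to tally how often each k-mer occurs.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count how many times each overlapping k-mer occurs in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


seq = "ATGCTGA"
for k in (2, 3, 4):
    counts = kmer_counts(seq, k)
    total = sum(counts.values())   # number of k-mers extracted
    unique = len(counts)           # number of distinct k-mers
    repeated = sorted(kmer for kmer, n in counts.items() if n > 1)
    print(f"k={k}: {total} k-mers, {unique} unique, repeated: {repeated}")
```

For k=2 this reports 6 k-mers, 5 of them unique, with TG as the only repeat; for k=3 and k=4 nothing repeats.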
Why do we see these patterns?
Short k-mers are more likely to repeat by chance, as there are fewer possible combinations of the four bases (A, T, C and G).
Longer k-mers have more possible combinations of the four bases and, as a result, tend to be more distinctive.
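To put numbers on this: with four bases, each position in a k-mer has four options, so there are 4^k possible k-mers of length k. The sketch below computes that count and, as a sanity check, enumerates the k-mers with `itertools.product`. The function names are made up for this example, and it assumes only the four standard DNA bases (no ambiguity codes such as N).

```python
from itertools import product

BASES = "ACGT"  # assumption: only the four standard DNA bases


def count_possible_kmers(k):
    """Number of distinct k-mers that can be built from the four bases."""
    return len(BASES) ** k


def all_possible_kmers(k):
    """Enumerate every possible k-mer (only practical for small k)."""
    return ["".join(p) for p in product(BASES, repeat=k)]


for k in (2, 3, 4):
    print(f"k={k}: {count_possible_kmers(k)} possible k-mers")

# The enumeration agrees with the formula.
assert len(all_possible_kmers(2)) == count_possible_kmers(2)
```

There are only 16 possible 2-mers but 256 possible 4-mers, which is why a short sequence is far more likely to contain a repeated 2-mer than a repeated 4-mer.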
Understanding how k-mer length affects uniqueness and similarity is the first step to understanding how k-mers are used in bioinformatics.
In the next post, we’ll look at what makes k-mers so powerful in genomics!