What is a k-mer?

A k-mer is a substring of length k taken from a biological sequence, where k can be any positive integer up to the length of the sequence.

If you are familiar with sliding windows, it’s a similar principle: k is the window size and the step is 1.

An example:

Take a look at the short DNA sequence below. Imagine moving along it one base at a time and extracting a substring of length k at each position.

ATGCTGA

To illustrate, let’s generate all possible 2-mers (overlapping substrings of length 2) from this sequence:

ATGCTGA  ->  AT
ATGCTGA  ->  TG
ATGCTGA  ->  GC
ATGCTGA  ->  CT
ATGCTGA  ->  TG
ATGCTGA  ->  GA

Now we’ll generate all possible 3-mers (overlapping substrings of length 3):

ATGCTGA  ->  ATG
ATGCTGA  ->  TGC
ATGCTGA  ->  GCT
ATGCTGA  ->  CTG
ATGCTGA  ->  TGA

Next, let’s generate all possible 4-mers (overlapping substrings of length 4):

ATGCTGA  ->  ATGC
ATGCTGA  ->  TGCT
ATGCTGA  ->  GCTG
ATGCTGA  ->  CTGA
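The sliding-window extraction shown above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation; the function name `kmers` is hypothetical, chosen just for this example.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of `sequence`, in order of position."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    # Slide a window of width k along the sequence, one base at a time.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ATGCTGA", 2))  # ['AT', 'TG', 'GC', 'CT', 'TG', 'GA']
print(kmers("ATGCTGA", 3))  # ['ATG', 'TGC', 'GCT', 'CTG', 'TGA']
print(kmers("ATGCTGA", 4))  # ['ATGC', 'TGCT', 'GCTG', 'CTGA']
```

The window starts at every position from 0 to len(sequence) - k, which is why a sequence of 7 bases yields 6, 5 and 4 substrings for k = 2, 3 and 4.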

K-mer patterns

There are two key patterns to notice here:

1) The longer the k-mer, the fewer substrings we can extract from the sequence (a sequence of length L contains L - k + 1 overlapping k-mers).

2) The longer the k-mer, the more likely it is to be unique.

Short k-mers, like our 2-mers, repeat more often, as we see in our example, where the 2-mer ‘TG’ appears twice.

When we increase the length of our k-mers we find this increases uniqueness (see Table 1).

Our 3-mers are all unique, though some share similarities (for example, ‘TGC’ and ‘TGA’ both start with TG).

When we increase the length further and generate all 4-mers, we find each of them is unique.

K-mer length   Number of k-mers   Unique k-mers   Similarities
2              6                  5               repetition (TG occurs twice)
3              5                  5               partial similarities (e.g., TGC & TGA share sequence)
4              4                  4               all unique
Table 1. Summary of k-mer patterns
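The counts in Table 1 can be reproduced with a short Python sketch that tallies how often each k-mer occurs. The helper name `kmer_summary` is hypothetical and the approach (a `collections.Counter` over the sliding window) is just one simple way to do it.

```python
from collections import Counter


def kmer_summary(sequence, k):
    """Return (total k-mers, distinct k-mers, k-mers seen more than once)."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())  # equals len(sequence) - k + 1
    repeated = sorted(kmer for kmer, n in counts.items() if n > 1)
    return total, len(counts), repeated


for k in (2, 3, 4):
    print(k, kmer_summary("ATGCTGA", k))
# 2 (6, 5, ['TG'])
# 3 (5, 5, [])
# 4 (4, 4, [])
```

For k = 2 the total (6) exceeds the number of distinct k-mers (5) because ‘TG’ is counted twice; for k = 3 and k = 4 the two numbers match, so every k-mer is unique.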

Why do we see these patterns?

Short k-mers are more likely to repeat by chance, as there are fewer possible combinations of the four bases (A, T, C and G): only 4 x 4 = 16 distinct 2-mers exist.

Longer k-mers have more possible combinations of the four bases (4 x 4 x 4 x 4 = 256 distinct 4-mers) and, as a result, tend to be more distinctive.
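To make the growth in possible combinations concrete, here is a small Python sketch. It assumes an alphabet of exactly four unambiguous bases (no N or other ambiguity codes), and the name `possible_kmers` is hypothetical.

```python
from itertools import product

BASES = "ACGT"  # assumption: only the four unambiguous DNA bases


def possible_kmers(k):
    """Number of distinct DNA k-mers that can exist: 4 choices per position."""
    return len(BASES) ** k


for k in (2, 3, 4):
    print(k, possible_kmers(k))
# 2 16
# 3 64
# 4 256

# Enumerating every 2-mer explicitly confirms the count for a small k.
all_2mers = ["".join(p) for p in product(BASES, repeat=2)]
print(len(all_2mers))  # 16
```

Each extra base multiplies the number of possible k-mers by four, so the chance that two positions in a sequence produce the same k-mer by coincidence drops quickly as k grows.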

Understanding how k-mer length affects uniqueness and similarity is the first step to understanding how k-mers are used in bioinformatics.