What is a k-mer?

A k-mer is a substring of length k taken from a biological sequence, where k can be any positive integer up to the length of the sequence.

If you are familiar with sliding windows, it’s a similar principle: k is the window size and the step is 1.

An example:

Take a look at the short DNA sequence below. Imagine moving along it one base at a time, extracting the substring of length k at each position.

    ATGCTGA

    To illustrate, let’s generate all possible 2-mers (overlapping substrings of length 2) from this sequence:

    ATGCTGA  ->  AT
    ATGCTGA  ->  TG
    ATGCTGA  ->  GC
    ATGCTGA  ->  CT
    ATGCTGA  ->  TG
    ATGCTGA  ->  GA

    Now we’ll generate all possible 3-mers (overlapping substrings of length 3):

    ATGCTGA  ->  ATG
    ATGCTGA  ->  TGC
    ATGCTGA  ->  GCT
    ATGCTGA  ->  CTG
    ATGCTGA  ->  TGA

    Next, let’s generate all possible 4-mers (overlapping substrings of length 4):

    ATGCTGA  ->  ATGC
    ATGCTGA  ->  TGCT
    ATGCTGA  ->  GCTG
    ATGCTGA  ->  CTGA
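The three walkthroughs above can be reproduced with a few lines of Python. This is a minimal sketch rather than a reference implementation; the function name `kmers` is chosen here purely for illustration.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, in order."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A window of size k fits at len(sequence) - k + 1 start positions.
    # If k is longer than the sequence, the range is empty and so is the result.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Print the 2-mers, 3-mers and 4-mers of the example sequence.
for k in (2, 3, 4):
    print(k, kmers("ATGCTGA", k))
```

Slicing with a step of 1 keeps the windows overlapping, which matches the "shift one base forward" description in the example.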

    K-mer patterns

    There are two key patterns to notice here:

    1) The longer the k-mer, the fewer substrings we can extract from the sequence.

    2) The longer the k-mer, the more likely it is to be unique.
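The first pattern follows from simple arithmetic: a sequence of length n has n - k + 1 positions where a window of size k fits. The sketch below checks this against the example; `kmer_count` is a hypothetical helper name, not part of any library.

```python
def kmer_count(sequence_length: int, k: int) -> int:
    """Number of overlapping k-mers in a sequence of the given length."""
    # Clamp at zero: no k-mers exist when k is longer than the sequence.
    return max(sequence_length - k + 1, 0)


n = len("ATGCTGA")  # 7 bases
for k in (2, 3, 4):
    print(f"k={k}: {kmer_count(n, k)} k-mers")
```

For the 7-base example this gives 6, 5 and 4 k-mers, the same counts as the lists above.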

    Short k-mers, like our 2-mers, repeat more often, as we see in our example, where the 2-mer ‘TG’ appears twice.

    Increasing the length of our k-mers increases uniqueness (see Table 1).

    Our 3-mers are all unique, though some share partial similarities (for example, ‘TGC’ and ‘TGA’ both start with TG).

    When we increase the length further and generate all 4-mers, we find each of them is unique.

    K-mer length   Number of k-mers   Unique k-mers   Similarities
    2              6                  5               repetition (TG occurs twice)
    3              5                  5               partial similarities (e.g., TGC & TGA share sequence)
    4              4                  4               all unique
    Table 1. Summary of k-mer patterns
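The counts in Table 1 can be recomputed by tallying each k-mer. The following is a rough sketch using Python's standard `collections.Counter`; the function name `kmer_summary` and its return shape are assumptions made for this example.

```python
from collections import Counter


def kmer_summary(sequence: str, k: int) -> tuple[int, int, dict[str, int]]:
    """Return (total k-mers, distinct k-mers, k-mers seen more than once)."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())      # every window, repeats included
    distinct = len(counts)            # each different k-mer counted once
    repeated = {kmer: n for kmer, n in counts.items() if n > 1}
    return total, distinct, repeated


for k in (2, 3, 4):
    total, distinct, repeated = kmer_summary("ATGCTGA", k)
    print(f"k={k}: total={total}, unique={distinct}, repeated={repeated}")
```

Here "unique" means the number of distinct k-mers, which is how the table counts 5 unique 2-mers out of 6 even though TG occurs twice.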

    Why do we see these patterns?

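    Short k-mers are more likely to repeat by chance, as there are fewer possible combinations of the four bases (A, T, C and G). With four choices at each position there are 4^k possible k-mers, so only 16 different 2-mers exist.

    Longer k-mers have more possible combinations of the four bases (64 possible 3-mers, 256 possible 4-mers) and as a result tend to be more distinctive.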
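To make the "possible combinations" argument concrete, the sketch below compares how many different k-mers could exist with how many windows the example sequence actually contains. The name `possible_kmers` is illustrative, and the default of four letters assumes a plain A/C/G/T alphabet with no ambiguity codes.

```python
def possible_kmers(k: int, alphabet_size: int = 4) -> int:
    """Number of different k-mers over an alphabet (4 letters for DNA)."""
    # Each of the k positions can hold any of the alphabet's letters.
    return alphabet_size ** k


sequence = "ATGCTGA"
for k in (2, 3, 4):
    windows = len(sequence) - k + 1
    print(f"k={k}: {possible_kmers(k)} possible k-mers, {windows} windows in the sequence")
```

As k grows, the pool of possible k-mers grows exponentially while the number of windows shrinks, so a chance repeat becomes less likely.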

    Understanding how k-mer length affects uniqueness and similarity is the first step to understanding how k-mers are used in bioinformatics.

    In the next post, we’ll look at what makes k-mers so powerful in genomics!
