grep, cut and piping on the command line

In an earlier post we were introduced to the Linux command line and command-line tools. Now we will explore two essential tools, grep and cut, that can be used to wrangle text on the command line.

Using grep and cut

Global Regular Expression Print or “grep”

grep is a command-line tool for searching text patterns within files. The user can specify a search pattern using regular expressions and grep will print the lines containing the matched pattern. It’s an incredibly useful tool, particularly when you need to quickly search through large files.

Basic usage of grep:

grep pattern filename

Example:

grep "findme" example.txt

This command will print out all lines containing our pattern (the word “findme”) in the file example.txt.
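To make the basic usage concrete, here is a minimal sketch that builds a small example.txt and runs a few grep variations on it. The file contents are made up for illustration; the flags used (-i, -c, -n, -v) are standard grep options.

```shell
# Create a small example file (the contents are hypothetical)
printf 'first line\nplease findme here\nFindMe in capitals\nlast line\n' > example.txt

# Print lines containing the pattern (case-sensitive by default)
grep "findme" example.txt

# -i ignores case, so the "FindMe" line matches as well
grep -i "findme" example.txt

# -c prints the number of matching lines instead of the lines themselves
grep -c -i "findme" example.txt

# -n prefixes each matching line with its line number
grep -n "findme" example.txt

# -v inverts the match and prints only lines that do NOT contain the pattern
grep -v -i "findme" example.txt
```

By default grep is case-sensitive, which is why the plain search only finds one of the two lines here.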

“cut” command in Linux

cut is a command-line tool that lets the user cut out sections from each line of a file. It’s extremely useful when dealing with structured, delimited data, e.g. CSV files, where values are separated by commas.

Basic usage of cut:

cut -d delimiter -f fields filename

Example:

cut -d "," -f 1,3 example.csv

This command extracts the first and third fields from a CSV file, where the fields are delimited by commas (-d ",").
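As a rough illustration, the sketch below creates a tiny example.csv and applies cut to it. The column names and values are invented for this example.

```shell
# Create a small CSV file (hypothetical contents)
printf 'name,age,city\nalice,30,london\nbob,25,paris\n' > example.csv

# Keep fields 1 and 3 of every line, splitting on commas
cut -d "," -f 1,3 example.csv

# A range also works: keep field 2 and everything after it
cut -d "," -f 2- example.csv
```

Note that cut splits on every occurrence of the delimiter, so it is best suited to simple files where the values themselves contain no commas.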

Combining grep and cut

A really nice aspect of command-line tools is that they can be piped together: the pipe operator | sends the output of one command to the next command as its input, meaning you can come up with your own piped recipe for your particular task.

Using these tools as building blocks, we can create many different combinations, which is what makes command-line tools so powerful.
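A small sketch of the idea, using an invented people.csv file: grep first filters the rows, then cut trims each surviving row down to one field, and a third tool can be chained on in the same way.

```shell
# Create a small CSV file (hypothetical contents)
printf 'alice,30,london\nbob,25,paris\ncarol,41,london\n' > people.csv

# grep keeps only the rows mentioning "london";
# cut then keeps only the first comma-separated field of those rows
grep "london" people.csv | cut -d "," -f 1

# Pipes can be chained further: wc -l counts the lines it receives
grep "london" people.csv | cut -d "," -f 1 | wc -l
```

Each tool only does one small job, and the pipe is what turns them into a recipe.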

That’s great and all, but what about a real life example?


Extracting sequence IDs from an Escherichia coli genome

Imagine you have the annotated genome from an E. coli strain and you want to find all the 16S rRNA sequences. Note: if you want to try out this example yourself, you can download the example from the sequence-gazing repo here.

We don’t want to manually go through the multi-FASTA file (see What is a FASTA file? for a recap on FASTA files) to find them. Instead, we can use grep.

On its own, this grep command will find all entries that contain the text “16S” in the header.

grep "16S" protein.faa

This should print the header lines for the following FASTA IDs (only the ID part of each header is shown here):

>AAC73162.1
>AAC73193.1
>AAC74905.2
>AAC75244.1
>AAC75983.2
>AAC76180.1
>AAC76314.1
>AAC76490.1
>AAC76522.1
>AAC76763.1
>AAC77324.1

But what if we want to save the IDs of these sequences to use later?

Now we can use the command-line tool cut in combination with grep, using the pipe operator | and the output redirection operator >, to save the result to a text file.

grep "16S" protein.faa | cut -d " " -f 1 > 16S_ids.txt
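To see what each step of this pipeline does without downloading the genome, here is a sketch on a toy multi-FASTA file. The file name toy.faa, the headers and the sequences are all made up for illustration and are not real E. coli data. Here cut splits each header on spaces (-d " ") and keeps only the first field, which is the ID.

```shell
# Build a toy multi-FASTA file (made-up headers and sequences, not real data)
printf '>seq1 16S rRNA methyltransferase\nMKT\n>seq2 hypothetical protein\nMAL\n>seq3 16S rRNA processing protein\nMSE\n' > toy.faa

# Same recipe as above: find headers mentioning 16S, keep only the ID field,
# and redirect the result into a text file
grep "16S" toy.faa | cut -d " " -f 1 > toy_16S_ids.txt
cat toy_16S_ids.txt

# Optional extra step: cut -c 2- drops the first character of each line,
# removing the leading ">" so only the bare IDs remain
grep "16S" toy.faa | cut -d " " -f 1 | cut -c 2- > toy_16S_ids_clean.txt
cat toy_16S_ids_clean.txt
```

Whether you keep the leading ">" depends on what you plan to do with the IDs later; some tools expect bare IDs, so the extra cut step can be handy.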