How to read data into R - Sequence Gazing

Before we can use R for data analysis we first need to read in our data.

R can read in many different types of file formats.

In this post I will cover reading data from a CSV file as this format is generally the easiest to work with and one of the most commonly used file formats in data analysis.

CSV files

This format is my preferred go-to for any data I’m working with that isn’t DNA sequence data. CSV (comma-separated values) files contain data in which the values on each line are separated by commas. See the example data below:

Label,X,Y
A,1,1.046
A,2,0.77
A,3,0.533
A,4,0.358
A,5,0.311
A,6,0.182

This example data is available in the file Example.csv which is located under Example_data here.

Example.csv contains various scores (in the Y column) for each codon (in the X column) for each sequence (in the Label column).

The data is deliberately set out in long format, with one observation per row, because this is the layout that ggplot2 works with most easily.

It is important to recognise that:

1. Both the data labels and the data are separated by commas.

2. The first line represents the column headers.

So how do we read this data into R?

R has different data reading functions for different file formats.

To read the data from Example.csv into R type the following into RStudio:

my_data <- read.csv("Example.csv")

This will read in the CSV file and assign your data, as a data frame, to the variable my_data.
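Once the data is in R, it is worth checking that it was read as expected. The following is a minimal sketch rather than part of the original workflow: it recreates the example data shown above in a temporary file (example_path is a name I have made up for illustration) so that it runs anywhere, then inspects the result with base R functions.

```r
# Recreate the example data from this post in a temporary file so the
# snippet runs anywhere. With the real file, use "Example.csv" instead.
example_path <- tempfile(fileext = ".csv")  # hypothetical name, for illustration only
writeLines(c("Label,X,Y",
             "A,1,1.046",
             "A,2,0.77",
             "A,3,0.533",
             "A,4,0.358",
             "A,5,0.311",
             "A,6,0.182"),
           example_path)

my_data <- read.csv(example_path)

head(my_data)  # first few rows of the data frame
str(my_data)   # column names and the type R chose for each column
dim(my_data)   # number of rows, then number of columns
```

If str() reports a type you did not expect for a column, that is usually the first sign that something in the file is not laid out the way you thought.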

Before running the code, remember to set the working directory to the folder containing the Example.csv file. Otherwise R will not be able to find the file.
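The working directory can also be checked and set from the console. This is a small sketch using base R functions; the path in the commented-out setwd() call is a placeholder, not a real location, so replace it with your own folder.

```r
# Show the folder R is currently working in
current_dir <- getwd()
print(current_dir)

# Change the working directory. The path below is a placeholder:
# replace it with the folder that holds Example.csv, then uncomment.
# setwd("path/to/your/folder")

# Check whether R can see the file from the current working directory
found <- file.exists("Example.csv")
print(found)

# List the CSV files R can see in the working directory
list.files(pattern = "\\.csv$")
```

If file.exists() returns FALSE, read.csv("Example.csv") will fail with a "cannot open file" error, so this is a quick check to run first.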

Once the code has run, my_data should appear in the Environment pane of your RStudio window.

The read.csv() function also accepts the arguments “header” and “fill”. By default both are set to TRUE.

What are the header and fill arguments?

The header argument specifies whether the first line of the file contains column labels. The fill argument specifies whether or not to add blank fields to rows that have fewer fields than the rest of the file (R can tell this when the rows have unequal numbers of fields).

# pads rows that are too short with blank fields (the default)
my_data <- read.csv("Example.csv", fill = TRUE)

# does not pad short rows; R stops with an error if a row has too few fields
my_data <- read.csv("Example.csv", fill = FALSE)

# treats the first line as the column headers (the default)
my_data <- read.csv("Example.csv", header = TRUE)

# treats the first line as data; R names the columns V1, V2, V3
my_data <- read.csv("Example.csv", header = FALSE)
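To see what fill actually does, we need a file in which one row is shorter than the others. The sketch below uses a few made-up lines (not the contents of either example file) passed through the text argument of read.csv(), so no file on disk is needed. The variable names ragged, filled and result are my own.

```r
# Made-up data: the second data row is missing its last field entirely
# (note that there is no trailing comma after the 2).
ragged <- paste(c("Label,X,Y",
                  "A,1,1.046",
                  "A,2",
                  "A,3,0.533"),
                collapse = "\n")

# fill = TRUE (the read.csv default): the short row is padded,
# and the padded numeric field shows up as NA.
filled <- read.csv(text = ragged, fill = TRUE)
print(filled)

# fill = FALSE: the short row makes R stop with an error.
# tryCatch() captures the error message instead of halting the script.
result <- tryCatch(
  read.csv(text = ragged, fill = FALSE),
  error = function(e) conditionMessage(e)
)
print(result)
```

In practice the default of fill = TRUE is the forgiving choice, while fill = FALSE can be useful when you would rather be told that a row is malformed.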

What if our data has no headers?

Sometimes we have data that does not have a header. In the next example we will look at Example2.csv located here under Example_data.

In Example2.csv we have the same data as in Example.csv, but this time the columns are not labelled, i.e. they do not have headers, and there are also some missing values.

In this case we want to read in the data and tell R there are no headers. As “fill = TRUE” is set by default we do not need to specify this in the command.

Type the following into your R script to read in the data from Example2.csv and assign the data to the variable my_data2:

my_data2 <- read.csv("Example2.csv", header = FALSE)

Looking at my_data2 we see that there are no headers, instead R has labelled the columns V1, V2 and V3.
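If we would rather have meaningful names than V1, V2 and V3, read.csv() accepts a col.names argument. The following is a sketch on a few made-up headerless lines rather than on Example2.csv itself, and the names Label, X and Y are borrowed from the first example on the assumption that the columns mean the same thing.

```r
# Made-up headerless data, passed in through the text argument
headerless <- paste(c("A,1,1.046",
                      "A,2,0.77",
                      "A,3,0.533"),
                    collapse = "\n")

# Without col.names, R falls back to V1, V2, V3
default_names <- read.csv(text = headerless, header = FALSE)
names(default_names)

# With col.names, we supply the labels ourselves
named <- read.csv(text = headerless, header = FALSE,
                  col.names = c("Label", "X", "Y"))
names(named)
```

The same result can be reached after reading by assigning to names(), but supplying col.names keeps the labelling in one place.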

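We can also see that the missing values from our CSV file have been assigned the value of “NA”, which is how R marks missing data. Note that this applies to empty fields in numeric columns; by default an empty field in a text column is read as an empty string instead.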
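Missing values are easy to overlook in a large file, so it helps to count them. The sketch below again uses made-up lines rather than Example2.csv: one row has an empty numeric field and one has an empty text field. The variable names are my own.

```r
# Made-up data: row 2 has an empty numeric field, row 3 an empty text field
with_gaps <- paste(c("A,1,1.046",
                     "A,2,",
                     ",3,0.533"),
                   collapse = "\n")

my_gaps <- read.csv(text = with_gaps, header = FALSE)

is.na(my_gaps$V3)        # which values in the third column are missing
colSums(is.na(my_gaps))  # number of NA values in each column

# By default the empty text field in V1 is not NA. Adding "" to
# na.strings tells R to treat empty fields as missing everywhere.
my_gaps_na <- read.csv(text = with_gaps, header = FALSE,
                       na.strings = c("NA", ""))
colSums(is.na(my_gaps_na))
```

Whether an empty text field should count as missing depends on the data, so it is worth deciding this deliberately rather than relying on the default.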

Next we will look at the similar read.table() function and how to read FASTA sequences into R.