Learning to use R (part 2)

From the first tutorial you should already have a good idea of how to work in R: how to do basic arithmetic, use some functions, how to work with multiple values at the same time, how to read files and how to do some basic plots. What you need now is a bit more practice, more “R vocabulary” (new functions), and a bit more understanding about how R is working internally to follow your wishes.

The object class

Some of you might have already come across problems that had to do with restrictions that R imposes according to the class of object you are working with.

For all of us to be on the same page, let’s load up a table in R and try to do something simple, like calculating the mean() of one of the rows.

Let me introduce a new trick that R can do: read files directly over the internet. If for any reason the following does not work for you, you can always read the file just as you did in the first tutorial. So, we will work with the same file that you already know: tab.txt.

Instead of downloading and saving in your working directory, sometimes it is easier to copy the link and give it directly to the read.table() function. To do this, simply right-click on the link, select “Copy Link Address” or “Copy Link” (depending on your browser) and then paste into your R code (between quotes!). It should look something like this:

tab <- read.table("https://cei.bio.ed.ac.uk/basic_R/data/tab.txt")

Note that the https link in your case can look different. This is perfectly ok, as long as you have obtained it by right-click on the tab.txt link above.

To make sure it worked, you can use some new functions: the first prints out the dimensions of the object (# rows and # columns for a table), the others allow you to see the first part (head) or last part (tail) of the object.

dim(tab)

## [1] 50  3

head(tab)

##       cond1 cond2 cond3
## gene1   252   233   182
## gene2   141   179   216
## gene3   165   195   175
## gene4   174   190   188
## gene5   192   231   194
## gene6   225   142   197

tail(tab)

##        cond1 cond2 cond3
## gene45   219   208   170
## gene46   235   157   199
## gene47   171   266   210
## gene48   315   195   171
## gene49   215   198   229
## gene50   206   210   191

Does everything look ok? Great, let’s continue.

Working as we did in the first tutorial, it’s pretty easy to calculate the mean() of the first column:

mean(tab[,1])

## [1] 204.44

but, the mean() of the first row doesn’t seem to work:

mean(tab[1,])

## Warning in mean.default(tab[1, ]): argument is not numeric or logical: returning NA

## [1] NA

This is because tab is an object of class data.frame. You can check out the class of any object with the class() function.

class(tab)

## [1] "data.frame"

The data.frame is a very flexible class of object. It allows us to have any kind of values in each of the columns. The only limitations is that each column has to be of the same length, which is generally true when we have what we usually call a table.

Another way of thinking about this is that our data are in the columns, and each column stores a different property (different type of data) of the things we are studying (genes, insects, leafs, portions of land, etc). For instance, we might have collected information about many animals of a certain species, and in one column we can have the weight of each animal, in the next column the length, then the sex, the age, the colour. If you imagine an example like this, you will see why it might not be a good idea to allow calculations by row.

To be even more explicit: does it make sense to calculate the average between the weight, the length, the sex, the age and the colour of one animal? On the other hand, calculating the average of the weights of all animals sounds perfectly reasonable!

Converting between classes

What happens when we know that all the values in our table are compatible with operations like the mean()?

In our tab object, we know that all the values are measurements of gene expression, so it makes sense to be able to calculate the average expression across all genes in one condition, or to calculate the average expression of one gene across all conditions.

If we want all values in a table to be equal, what we need is an object of class matrix. In R, we can convert objects between different classes by using a series of as.X() functions. In this case:

mat <- as.matrix(tab)

Another, more generic, way of converting classes is to use the as() function and the new class as the second argument: as(tab, "matrix")

We now have two different objects: tab and mat. They contain the same information, but are stored as objects with a different class. Let’s have a closer look at them.

dim(tab)

## [1] 50  3

dim(mat)

## [1] 50  3

head(tab)

##       cond1 cond2 cond3
## gene1   252   233   182
## gene2   141   179   216
## gene3   165   195   175
## gene4   174   190   188
## gene5   192   231   194
## gene6   225   142   197

head(mat)

##       cond1 cond2 cond3
## gene1   252   233   182
## gene2   141   179   216
## gene3   165   195   175
## gene4   174   190   188
## gene5   192   231   194
## gene6   225   142   197

class(tab)

## [1] "data.frame"

class(mat)

## [1] "matrix" "array"

They look identical, except when we come to the class property. And this is exactly what allows us to now be able to extract information from rows and operate on them as we wanted to initially.

mat[1,]

## cond1 cond2 cond3 
##   252   233   182

mean(mat[1,])

## [1] 222.3333

You might also notice that extracting a row for these two classes of objects gives you a slightly different result:

tab[1,]

##       cond1 cond2 cond3
## gene1   252   233   182

mat[1,]

## cond1 cond2 cond3 
##   252   233   182

class(tab[1,])

## [1] "data.frame"

class(mat[1,])

## [1] "integer"

For a data.frame, extracting a single row still gives you a data.frame that now has one row and the same number of columns (use the dim() function to verify this). Extracting a row of a matrix gives you the actual values as a plain vector (integer values in this case).

Classes of vectors

We have already been using vectors, such as our ages object we created by combining several numbers together. A vector is a kind of object in R that only has one dimension: it allows values to be stored in positions 1, 2, 3… n, where n is the length of the vector. Vectors can be of different classes: we can have integer or numeric (floating-point) numbers, character for text, logical for TRUE/FALSE values. But every value stored at each position of one vector has to be of the same class.

In our previous example, it was quite obvious that the class of the object could interfere with our work: the mean() function simply did not work on a data.frame. But the class can also interfere in less obvious ways.

Let us make a simple vector with a run of numbers:

x <- 1:15
x

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

The from:to operator allows us to quickly produce a range of values between from and to. A more flexible function is seq(). Check out the help for that function if you are curious.

We can always check the class() of the resulting object:

class(x)

## [1] "integer"

And we can convert the object to a different compatible class. For example, we can turn our numbers into text.

y <- as.character(x)
class(y)

## [1] "character"

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"

In this case, it does become obvious that our numbers are no longer numbers, because R puts them between quotes.

You probably already guessed this, but we can no longer perform certain operations on the new y object. Some might even give you an Error.

y + 1

Other functions might give you unexpected results:

max(y)

## [1] "9"

y <- sort(y)
y

##  [1] "1"  "10" "11" "12" "13" "14" "15" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

If at any point we have numbers that are stored as text (character), we can always change them back into numbers.

z <- as.integer(y)
class(z)

## [1] "integer"

##  [1]  1 10 11 12 13 14 15  2  3  4  5  6  7  8  9

sort(z)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Extracting by using names instead of positions

You might have wondered if there is a way of using the labels (names) to extract values from our R objects. Well, there is. Instead of using the numerical position (index) that we want, if the object has names we can use them instead. Let’s go back to our tab object and see some examples.

tab["gene22" , ]

##        cond1 cond2 cond3
## gene22   223   186   191

tab["gene22" , "cond2"]

## [1] 186

tab[c("gene3","gene11") , ]

##        cond1 cond2 cond3
## gene3    165   195   175
## gene11   125   194   215

This can be especially convenient when we have large objects where it is difficult to see which positions we want, but where the labels are meaningful.

For 2-dimensional objects (e.g. data.frame or matrix), you can access (or overwrite) the names by using two new functions:

rownames(tab)

##  [1] "gene1"  "gene2"  "gene3"  "gene4"  "gene5"  "gene6"  "gene7"  "gene8"  "gene9"  "gene10" "gene11" "gene12"
## [13] "gene13" "gene14" "gene15" "gene16" "gene17" "gene18" "gene19" "gene20" "gene21" "gene22" "gene23" "gene24"
## [25] "gene25" "gene26" "gene27" "gene28" "gene29" "gene30" "gene31" "gene32" "gene33" "gene34" "gene35" "gene36"
## [37] "gene37" "gene38" "gene39" "gene40" "gene41" "gene42" "gene43" "gene44" "gene45" "gene46" "gene47" "gene48"
## [49] "gene49" "gene50"

colnames(tab)

## [1] "cond1" "cond2" "cond3"

These are analogous to the names(ages) function we used in the first tutorial.

Extracting (either using positions or names) can also be used when we want to reorder our results.

tab[c("gene11","gene3") , c("cond3","cond1","cond2")]

##        cond3 cond1 cond2
## gene11   215   125   194
## gene3    175   165   195

For some classes of objects, there is a $ operator as a quick way of using labels to extract. For a data.frame you can use this for extracting all the values from a column. But be careful, because it does not work on a matrix.

head(tab$cond2)

## [1] 233 179 195 190 231 142

head(tab$cond3)

## [1] 182 216 175 188 194 197

You might see a lot of code with this syntax, because it is quicker to type.

Try it out on the mat object to see that it doesn’t work.

A brief introduction to lists

Probably the most flexible class in R is the list. It is a kind of vector, but at each position it allows us to store any other kind of object. For example, we can have a single number in position 1, a data.frame in position 2 and another list in position 3!

For now, I only want you to be aware of lists, since some functions we will be using during the course will use or return list-like objects. But do not worry if you feel this is beyond you. It is more important to be comfortable with the more basic classes of objects.

An example of a list:

myList <- list(first=1:5, second=c("Hello","everyone!"), third=head(tab))
class(myList)

## [1] "list"

myList

## $first
## [1] 1 2 3 4 5
## 
## $second
## [1] "Hello"     "everyone!"
## 
## $third
##       cond1 cond2 cond3
## gene1   252   233   182
## gene2   141   179   216
## gene3   165   195   175
## gene4   174   190   188
## gene5   192   231   194
## gene6   225   142   197

When printing out the content of a list that has names (as our example), R is reminding us that we can use the ‘$’ operator to access the content at each position. Shall we try it out?

myList$second

## [1] "Hello"     "everyone!"

A list behaves a bit like a data.frame in some cases.

Technically, a data.frame is a special kind of list with vectors of the same length at each position.

See how the different ways in which we know how to extract values from an object can return slightly different results:

myList["first"]

## $first
## [1] 1 2 3 4 5

myList$first

## [1] 1 2 3 4 5

By now, you might not be surprised by the following:

mean(myList["first"])

## Warning in mean.default(myList["first"]): argument is not numeric or logical: returning NA

## [1] NA

mean(myList$first)

## [1] 3

Perhaps the class can clarify this mystery?

class(myList["first"])

## [1] "list"

class(myList$first)

## [1] "integer"

Analogous to what was happening with a data.frame, using ["first"] is returning a list even if it now only has one element. Since lists can have any sort of values, the mean() function does not work. We need to extract the actual values stored at that position to be able to correctly work with them. That is what we are doing when using the $ operator.

There is one more way of extracting the actual contents at one position in a list. Double brackets ([[]]) can be used with single numerical positions, or single labels, to extract the contents of a list-like object.

myList[[1]]

## [1] 1 2 3 4 5

myList[["first"]]

## [1] 1 2 3 4 5

Final exercise

To recap what we have seen in this tutorial, I will walk you through an exercise and the results you should get, but without showing some of the R code. So in some places you will have to think and try out your own R code if you want to get the same results.

We will work with the file ensembl_info.tab that contains real information about some genomes that have been sequenced.

I originally downloaded all the information in this file from the Ensembl database, which curates a lot of information about fully sequenced genomes. The main site contains mostly vertebrate genomes, together with some other model organisms.

Feel free to load the contents of this file into R by your preferred method. Remember that one way is to right-click on the link, select “Copy Link Address” or “Copy Link” (depending on your browser) and then paste into your R code.

Make sure it has loaded correctly by exploring the contents briefly.

dim(speciesTab)

## [1] 25 10

Do you remember how to obtain all the names of the rows?

##  [1] "Anole_lizard"    "C_elegans"       "Cat"             "Chicken"         "Chimpanzee"      "Cow"            
##  [7] "Dog"             "Dolphin"         "Duck"            "Elephant"        "Fruitfly"        "Fugu"           
## [13] "Gorilla"         "Horse"           "Human"           "Microbat"        "Mouse"           "Panda"          
## [19] "Pig"             "Platypus"        "Rat"             "Stickleback"     "Tasmanian_devil" "Xenopus"        
## [25] "Zebrafish"

You will see that the table contains information on 25 genomes, from the Anole_lizard to the Zebrafish.

head(speciesTab)

##              coding_genes ncrna_genes pseudo_genes coding_gene_avg_length ncrna_gene_avg_length pseudo_gene_avg_length
## Anole_lizard        18595        7168          157               35332.78              10318.72                 878.24
## C_elegans           20191         933         1922                3149.75                338.86                1474.68
## Cat                 19563        2620          494               53520.62                119.25                4199.48
## Chicken             16779        1660          312               30172.87                 98.55               12726.05
## Chimpanzee          23302        9702          485               48874.07               7235.28                 820.99
## Cow                 21861        3747          492               43952.40                114.05                3748.61
##              coding_trans_avg_length ncrna_trans_avg_length pseudo_trans_avg_length genome_length
## Anole_lizard                 2572.26                 548.16                  868.57    1799143587
## C_elegans                    1412.30                 257.29                  899.83     100286401
## Cat                          2190.87                 119.25                  881.91    2455541136
## Chicken                      2420.99                  98.55                 2217.98    1230258557
## Chimpanzee                   2674.48                 334.79                  761.29    3309577922
## Cow                          2285.77                 114.05                  949.55    2670422299

The first three columns contain the total number of genes in each genome. Genes can be protein-coding (coding_genes), non-coding RNA (ncrna_genes) or pseudo-genes (pseudo_genes). The last column contains the genome length. The other columns contain the average length of different types of genes and transcripts (but we won’t be using these Today).

Did you see that the Human genome was in there? Let’s find out more about our genome. Can you extract the row for the Human genome?

##       coding_genes ncrna_genes pseudo_genes coding_gene_avg_length ncrna_gene_avg_length pseudo_gene_avg_length
## Human        22534        7568        17085               62740.53                134.52                 4200.8
##       coding_trans_avg_length ncrna_trans_avg_length pseudo_trans_avg_length genome_length
## Human                 3593.51                 134.52                  815.52    3096649726

Just how big is our genome? It’s sometimes difficult to read a large number.

The length of a genome is given by the number of nucleotide bases (A, C, G or Ts) of which it is composed. We tend to think in multiples of a million (1e6, megabases) or even thousand-millions (1e9, gigabases).

1e9 is the way in which we write scientific notation. It is equivalent to 1 x 10⁹. Another example: 1e-3 is 1 x 10^-3 or 0.001.

speciesTab["Human","genome_length"]/1e6

## [1] 3096.65

speciesTab["Human","genome_length"]/1e9

## [1] 3.09665

We could even print out this in a more informative manner, by pasting the value together with some text. New functions ahead!

print(paste("The Human genome is composed of", speciesTab["Human","genome_length"]/1e9, "gigabases."))

## [1] "The Human genome is composed of 3.096649726 gigabases."

Now, let’s look more carefully at the first three columns of the table:

##                 coding_genes ncrna_genes pseudo_genes
## Anole_lizard           18595        7168          157
## C_elegans              20191         933         1922
## Cat                    19563        2620          494
## Chicken                16779        1660          312
## Chimpanzee             23302        9702          485
## Cow                    21861        3747          492
## Dog                    20199        2972          613
## Dolphin                16550        3790          925
## Duck                   16765         477          455
## Elephant               20033        2644          568
## Fruitfly               13947         446          330
## Fugu                   21383        1008          816
## Gorilla                21587        7760          522
## Horse                  20912        1756          402
## Human                  22534        7568        17085
## Microbat               19728        4408         1713
## Mouse                  22198       11838        13946
## Panda                  19343        3141          778
## Pig                    21280        2173         1626
## Platypus               17417        7188          207
## Rat                    22245        8736         1668
## Stickleback            20787        1617           52
## Tasmanian_devil        18788        1490          178
## Xenopus                19983        1364           81
## Zebrafish              30140        4814          325

In fact, let’s save those three columns as a new object called genes.

Perhaps we could try plotting this information, to be able to visually compare the number of genes of different types between the different genomes. Perhaps a barplot?

barplot(genes)

Error in barplot.default(genes) : ‘height’ must be a vector or a matrix

Oh yes, what was the class?

## [1] "data.frame"

So, let’s convert it to a matrix, and check the class again.

## [1] "matrix" "array"

This should work now…

barplot(genes)

Almost there. Right now the bars are stacked, but adding up the number of coding genes into one big bar is not very useful. The function barplot() gives us some options, which you can find in the help page. Let’s try out a few:

barplot(genes, beside=TRUE)

Still, barplot() is grouping our values by column (type of gene), while it might be more informative in this case to group by row (species).

The matrix transpose function t() helps us rotate a matrix, so that it’s rows will become columns and vice-versa.

We can use it on the fly within another function, or save the transposed object with a new name; however you prefer.

barplot(t(genes), beside=TRUE, las=2)

The las parameter is for label-style. It allows you to place the axes labels in different ways: parallel or perpendicular to the axes, or always horizontal/vertical. It takes values between 0-3 for these four options. There are many other graphical parameters that you can change, for which you can find help in ?par.

Check out the ?barplot help page if you want to play around with different kinds of barplot.

Independent learning:

The following plot is just another way in which you can show the same information. I have simply added more arguments to the previous barplot command. Spend some time trying to understand the information available in the ?barplot help page. You can also use Google or ask an AI for help, but as always, make sure you understand the code that you are using.