From the first practical you should already have a good idea of how to work in R: how to do basic arithmetic, use some functions, how to work with multiple values at the same time, how to read files and how to do some basic plots. What you need now is a bit more practice, more “R vocabulary” (new functions), and a bit more understanding about how R is working internally to follow your wishes.
Some of you might have already come across problems that had to do with restrictions that R imposes according to the class of object you are working with.
For all of us to be on the same page, let’s load up a table in R and try to do something simple, like calculating the mean() of one of the rows.
Let me introduce a new trick that R can do: read files directly over the internet. If for any reason the following does not work for you, you can always read the file just as you did in the first practical. So, we will work with the same file that you already know: tab.txt.
Instead of downloading and saving in your working directory, sometimes it is easier to copy the link and give it directly to the read.table() function. To do this, simply right-click on the link, select Copy Link Address and then paste into your R code (between quotes!). It should look something like this:
tab <- read.table("https://www.learn.ed.ac.uk/bbcswebdav/courses/PGBI110032021-2SV1SEM1/data/tab.txt")To make sure it worked, you can use some new functions: the first prints out the dimensions of the object (# rows and # columns for a table), the others allow you to see the first part (head) or last part (tail) of the object.
dim(tab)## [1] 50 3
head(tab)## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
tail(tab)## cond1 cond2 cond3
## gene45 219 208 170
## gene46 235 157 199
## gene47 171 266 210
## gene48 315 195 171
## gene49 215 198 229
## gene50 206 210 191
Does everything look ok? Great, let’s continue.
Working as we did in the first practical, it’s pretty easy to calculate the mean() of the first column:
mean(tab[,1])## [1] 204.44
but, the mean() of the first row doesn’t seem to work:
mean(tab[1,])## Warning in mean.default(tab[1, ]): argument is not numeric or logical: returning NA
## [1] NA
This is because tab is an object of class data.frame. You can check out the class of any object with the class() function.
class(tab)## [1] "data.frame"
The data.frame is a very flexible class of object. It allows us to have any kind of values in each of the columns. The only limitations is that each column has to be of the same length, which is generally true when we have what we usually call a table.
Another way of thinking about this is that our data are in the columns, and each column stores a different property (different type of data) of the things we are studying (genes, insects, leafs, portions of land, etc). For instance, we might have collected information about many animals of a certain species, and in one column we can have the weight of each animal, in the next column the length, then the sex, the age, the colour. If you imagine an example like this, you will see why it might not be a good idea to allow calculations by row.
To be even more explicit: does it make sense to calculate the average between the weight, the length, the sex, the age and the colour of one animal? On the other hand, calculating the average of the weights of all animals sounds perfectly reasonable!
What happens when we know that all the values in our table are compatible with operations like the mean()?
In our tab object, we know that all the values are measurements of gene expression, so it makes sense to be able to calculate the average expression across all genes in one condition, or to calculate the average expression of one gene across all conditions.
If we want all values in a table to be equal, what we need is an object of class matrix. In R, we can convert objects between different classes by using a series of as.X() functions. In this case:
mat <- as.matrix(tab)Another, more generic, way of converting classes is to use the
as()function and the new class as the second argument:as(tab, "matrix")
We now have two different objects: tab and mat. They contain the same information, but are stored as objects with a different class. Let’s have a closer look at them.
dim(tab)## [1] 50 3
dim(mat)## [1] 50 3
head(tab)## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
head(mat)## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
class(tab)## [1] "data.frame"
class(mat)## [1] "matrix" "array"
They look identical, except when we come to the class property. And this is exactly what allows us to now be able to extract information from rows and operate on them as we wanted to initially.
mat[1,]## cond1 cond2 cond3
## 252 233 182
mean(mat[1,])## [1] 222.3333
You might also notice that extracting a row for these two classes of objects gives you a slightly different result:
tab[1,]## cond1 cond2 cond3
## gene1 252 233 182
mat[1,]## cond1 cond2 cond3
## 252 233 182
class(tab[1,])## [1] "data.frame"
class(mat[1,])## [1] "integer"
For a data.frame, extracting a single row still gives you a data.frame that now has one row and the same number of columns (use the dim() function to verify this). Extracting a row of a matrix gives you the actual values as a plain vector (integer values in this case).
We have already been using vectors, such as our ages object we created by combining several numbers together. A vector is a kind of object in R that only has one dimension: it allows values to be stored in positions 1, 2, 3… n, where n is the length of the vector. Vectors can be of different classes: we can have integer or numeric (floating-point) numbers, character for text, logical for TRUE/FALSE values. But every value stored at each position of one vector has to be of the same class.
In our previous example, it was quite obvious that the class of the object could interfere with our work: the mean() function simply did not work on a data.frame. But the class can also interfere in less obvious ways.
Let us make a simple vector with a run of numbers:
x <- 1:15
x## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The
from:tooperator allows us to quickly produce a range of values betweenfromandto. A more flexible function isseq(). Check out the help for that function if you are curious.
We can always check the class() of the resulting object:
class(x)## [1] "integer"
And we can convert the object to a different compatible class. For example, we can turn our numbers into text.
y <- as.character(x)
class(y)## [1] "character"
y## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
In this case, it does become obvious that our numbers are no longer numbers, because R puts them between quotes.
You probably already guessed this, but we can no longer perform certain operations on the new y object. Some might even give you an Error.
y + 1Other functions might give you unexpected results:
max(y)## [1] "9"
y <- sort(y)
y## [1] "1" "10" "11" "12" "13" "14" "15" "2" "3" "4" "5" "6" "7" "8" "9"
If at any point we have numbers that are stored as text (character), we can always change them back into numbers.
z <- as.integer(y)
class(z)## [1] "integer"
z## [1] 1 10 11 12 13 14 15 2 3 4 5 6 7 8 9
sort(z)## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
You might have wondered if there is a way of using the labels (names) to extract values from our R objects. Well, there is. Instead of using the numerical position (index) that we want, if the object has names we can use them instead. Let’s go back to our tab object and see some examples.
tab["gene22" , ]## cond1 cond2 cond3
## gene22 223 186 191
tab["gene22" , "cond2"]## [1] 186
tab[c("gene3","gene11") , ]## cond1 cond2 cond3
## gene3 165 195 175
## gene11 125 194 215
This can be especially convenient when we have large objects where it is difficult to see which positions we want, but where the labels are meaningful.
For 2-dimensional objects (e.g. data.frame or matrix), you can access (or overwrite) the names by using two new functions:
rownames(tab)## [1] "gene1" "gene2" "gene3" "gene4" "gene5" "gene6" "gene7" "gene8" "gene9" "gene10" "gene11" "gene12"
## [13] "gene13" "gene14" "gene15" "gene16" "gene17" "gene18" "gene19" "gene20" "gene21" "gene22" "gene23" "gene24"
## [25] "gene25" "gene26" "gene27" "gene28" "gene29" "gene30" "gene31" "gene32" "gene33" "gene34" "gene35" "gene36"
## [37] "gene37" "gene38" "gene39" "gene40" "gene41" "gene42" "gene43" "gene44" "gene45" "gene46" "gene47" "gene48"
## [49] "gene49" "gene50"
colnames(tab)## [1] "cond1" "cond2" "cond3"
These are analogous to the names(ages) function we used in the first practical.
Extracting (either using positions or names) can also be used when we want to reorder our results.
tab[c("gene11","gene3") , c("cond3","cond1","cond2")]## cond3 cond1 cond2
## gene11 215 125 194
## gene3 175 165 195
For some classes of objects, there is a $ operator as a quick way of using labels to extract. For a data.frame you can use this for extracting all the values from a column. But be careful, because it does not work on a matrix.
head(tab$cond2)## [1] 233 179 195 190 231 142
head(tab$cond3)## [1] 182 216 175 188 194 197
You might see a lot of code with this syntax, because it is quicker to type.
Try it out on the mat object to see that it doesn’t work.
Probably the most flexible class in R is the list. It is a kind of vector, but at each position it allows us to store any other kind of object. For example, we can have a single number in position 1, a data.frame in position 2 and another list in position 3!
For now, I only want you to be aware of lists, since some functions we will be using during the course will use or return list-like objects. But do not worry if you feel this is beyond you. It is more important to be comfortable with the more basic classes of objects.
An example of a list:
myList <- list(first=1:5, second=c("Hello","everyone!"), third=head(tab))
class(myList)## [1] "list"
myList## $first
## [1] 1 2 3 4 5
##
## $second
## [1] "Hello" "everyone!"
##
## $third
## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
When printing out the content of a list that has names (as our example), R is reminding us that we can use the ‘$’ operator to access the content at each position. Shall we try it out?
myList$second## [1] "Hello" "everyone!"
A list behaves a bit like a data.frame in some cases.
Technically, a
data.frameis a special kind oflistwith vectors of the same length at each position.
See how the different ways in which we know how to extract values from an object can return slightly different results:
myList["first"]## $first
## [1] 1 2 3 4 5
myList$first## [1] 1 2 3 4 5
By now, you might not be surprised by the following:
mean(myList["first"])## Warning in mean.default(myList["first"]): argument is not numeric or logical: returning NA
## [1] NA
mean(myList$first)## [1] 3
Perhaps the class can clarify this mystery?
class(myList["first"])## [1] "list"
class(myList$first)## [1] "integer"
Analogous to what was happening with a data.frame, using ["first"] is returning a list even if it now only has one element. Since lists can have any sort of values, the mean() function does not work. We need to extract the actual values stored at that position to be able to correctly work with them. That is what we are doing when using the $ operator.
There is one more way of extracting the actual contents at one position in a list. Double brackets ([[]]) can be used with single numerical positions, or single labels, to extract the contents of a list-like object.
myList[[1]]## [1] 1 2 3 4 5
myList[["first"]]## [1] 1 2 3 4 5
To recap what we have seen in this practical, I will walk you through an exercise and the results you should get, but without showing some of the R code. So in some places you will have to think and try out your own R code if you want to get the same results.
We will work with the file ensembl_info.tab that contains real information about some genomes that have been sequenced.
I downloaded all the information in this file from the Ensembl database, which curates a lot of information about fully sequenced genomes. The main site contains mostly vertebrate genomes, together with some other model organisms.
Feel free to load the contents of this file into R by your preferred method. Make sure it has loaded correctly by exploring the contents briefly.
dim(ensTab)## [1] 25 10
Do you remember how to obtain all the names of the rows?
## [1] "Anole_lizard" "C_elegans" "Cat" "Chicken" "Chimpanzee" "Cow"
## [7] "Dog" "Dolphin" "Duck" "Elephant" "Fruitfly" "Fugu"
## [13] "Gorilla" "Horse" "Human" "Microbat" "Mouse" "Panda"
## [19] "Pig" "Platypus" "Rat" "Stickleback" "Tasmanian_devil" "Xenopus"
## [25] "Zebrafish"
You will see that the table contains information on 25 genomes, from the Anole_lizard to the Zebrafish.
head(ensTab)## coding_genes ncrna_genes pseudo_genes coding_gene_avg_length ncrna_gene_avg_length pseudo_gene_avg_length
## Anole_lizard 18595 7168 157 35332.78 10318.72 878.24
## C_elegans 20191 933 1922 3149.75 338.86 1474.68
## Cat 19563 2620 494 53520.62 119.25 4199.48
## Chicken 16779 1660 312 30172.87 98.55 12726.05
## Chimpanzee 23302 9702 485 48874.07 7235.28 820.99
## Cow 21861 3747 492 43952.40 114.05 3748.61
## coding_trans_avg_length ncrna_trans_avg_length pseudo_trans_avg_length genome_length
## Anole_lizard 2572.26 548.16 868.57 1799143587
## C_elegans 1412.30 257.29 899.83 100286401
## Cat 2190.87 119.25 881.91 2455541136
## Chicken 2420.99 98.55 2217.98 1230258557
## Chimpanzee 2674.48 334.79 761.29 3309577922
## Cow 2285.77 114.05 949.55 2670422299
The first three columns contain the total number of genes in each genome. Genes can be protein-coding (coding_genes), non-coding RNA (ncrna_genes) or pseudo-genes (pseudo_genes). The last column contains the genome length. The other columns contain the average length of different types of genes and transcripts (but we won’t be using these Today).
Did you see that the Human genome was in there? Let’s find out more about our genome. Can you extract the row for the Human genome?
## coding_genes ncrna_genes pseudo_genes coding_gene_avg_length ncrna_gene_avg_length pseudo_gene_avg_length
## Human 22534 7568 17085 62740.53 134.52 4200.8
## coding_trans_avg_length ncrna_trans_avg_length pseudo_trans_avg_length genome_length
## Human 3593.51 134.52 815.52 3096649726
Just how big is our genome? It’s sometimes difficult to read a large number.
The length of a genome is given by the number of nucleotide bases (A, C, G or Ts) of which it is composed. We tend to think in multiples of a million (1e6, megabases) or even thousand-millions (1e9, gigabases).
1e9 is the way in which we write scientific notation. It is equivalent to 1 x 109. Another example: 1e-3 is 1 x 10-3 or 0.001.
ensTab["Human","genome_length"]/1e6## [1] 3096.65
ensTab["Human","genome_length"]/1e9## [1] 3.09665
We could even print out this in a more informative manner, by pasting the value together with some text. New functions ahead!
print(paste("The Human genome is composed of", ensTab["Human","genome_length"]/1e9, "gigabases."))## [1] "The Human genome is composed of 3.096649726 gigabases."
Now, let’s look more carefully at the first three columns of the table:
## coding_genes ncrna_genes pseudo_genes
## Anole_lizard 18595 7168 157
## C_elegans 20191 933 1922
## Cat 19563 2620 494
## Chicken 16779 1660 312
## Chimpanzee 23302 9702 485
## Cow 21861 3747 492
## Dog 20199 2972 613
## Dolphin 16550 3790 925
## Duck 16765 477 455
## Elephant 20033 2644 568
## Fruitfly 13947 446 330
## Fugu 21383 1008 816
## Gorilla 21587 7760 522
## Horse 20912 1756 402
## Human 22534 7568 17085
## Microbat 19728 4408 1713
## Mouse 22198 11838 13946
## Panda 19343 3141 778
## Pig 21280 2173 1626
## Platypus 17417 7188 207
## Rat 22245 8736 1668
## Stickleback 20787 1617 52
## Tasmanian_devil 18788 1490 178
## Xenopus 19983 1364 81
## Zebrafish 30140 4814 325
In fact, let’s save those three columns as a new object called genes.
Perhaps we could try plotting this information, to be able to visually compare the number of genes of different types between the different genomes. Perhaps a barplot?
barplot(genes)Error in barplot.default(genes) : ‘height’ must be a vector or a matrix
Oh yes, what was the class?
## [1] "data.frame"
So, let’s convert it to a matrix, and check the class again.
## [1] "matrix" "array"
This should work now…
barplot(genes)Almost there. Right now the bars are stacked, but adding up the number of coding genes into one big bar is not very useful. The function barplot() gives us some options, which you can find in the help page. Let’s try out a few:
barplot(genes, beside=TRUE)Still, barplot() is grouping our values by column (type of gene), while it might be more informative in this case to group by row (species).
The matrix transpose function t() helps us rotate a matrix, so that it’s rows will become columns and vice-versa.
We can use it on the fly within another function, or save the transposed object with a new name; however you prefer.
barplot(t(genes), beside=TRUE, las=2)The
lasparamter is for label-style. It allows you to place the axes labels in different ways: parallel or perpendicular to the axes, or always horizontal/vertical. It takes values between 0-3 for these four options. There are many other graphical parameters that you can change, for which you can find help in?par.
Check out the ?barplot help page if you want to play around with different kinds of barplot.
The following is just another way in which you can plot the same information.