From the first tutorial you should already have a good idea of how to work in R: how to do basic arithmetic, use some functions, how to work with multiple values at the same time, how to read files and how to do some basic plots. What you need now is a bit more practice, more “R vocabulary” (new functions), and a bit more understanding about how R is working internally to follow your wishes.
Some of you might have already come across problems that had to do with restrictions that R imposes according to the class of object you are working with.
For all of us to be on the same page, let’s load up a table in R and
try to do something simple, like calculating the mean() of
one of the rows.
Let me introduce a new trick that R can do: read files directly over the internet. If for any reason the following does not work for you, you can always read the file just as you did in the first tutorial. So, we will work with the same file that you already know: tab.txt.
Instead of downloading and saving in your working directory,
sometimes it is easier to copy the link and give it directly to the
read.table() function. To do this, simply right-click on
the link, select “Copy Link Address” or “Copy Link” (depending on your
browser) and then paste into your R code (between
quotes!). It should look something like this:
Note that the https link in your case can look different. This is perfectly ok, as long as you have obtained it by right-click on the tab.txt link above.
To make sure it worked, you can use some new functions: the first prints out the dimensions of the object (# rows and # columns for a table), the others allow you to see the first part (head) or last part (tail) of the object.
## [1] 50 3
## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
## cond1 cond2 cond3
## gene45 219 208 170
## gene46 235 157 199
## gene47 171 266 210
## gene48 315 195 171
## gene49 215 198 229
## gene50 206 210 191
Does everything look ok? Great, let’s continue.
Working as we did in the first tutorial, it’s pretty easy to
calculate the mean() of the first column:
## [1] 204.44
but, the mean() of the first row doesn’t seem to
work:
## Warning in mean.default(tab[1, ]): argument is not numeric or logical: returning NA
## [1] NA
This is because tab is an object of class
data.frame. You can check out the class of any object with
the class() function.
## [1] "data.frame"
The data.frame is a very flexible class of object. It
allows us to have any kind of values in each of the columns. The only
limitations is that each column has to be of the same length, which is
generally true when we have what we usually call a table.
Another way of thinking about this is that our data are in the columns, and each column stores a different property (different type of data) of the things we are studying (genes, insects, leafs, portions of land, etc). For instance, we might have collected information about many animals of a certain species, and in one column we can have the weight of each animal, in the next column the length, then the sex, the age, the colour. If you imagine an example like this, you will see why it might not be a good idea to allow calculations by row.
To be even more explicit: does it make sense to calculate the average between the weight, the length, the sex, the age and the colour of one animal? On the other hand, calculating the average of the weights of all animals sounds perfectly reasonable!
What happens when we know that all the values in our table are
compatible with operations like the mean()?
In our tab object, we know that all the values are
measurements of gene expression, so it makes sense to be able to
calculate the average expression across all genes in one condition, or
to calculate the average expression of one gene across all
conditions.
If we want all values in a table to be equal, what we need is an
object of class matrix. In R, we can convert objects
between different classes by using a series of as.X()
functions. In this case:
Another, more generic, way of converting classes is to use the
as()function and the new class as the second argument:as(tab, "matrix")
We now have two different objects: tab and
mat. They contain the same information, but are stored as
objects with a different class. Let’s have a closer look at them.
## [1] 50 3
## [1] 50 3
## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
## [1] "data.frame"
## [1] "matrix" "array"
They look identical, except when we come to the class
property. And this is exactly what allows us to now be able to extract
information from rows and operate on them as we wanted to initially.
## cond1 cond2 cond3
## 252 233 182
## [1] 222.3333
You might also notice that extracting a row for these two classes of objects gives you a slightly different result:
## cond1 cond2 cond3
## gene1 252 233 182
## cond1 cond2 cond3
## 252 233 182
## [1] "data.frame"
## [1] "integer"
For a data.frame, extracting a single row still gives
you a data.frame that now has one row and the same number
of columns (use the dim() function to verify this).
Extracting a row of a matrix gives you the actual values as
a plain vector (integer values in this case).
We have already been using vectors, such as our ages
object we created by combining several numbers together. A vector is a
kind of object in R that only has one dimension: it allows values to be
stored in positions 1, 2, 3… n, where n is the length
of the vector. Vectors can be of different classes: we can have
integer or numeric (floating-point) numbers,
character for text, logical for TRUE/FALSE values. But
every value stored at each position of one vector has to be of the same
class.
In our previous example, it was quite obvious that the
class of the object could interfere with our work: the
mean() function simply did not work on a
data.frame. But the class can also interfere
in less obvious ways.
Let us make a simple vector with a run of numbers:
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The
from:tooperator allows us to quickly produce a range of values betweenfromandto. A more flexible function isseq(). Check out the help for that function if you are curious.
We can always check the class() of the resulting
object:
## [1] "integer"
And we can convert the object to a different compatible
class. For example, we can turn our numbers into text.
## [1] "character"
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
In this case, it does become obvious that our numbers are no longer numbers, because R puts them between quotes.
You probably already guessed this, but we can no longer perform
certain operations on the new y object. Some might even
give you an Error.
Other functions might give you unexpected results:
## [1] "9"
## [1] "1" "10" "11" "12" "13" "14" "15" "2" "3" "4" "5" "6" "7" "8" "9"
If at any point we have numbers that are stored as text
(character), we can always change them back into
numbers.
## [1] "integer"
## [1] 1 10 11 12 13 14 15 2 3 4 5 6 7 8 9
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
You might have wondered if there is a way of using the labels
(names) to extract values from our R objects. Well, there
is. Instead of using the numerical position (index) that we want, if the
object has names we can use them instead. Let’s go back to our
tab object and see some examples.
## cond1 cond2 cond3
## gene22 223 186 191
## [1] 186
## cond1 cond2 cond3
## gene3 165 195 175
## gene11 125 194 215
This can be especially convenient when we have large objects where it is difficult to see which positions we want, but where the labels are meaningful.
For 2-dimensional objects (e.g. data.frame or
matrix), you can access (or overwrite) the names by using
two new functions:
## [1] "gene1" "gene2" "gene3" "gene4" "gene5" "gene6" "gene7" "gene8" "gene9" "gene10" "gene11" "gene12"
## [13] "gene13" "gene14" "gene15" "gene16" "gene17" "gene18" "gene19" "gene20" "gene21" "gene22" "gene23" "gene24"
## [25] "gene25" "gene26" "gene27" "gene28" "gene29" "gene30" "gene31" "gene32" "gene33" "gene34" "gene35" "gene36"
## [37] "gene37" "gene38" "gene39" "gene40" "gene41" "gene42" "gene43" "gene44" "gene45" "gene46" "gene47" "gene48"
## [49] "gene49" "gene50"
## [1] "cond1" "cond2" "cond3"
These are analogous to the names(ages) function we used
in the first tutorial.
Extracting (either using positions or names) can also be used when we want to reorder our results.
## cond3 cond1 cond2
## gene11 215 125 194
## gene3 175 165 195
For some classes of objects, there is a $ operator as a
quick way of using labels to extract. For a data.frame you
can use this for extracting all the values from a column. But be
careful, because it does not work on a matrix.
## [1] 233 179 195 190 231 142
## [1] 182 216 175 188 194 197
You might see a lot of code with this syntax, because it is quicker to type.
Try it out on the mat object to see that it doesn’t
work.
Probably the most flexible class in R is the list. It is
a kind of vector, but at each position it allows us to store any other
kind of object. For example, we can have a single number in position 1,
a data.frame in position 2 and another list in
position 3!
For now, I only want you to be aware of lists, since some functions we will be using during the course will use or return list-like objects. But do not worry if you feel this is beyond you. It is more important to be comfortable with the more basic classes of objects.
An example of a list:
## [1] "list"
## $first
## [1] 1 2 3 4 5
##
## $second
## [1] "Hello" "everyone!"
##
## $third
## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
When printing out the content of a list that has names (as our example), R is reminding us that we can use the ‘$’ operator to access the content at each position. Shall we try it out?
## [1] "Hello" "everyone!"
A list behaves a bit like a data.frame in
some cases.
Technically, a
data.frameis a special kind oflistwith vectors of the same length at each position.
See how the different ways in which we know how to extract values from an object can return slightly different results:
## $first
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
By now, you might not be surprised by the following:
## Warning in mean.default(myList["first"]): argument is not numeric or logical: returning NA
## [1] NA
## [1] 3
Perhaps the class can clarify this mystery?
## [1] "list"
## [1] "integer"
Analogous to what was happening with a data.frame, using
["first"] is returning a list even if it now
only has one element. Since lists can have any sort of values, the
mean() function does not work. We need to extract the
actual values stored at that position to be able to correctly work with
them. That is what we are doing when using the $
operator.
There is one more way of extracting the actual contents at one
position in a list. Double brackets ([[]]) can be used with
single numerical positions, or single labels, to extract the contents of
a list-like object.
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
To recap what we have seen in this tutorial, I will walk you through an exercise and the results you should get, but without showing some of the R code. So in some places you will have to think and try out your own R code if you want to get the same results.
We will work with the file ensembl_info.tab that contains real information about some genomes that have been sequenced.
I originally downloaded all the information in this file from the Ensembl database, which curates a lot of information about fully sequenced genomes. The main site contains mostly vertebrate genomes, together with some other model organisms.
Feel free to load the contents of this file into R by your preferred method. Remember that one way is to right-click on the link, select “Copy Link Address” or “Copy Link” (depending on your browser) and then paste into your R code.
Make sure it has loaded correctly by exploring the contents briefly.
## [1] 25 10
Do you remember how to obtain all the names of the rows?
## [1] "Anole_lizard" "C_elegans" "Cat" "Chicken" "Chimpanzee" "Cow"
## [7] "Dog" "Dolphin" "Duck" "Elephant" "Fruitfly" "Fugu"
## [13] "Gorilla" "Horse" "Human" "Microbat" "Mouse" "Panda"
## [19] "Pig" "Platypus" "Rat" "Stickleback" "Tasmanian_devil" "Xenopus"
## [25] "Zebrafish"
You will see that the table contains information on 25 genomes, from the Anole_lizard to the Zebrafish.
## coding_genes ncrna_genes pseudo_genes coding_gene_avg_length ncrna_gene_avg_length pseudo_gene_avg_length
## Anole_lizard 18595 7168 157 35332.78 10318.72 878.24
## C_elegans 20191 933 1922 3149.75 338.86 1474.68
## Cat 19563 2620 494 53520.62 119.25 4199.48
## Chicken 16779 1660 312 30172.87 98.55 12726.05
## Chimpanzee 23302 9702 485 48874.07 7235.28 820.99
## Cow 21861 3747 492 43952.40 114.05 3748.61
## coding_trans_avg_length ncrna_trans_avg_length pseudo_trans_avg_length genome_length
## Anole_lizard 2572.26 548.16 868.57 1799143587
## C_elegans 1412.30 257.29 899.83 100286401
## Cat 2190.87 119.25 881.91 2455541136
## Chicken 2420.99 98.55 2217.98 1230258557
## Chimpanzee 2674.48 334.79 761.29 3309577922
## Cow 2285.77 114.05 949.55 2670422299
The first three columns contain the total number of genes in each genome. Genes can be protein-coding (coding_genes), non-coding RNA (ncrna_genes) or pseudo-genes (pseudo_genes). The last column contains the genome length. The other columns contain the average length of different types of genes and transcripts (but we won’t be using these Today).
Did you see that the Human genome was in there? Let’s find out more about our genome. Can you extract the row for the Human genome?
## coding_genes ncrna_genes pseudo_genes coding_gene_avg_length ncrna_gene_avg_length pseudo_gene_avg_length
## Human 22534 7568 17085 62740.53 134.52 4200.8
## coding_trans_avg_length ncrna_trans_avg_length pseudo_trans_avg_length genome_length
## Human 3593.51 134.52 815.52 3096649726
Just how big is our genome? It’s sometimes difficult to read a large number.
The length of a genome is given by the number of nucleotide bases (A, C, G or Ts) of which it is composed. We tend to think in multiples of a million (1e6, megabases) or even thousand-millions (1e9, gigabases).
1e9 is the way in which we write scientific notation. It is equivalent to 1 x 109. Another example: 1e-3 is 1 x 10-3 or 0.001.
## [1] 3096.65
## [1] 3.09665
We could even print out this in a more informative manner, by pasting the value together with some text. New functions ahead!
print(paste("The Human genome is composed of", speciesTab["Human","genome_length"]/1e9, "gigabases."))## [1] "The Human genome is composed of 3.096649726 gigabases."
Now, let’s look more carefully at the first three columns of the table:
## coding_genes ncrna_genes pseudo_genes
## Anole_lizard 18595 7168 157
## C_elegans 20191 933 1922
## Cat 19563 2620 494
## Chicken 16779 1660 312
## Chimpanzee 23302 9702 485
## Cow 21861 3747 492
## Dog 20199 2972 613
## Dolphin 16550 3790 925
## Duck 16765 477 455
## Elephant 20033 2644 568
## Fruitfly 13947 446 330
## Fugu 21383 1008 816
## Gorilla 21587 7760 522
## Horse 20912 1756 402
## Human 22534 7568 17085
## Microbat 19728 4408 1713
## Mouse 22198 11838 13946
## Panda 19343 3141 778
## Pig 21280 2173 1626
## Platypus 17417 7188 207
## Rat 22245 8736 1668
## Stickleback 20787 1617 52
## Tasmanian_devil 18788 1490 178
## Xenopus 19983 1364 81
## Zebrafish 30140 4814 325
In fact, let’s save those three columns as a new object called
genes.
Perhaps we could try plotting this information, to be able to visually compare the number of genes of different types between the different genomes. Perhaps a barplot?
Error in barplot.default(genes) : ‘height’ must be a vector or a matrix
Oh yes, what was the class?
## [1] "data.frame"
So, let’s convert it to a matrix, and check the class again.
## [1] "matrix" "array"
This should work now…
Almost there. Right now the bars are stacked, but adding up
the number of coding genes into one big bar is not very useful. The
function barplot() gives us some options, which you can
find in the help page. Let’s try out a few:
Still, barplot() is grouping our values by column (type
of gene), while it might be more informative in this case to group by
row (species).
The matrix transpose function t() helps us
rotate a matrix, so that it’s rows will become columns and
vice-versa.
We can use it on the fly within another function, or save the transposed object with a new name; however you prefer.
The
lasparameter is for label-style. It allows you to place the axes labels in different ways: parallel or perpendicular to the axes, or always horizontal/vertical. It takes values between 0-3 for these four options. There are many other graphical parameters that you can change, for which you can find help in?par.
Check out the ?barplot help page if you want to play
around with different kinds of barplot.
Independent learning:
The following plot is just another way in which you can show the same
information. I have simply added more arguments to the previous
barplot command. Spend some time trying to understand the
information available in the ?barplot help page. You can
also use Google or ask an AI for help, but as always, make sure you
understand the code that you are using.