R is an open-source software environment for statistical computing. It is completely free and is available for Windows, Mac and Linux. You can find more information about R, and download it for your computer, at http://www.r-project.org/
We recommend working with R using an interface called RStudio that helps us organize our work and be more productive. This program is also freely available for all the usual operating systems: https://posit.co/download/rstudio-desktop/
In this session, we will focus on learning the basics of how to work in R.
When you start RStudio, you will see a window similar to the one below. The window will be pretty similar no matter which operating system you are using.
On the left side you will see a window called the Console, where we will be writing our commands to control R.
Within these tutorials, the code that you will need to type into R (or copy/paste in some cases) will be displayed as follows:
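For instance, a call such as this one would produce the message shown below (it is reconstructed from that output, so treat the exact wording as an assumption):

```r
# print() displays its argument in the Console
print("This function simply prints some text to your screen")
```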
If you copy/paste this code into your Console, R will return something like:
## [1] "This function simply prints some text to your screen"
In R, we will be using a lot of functions, which you
will recognize since they are a word followed by parentheses, just like
print() above. You can find out more about any function in
R in its help page, which you can access by typing either of the
following:
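As a sketch, these are the two standard ways of opening a help page in R, shown here for print():

```r
# Both lines open the same help page
?print
help(print)
```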
In RStudio, the requested help page will appear on the right. For now
just realize that you can ask for help, but we will not spend time
looking at the help page for print().
Among other things, R can be used like a calculator. We can do all the basic arithmetic:
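A minimal example, assuming a simple addition that gives the result shown below:

```r
# Addition works as on any calculator
2 + 1
```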
## [1] 3
In general, you can probably guess how to do most of these operations:
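The following sketch covers subtraction, multiplication, a function call and division. The specific numbers are assumptions, picked only so that the results match the output below:

```r
1 - 3      # subtraction
4 * 4      # multiplication
sqrt(16)   # calling a function on a number
28.6 / 2   # division
```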
## [1] -2
## [1] 16
## [1] 4
## [1] 14.3
Can you guess what the sqrt() function is for? If in
doubt, you can always look at the help page by typing ?sqrt
into the Console. Sometimes you will need to use quotes when asking for help, e.g.
?"+"
Right now, all these operations are being calculated and the result
is displayed on the console. To keep any of these results, we need to
save it to an object in R. We can use almost any name for
these objects, but a simple restriction is that it has to start with a
letter, not a number.
Also note that we cannot use spaces within the name of an object. Thus,
aNumber, Anumber, a_number are all valid, but a number would not work (R will think those are two separate objects: a and number). Neither does R let you use mathematical operators within the name of an object. In the following (wrong) examples R will think you are trying to calculate a mathematical operation on two separate objects:
a-number, a/number, a+number.
To save a value to an object we use <-, also called
an assignment operator.
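A short sketch of an assignment. The name a_number and the value 14.3 come from the examples in this tutorial; which expression originally produced that value is not shown, so a plain number is assumed here:

```r
# Store the value 14.3 in an object called a_number
a_number <- 14.3
```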
Although <- is perhaps the most common assignment operator, = can also be used: x = 2+1
To see what is stored in any object, simply type its name in R:
## [1] 14.3
You can also use
print(a_number).
When we save values like this we can then continue to use them in further calculations:
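For example, assuming a_number holds 14.3, a calculation like the following reuses the stored value. The formula is an illustrative guess that happens to match the result below:

```r
a_number <- 14.3    # repeated so the snippet runs on its own
a_number * 2 + 3    # use the stored value in a new calculation
```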
## [1] 31.6
One of the big advantages of R is that we can perform calculations on
many values at the same time. We first need to create an object that
contains several numbers, using the c() function. This
function is for combining.
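Going by the values printed below, the object can be created like this:

```r
# c() combines several values into a single object
ages <- c(40, 36, 10, 8, 1, 1, 88)
```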
To see the numbers we’ve just saved, just type ages
## [1] 40 36 10 8 1 1 88
and now, we can perform any calculation on all these numbers at once:
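The two results below correspond to adding 1 to every value and multiplying every value by 10, as in this sketch:

```r
ages <- c(40, 36, 10, 8, 1, 1, 88)  # same values as before

ages + 1    # adds 1 to each of the seven values
ages * 10   # multiplies each of the seven values by 10
```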
## [1] 41 37 11 9 2 2 89
## [1] 400 360 100 80 10 10 880
There are many available functions in R. Again, you recognize a function because it is a word, followed by parentheses. Always be careful to correctly open and close the parentheses or R will not know that you are trying to execute a function. We place values or arguments inside the parentheses, and the function will try to use them.
If you find you made a small typing mistake, or you would simply like to repeat one of the lines of code you typed in without having to type it all again, you can use the Up and Down arrows on your keyboard to go back through the history of your commands.
You can also use the Left/Right arrows to move the cursor in order to edit the code you typed before.
Try it out.
If you are trying to find the name of a particular function, you
might be able to find it using the ?? command in the
Console. This asks R to search for the following word within the
descriptions of the installed functions.
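For example, to look for functions related to histograms (the search term is only an illustration):

```r
# ?? searches the help pages of the installed packages
??histogram

# The same search written as a regular function call
help.search("histogram")
```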
Until now, you might think that writing code in R will be very tedious, having to type everything in the Console, and only being able to use the basic keyboard arrows to repeat and modify your previous attempts.
Nevertheless, there are many text editors to help you organize and save your R code. We will be learning to use the one in RStudio, but even the basic R installation comes with a useful text editor.
Within RStudio, you can use the menus to open a new R Script: File / New File / R Script. You can also click on the icon on the top left, that looks like a white sheet with a white-on-green cross, then select R Script.
In Windows or Linux you can also use the keyboard shortcut Ctrl+Shift+N, while on Mac the shortcut is ⇧⌘N.
Once you’ve opened a new R script, your RStudio interface should look something like this:
You can still type any
commands you want into the Console, but you can also type (and save
them!) in the R script editor above.
Most of what you will type into the editor will be R code, but you
can also type in text, for example as a reminder of what you are trying
to code. These bits of text are called comments, and in R you need to
place a # before them so R does not try to execute them as
R code. Try it out.
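A small illustration of comments in a script. The calculation and the object name total are arbitrary:

```r
# This whole line is a comment, so R ignores it
total <- 2 + 1   # a comment can also follow code on the same line
total
```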
You will see that there are several advantages of using this editor:
You can save the document (script), with all the code you have written. You can send this script to someone else and they can use it.
It performs automatic syntax highlighting. This means that comments or text will show in one color, numbers in another, as well as certain functions and symbols. This can make your code easier to follow, and to find mistakes.
You can move around the text editor as you might expect, using the arrow keys or the mouse, and you can copy and paste text as usual.
Finally, it is connected to the Console in an interesting way: you can send any bit of code in the editor to the Console, where it will automatically be executed.
This last point is perhaps the most important. Any line of code on which your cursor is currently placed, can be sent to the Console with Ctrl+Enter in Windows/Linux or ⌘Return in Mac. This allows you to type in code directly in the editor, and with a simple keyboard combination, you can test out this code in the Console. If it works, great, you keep typing code. If there is a mistake, you will see an Error and you can modify the code until you get it right. You can also select several lines, or even all the code in the file, and send it in this way to the Console.
This will be useful later on, but if you do want to execute all the code you have in a script, it is easier to use the Source button in the editor. The keyboard shortcut is Ctrl+Shift+S for Windows/Linux and ⇧⌘S for Mac.
Try using the editor, repeating some of the simple operations we have been doing. You are free to use any combination that suits you, perhaps trying some code out by typing directly into the Console, and then typing the final version into the editor. Do try to remember though, that what you save in the editor should be code that works. There is no point saving all the trial and error you have gone through (the Console is better for that). If you do want to keep a bit of code with an error (perhaps as a reminder of what doesn’t work), you can place it in the editor but with a comment (#) at the start.
From now on, try using the editor more and more, so you get used to it.
Until now, we have only been using numeric values. Nevertheless, R
can use other types of values. For example, we can use text values, as
long as they are enclosed in quotes. We can save a series of text values
in a new object, the same way as we did for ages.
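A sketch using the names that appear as labels further below. The object name people is an assumption made for this example:

```r
# Text values must be enclosed in quotes
people <- c("Homer", "Marge", "Bart", "Lisa", "Maggie", "Snowball", "Abraham")
people
```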
It can be practical in many situations to add a text label to each of
our numbers (e.g. the name of the gene alongside the expression value).
We can use the names() function to add text labels onto any
other object:
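In this sketch, people is a hypothetical name for the vector of text labels, and both objects are defined again so the snippet runs on its own:

```r
ages <- c(40, 36, 10, 8, 1, 1, 88)
people <- c("Homer", "Marge", "Bart", "Lisa", "Maggie", "Snowball", "Abraham")

# Attach one label to each value, in the same order
names(ages) <- people
```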
and if we now ask to see the content of ages, we will
see that each value has its corresponding label:
## Homer Marge Bart Lisa Maggie Snowball Abraham
## 40 36 10 8 1 1 88
Our ages object has several values. In R, we call these
kinds of objects vectors. We have already seen that we can
perform mathematical operations on all the values of a vector at the
same time. But sometimes, we want to manipulate only part of the vector.
To extract part of a vector, we use square brackets. For example:
To get the first value from the vector:
## Homer
## 40
and to get the Fifth Element:
## Maggie
## 1
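Both lookups can be written with the position inside square brackets. The labelled ages vector is rebuilt here in a single step so the snippet runs on its own:

```r
# ages with its labels
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

ages[1]   # the value in position 1
ages[5]   # the value in position 5
```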
We can also get more than one value from a vector at once. For this,
we need to tell R the positions or indexes of the values we are
interested in. Remember that when we want to combine several values (in
this case the indexes) we need to use the c() function. So,
this is how it works:
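A sketch matching the output below, with the labelled vector defined again so that it runs on its own:

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

c(1, 5)         # the two positions we are interested in
ages[c(1, 5)]   # the values stored at those positions
```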
## [1] 1 5
## Homer Maggie
## 40 1
If we want to get all the values from a series of consecutive
positions, we can use the abbreviation first:last, as
follows:
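For positions 2 to 5, which is the range shown in the output below, this looks like:

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

2:5         # shorthand for c(2, 3, 4, 5)
ages[2:5]   # the values in positions 2 to 5
```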
## [1] 2 3 4 5
## Marge Bart Lisa Maggie
## 36 10 8 1
Finally, if we use negative indexes R understands that we would like to remove these values from the vector, and only see what is left:
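A sketch that drops positions 1 and 5, matching the output below:

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

-c(1, 5)         # the negative indexes
ages[-c(1, 5)]   # everything except positions 1 and 5
```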
## [1] -1 -5
## Marge Bart Lisa Snowball Abraham
## 36 10 8 1 88
Exercises:
Sometimes we want to ask something about our numbers, for
instance, which genes have higher expression than a certain value. This
kind of question can be answered using relational operators (greater
than, less than, equal to, etc). Let’s use our ages object
to try this out.
Who is older than 10?
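In code, the question becomes a comparison applied to the whole vector. The output below shows the ages first and then the answer:

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

ages        # the stored values
ages > 10   # TRUE where the value is greater than 10
```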
## Homer Marge Bart Lisa Maggie Snowball Abraham
## 40 36 10 8 1 1 88
## Homer Marge Bart Lisa Maggie Snowball Abraham
## TRUE TRUE FALSE FALSE FALSE FALSE TRUE
In the same way, we can ask which of our stored values are less than, greater than or equal to, equal to, or not equal to a given value:
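The four results below are consistent with the following comparisons. The thresholds are inferred from the TRUE/FALSE patterns, so treat them as a reconstruction:

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

ages < 10    # less than 10
ages >= 10   # greater than or equal to 10
ages == 1    # equal to 1 (note the double equals sign)
ages != 1    # not equal to 1
```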
## Homer Marge Bart Lisa Maggie Snowball Abraham
## FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## Homer Marge Bart Lisa Maggie Snowball Abraham
## TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## Homer Marge Bart Lisa Maggie Snowball Abraham
## FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## Homer Marge Bart Lisa Maggie Snowball Abraham
## TRUE TRUE TRUE TRUE FALSE FALSE TRUE
In all these cases R returns a series of TRUE/FALSE values, or logical values. These logical vectors can be quite practical. Let’s use a couple of examples:
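Two common uses, sketched here to match the output below: counting how many values pass a test (each TRUE counts as 1 when summed), and keeping only the values that pass it:

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

sum(ages > 10)    # how many values are greater than 10
ages[ages > 10]   # only the values that are greater than 10
```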
## [1] 3
## Homer Marge Abraham
## 40 36 88
Exercises:
mean of all ages?
R is very popular for the quality and variety of figures you can
make. Once we have our data in an R object, it is relatively easy to
make figures with them. For example, we can simply plot our
ages object:
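The simplest version is a single call to plot():

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

# Draws each value against its position in the vector
plot(ages)
```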
and get something like this:
Some other basic plots:
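Two candidates from base R are sketched here. Which plots the tutorial originally showed is an assumption:

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

barplot(ages)   # one bar per value, labelled with its name
hist(ages)      # how many values fall into each range
```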
All these functions have their own help page. Remember you can access
them with ?hist for example. There is also a lot of
information about what parameters you can change in ?par.
It will take some time getting used to all of them. For now, let’s see
another example, of medium complexity:
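The histogram call used in the saving example further below fits this description. It sets a fill color, a title and a y-axis label:

```r
ages <- c(Homer = 40, Marge = 36, Bart = 10, Lisa = 8,
          Maggie = 1, Snowball = 1, Abraham = 88)

# col = fill color, main = plot title, ylab = y-axis label
hist(ages, col = "skyblue", main = "My blue histogram",
     ylab = "Number of individuals")
```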
When you are ready to save one of your plots, you can do so in several ways. For now, we will see how to do it using R code (instead of RStudio’s menus and buttons). For instance, to save the last plot we made, we would do the following:
pdf("histogram.pdf")
hist(ages, col="skyblue", main="My blue histogram", ylab="Number of individuals")
dev.off()
We need to start with the function pdf(), which opens
the file in our “Working Directory”. From then on, every plotting
function we use will send the plot to this file, instead of our screen.
Once we have finished, we need to properly close this file, and for that
we need to use the dev.off() function.
You may be wondering what exactly this “Working Directory” might be.
R always has a current working directory in which it stores files and
from which it can also read files. You can change this working directory
whenever you want, but it is useful to first find out what it is, using
the getwd() function (get working directory)
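The call takes no arguments:

```r
# Print the directory R is currently working in
getwd()
```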
You should see something like:
## [1] "/Users/cei/Documents"
This means that R is working inside that directory, and should have stored your histogram.pdf file there. Make sure you are aware which directory YOUR R is using right now. Can you use your normal operating system file browser to go to that directory and see if you can find and open the histogram you just saved?
You can also take advantage of the RStudio interface, in the Files tab of the lower right hand section. There you will be able to click to “Go To Working Directory”. You can also navigate your directories using this interface, create “New Folders” and then select one to “Set As Working Directory”. Some of these options are shown in the following screen capture:
When you try these options out, you will see that RStudio is simply producing R code according to your wishes, and sending it to the Console for execution. This is similar to what you do when you type your code into a script, and can be a nice reminder of the real R code.
Finally, getting back to using pure R code, it can be quite useful to list files that are in your working directory:
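With no arguments, list.files() lists the contents of the current working directory:

```r
# Names of the files and folders in the working directory
list.files()
```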
Can you find your histogram.pdf file in these results?
We really do NOT want to be putting our values into R by hand. If we already have a series of numbers in a text file, we can import them into R quite easily. For this exercise, download the following file num.txt. You can do this by right-clicking on the link in your web browser, and selecting the option to download the file. Depending on your web browser this could be “Save Link As…”, “Save Target As…”, “Download Linked File As…”. Make sure to save the file in your R working directory (see above).
Check that you can see the num.txt file among the files in your current directory, for example with the list.files() function.
If for whatever reason the file you downloaded has a slightly different name, use that exact name in the following R code (remember R is case-sensitive).
For importing the contents of the file, we can use the
scan() function.
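A minimal sketch of the import. So that it runs anywhere, it first writes a small stand-in file containing a few of the values shown below into a temporary directory. The object names num_file and num are assumptions. With the real download in your working directory, the call would simply be scan("num.txt").

```r
# Stand-in for the downloaded file (first six values only)
num_file <- file.path(tempdir(), "num.txt")
writeLines(c("252 141 165", "174 192 225"), num_file)

# scan() reads whitespace-separated numbers into a vector
num <- scan(num_file)
num
```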
## [1] 252 141 165 174 192 225 176 191 229 170 125 229 176 190 189 239 170 233 180 219 229 223 203 161 247 199 224 260 225
## [30] 327 171 244 170 115 272 131 240 173 279 85 175 231 198 214 219 235 171 315 215 206
To finish this tutorial, let’s look at an example that is a bit more realistic. The file tab.txt contains a table of expression values for 50 genes (rows) across 3 conditions (columns). Download and save this file in your working directory.
Files that are properly formatted as tables (same number of columns
for every row) are easy to read into R. We can use the
read.table() function:
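The sketch below writes a stand-in file with the ten rows shown in the output, then reads it back. The layout of the real tab.txt is an assumption: a header line with one field fewer than the data lines, which makes read.table() use the first column as row names. The object names tab_file and tab are also assumptions.

```r
# Stand-in for the downloaded file (first ten genes only)
tab_file <- file.path(tempdir(), "tab.txt")
writeLines(c(
  "cond1 cond2 cond3",
  "gene1 252 233 182",
  "gene2 141 179 216",
  "gene3 165 195 175",
  "gene4 174 190 188",
  "gene5 192 231 194",
  "gene6 225 142 197",
  "gene7 176 190 218",
  "gene8 191 210 175",
  "gene9 229 162 191",
  "gene10 170 112 192"
), tab_file)

# The header and row names are detected from the file layout
tab <- read.table(tab_file)
head(tab, 10)   # show the first ten rows
```

With the real file in your working directory, the call would be read.table("tab.txt").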
## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
## gene6 225 142 197
## gene7 176 190 218
## gene8 191 210 175
## gene9 229 162 191
## gene10 170 112 192
One way of visualizing data in a table is with a
boxplot(). This type of plot gives us an overview of the
distribution of numerical values in each column of our table.
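A sketch using a stand-in for the first ten rows of the table, built by hand so the snippet runs without the file. The object name tab is an assumption:

```r
# Stand-in for the first ten genes of the table
tab <- data.frame(
  cond1 = c(252, 141, 165, 174, 192, 225, 176, 191, 229, 170),
  cond2 = c(233, 179, 195, 190, 231, 142, 190, 210, 162, 112),
  cond3 = c(182, 216, 175, 188, 194, 197, 218, 175, 191, 192),
  row.names = paste0("gene", 1:10)
)

# One box per column (condition)
boxplot(tab)
```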
We had already used square brackets to extract (with positive indexes) or remove (with negative indexes) part of a vector. We can do the same with tables. But, a table has two dimensions: rows and columns. In these cases, the square brackets will accept two series of indexes, separated by a comma:
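The two results below are consistent with the following calls, where rows come before the comma and columns after it. The table is a hand-built stand-in for the first ten genes, and the name tab is an assumption:

```r
tab <- data.frame(
  cond1 = c(252, 141, 165, 174, 192, 225, 176, 191, 229, 170),
  cond2 = c(233, 179, 195, 190, 231, 142, 190, 210, 162, 112),
  cond3 = c(182, 216, 175, 188, 194, 197, 218, 175, 191, 192),
  row.names = paste0("gene", 1:10)
)

tab[c(1, 6, 10), c(1, 3)]   # rows 1, 6 and 10; columns 1 and 3
tab[1:5, 1:3]               # rows 1 to 5; columns 1 to 3
```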
## cond1 cond3
## gene1 252 182
## gene6 225 197
## gene10 170 192
## cond1 cond2 cond3
## gene1 252 233 182
## gene2 141 179 216
## gene3 165 195 175
## gene4 174 190 188
## gene5 192 231 194
We can also specify indexes in just one of the dimensions. Just remember to always use the comma so that it becomes obvious which dimension you are restricting.
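For example, leaving one side of the comma empty keeps everything in that dimension. This sketch again uses a hand-built stand-in table under the assumed name tab:

```r
tab <- data.frame(
  cond1 = c(252, 141, 165, 174, 192, 225, 176, 191, 229, 170),
  cond2 = c(233, 179, 195, 190, 231, 142, 190, 210, 162, 112),
  cond3 = c(182, 216, 175, 188, 194, 197, 218, 175, 191, 192),
  row.names = paste0("gene", 1:10)
)

tab[1:3, ]        # rows 1 to 3, all columns
tab[, c(1, 3)]    # all rows, columns 1 and 3
```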
Exercises:
The following exercises are provided to give you more practice with the R skills you have learnt in this practical. Take your time, and make sure you are not simply asking Google or an AI for the solution. Every question can be answered with what has been covered in this practical.
Start a new R script for each exercise. Add as many comments as you want (lines beginning with a #) to make your code easy to follow. At the very least, add commented lines at the start describing the exercise.
You can test as much code as you want in the Console, but make sure that the code that is saved in your script is error-free.
Use the data from tab.txt to answer each of the questions/exercises. Remember that each of your scripts will need to read these data first!
Obtain the minimum and the maximum values of gene expression for each of the three conditions.
Make a PDF file with three histograms (one for each of the three conditions). Make sure that you can tell which histogram corresponds to which condition.
Create two smaller tables, one with only genes 10-12 and another
with genes 30-40. Save these two tables into new files (hint:
check out ?write.table)
In each of the three conditions, how many genes have expression values greater than 250?
Independent learning:
In many cases you will come across a new function that you think might be useful, but that you do not know how to use. The following exercise is an example of this. To complete it, feel free to use Google, ask an AI or someone who you know already uses R. But try to make sure that at the end you understand how to use the function.
The apply() function can be used to apply any
function (such as sum, mean, max,
etc) to all the rows or all the columns of a table, using just one line
of code. Try to figure out how you can use this function to calculate
the mean expression for every gene in the table we have
been working with. Can you then figure out how to obtain the
mean expression for each condition? The information is
available within the help page for ?apply, although it
takes some practice to understand how to interpret these help
pages.