Learning to use R

About R

R is an open-source software environment for statistical computing. It is completely free and is available for Windows, Mac and Linux. You can find more information about R, and download it for your own computer, at http://www.r-project.org/

We recommend working with R using an interface called RStudio that helps us organize our work and be more productive. This program is also freely available for all the usual operating systems: https://posit.co/download/rstudio-desktop/

In this session, we will focus on learning the basics of how to work in R.

Starting R

When you start RStudio, you will see a window similar to the one below. The window will be pretty similar no matter which operating system you are using.

On the left side you will see a window called the Console, where we will be writing our commands to control R.

Within these tutorials, the code that you will need to type into R (or copy/paste in some cases) will be displayed as follows:

print("This function simply prints some text to your screen")

If you copy/paste this code into your Console, R will return something like:

## [1] "This function simply prints some text to your screen"

In R, we will be using a lot of functions, which you will recognize since they are a word followed by parentheses, just like print() above. You can find out more about any function in R in its help page, which you can access by typing either of the following:

?print

help(print)

In RStudio, the requested help page will appear on the right. For now just realize that you can ask for help, but we will not spend time looking at the help page for print().

Mathematical operations and saving results

Among other things, R can be used like a calculator. We can do all the basic arithmetic:

2+1

## [1] 3

In general, you can probably guess how to do most of these operations:

3-5

## [1] -2

4*4

## [1] 16

16/4

## [1] 4

7.15 * sqrt(4)

## [1] 14.3

What do you think the sqrt() function is for? If in doubt, you can always look at the help page by typing ?sqrt into the Console.

Sometimes you will need to use quotes when asking for help, e.g. ?"+"

Right now, all these operations are being calculated and the result is displayed on the console. To keep any of these results, we need to save it to an object in R. We can use almost any name for these objects, with one simple restriction: the name has to start with a letter, not a number.

Also note that we cannot use spaces within the name of an object. Thus, aNumber, Anumber, a_number are all valid, but a number would not work (R will think those are two separate objects: a and number).

Neither does R let you use mathematical operators within the name of an object. In the following (wrong) examples R will think you are trying to calculate a mathematical operation on two separate objects: a-number, a/number, a+number.
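The naming rules above can be checked with a short sketch. The object names and values here are purely illustrative:

```r
# Valid names: no spaces, no operators, starting with a letter
aNumber  <- 1
a_number <- 2

# R is case-sensitive, so Anumber is a different object from aNumber
Anumber <- 10
aNumber == Anumber   # FALSE: two separate objects with different values

# An operator inside a "name" is read as a calculation on two objects
a <- 5
number <- 3
a - number           # 2, not an object called "a-number"
```

The assignment operator <- used here is explained in the next paragraph.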

To save a value to an object we use <-, also called an assignment operator.

x <- 2+1

a_number <- 7.15 * sqrt(4)

Although the <- is perhaps the most common assignment operator, the = can also be used: x = 2+1

To see what is stored in any object, simply type its name in R:

a_number

## [1] 14.3

You can also use print(a_number).

When we save values like this we can then continue to use them in further calculations:

x + 2 * a_number

## [1] 31.6
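Note that R follows the usual order of operations here, so the multiplication is done before the addition. A small sketch, repeating the two assignments from above so it can be run on its own, shows how parentheses change the result:

```r
x <- 2 + 1
a_number <- 7.15 * sqrt(4)

x + 2 * a_number     # multiplication first: 3 + 28.6 = 31.6
(x + 2) * a_number   # parentheses first:    5 * 14.3 = 71.5
```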

One of the big advantages of R is that we can perform calculations on many values at the same time. We first need to create an object that contains several numbers, using the c() function. This function combines the values we give it into a single object.

ages <- c(40,36,10,8,1,1,88)

To see the numbers we’ve just saved, type ages:

ages

## [1] 40 36 10  8  1  1 88

and now, we can perform any calculation on all these numbers at once:

ages + 1

## [1] 41 37 11  9  2  2 89

ages * 10

## [1] 400 360 100  80  10  10 880
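The same idea extends to calculations between two objects of the same length, which R carries out position by position. In this sketch, years_to_add is a hypothetical second set of numbers made up for illustration:

```r
ages <- c(40, 36, 10, 8, 1, 1, 88)

# Hypothetical values: years to add to each age, position by position
years_to_add <- c(1, 1, 5, 5, 10, 10, 0)

ages + years_to_add   # 41 37 15 13 11 11 88
ages / 2              # every age halved
sqrt(ages)            # many functions also work on each value in turn
```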

There are many available functions in R. Again, you can recognize a function because it is a word followed by parentheses. Always be careful to correctly open and close the parentheses, or R will not know that you are trying to execute a function. We place values or arguments inside the parentheses, and the function will try to use them.

sum(ages)

mean(ages)

max(ages)

min(ages)

range(ages)

sort(ages)

unique(ages)

Can you guess what each of the functions does? If not, simply try them out!

If you find you made a small typing mistake, or you would simply like to repeat one of the lines of code you typed in without having to type it all again, you can use the Up and Down arrows on your keyboard to go back through the history of your commands.

You can also use the Left/Right arrows to move the cursor in order to edit the code you typed before.

Try it out.

If you are trying to find the name of a particular function, you might be able to find it using the ?? command in the Console. This asks R to search for the word that follows it within the descriptions of the installed functions.

??deviation

Can you find the function to calculate the “standard deviation”?

Writing R code in a text editor (R Scripts)

Until now, you might think that writing code in R will be very tedious, having to type everything in the Console, and only being able to use the basic keyboard arrows to repeat and modify your previous attempts.

Fortunately, there are many text editors to help you organize and save your R code. We will be learning to use the one in RStudio, but even the basic R installation comes with a useful text editor.

Within RStudio, you can use the menus to open a new R Script: File / New File / R Script. You can also click on the icon on the top left, that looks like a white sheet with a white-on-green cross, then select R Script.

In Windows or Linux you can also use the keyboard shortcut Ctrl+Shift+N, while on Mac the shortcut is ⇧⌘N.

Once you’ve opened a new R script, your RStudio interface should look something like this:

You can still type any commands you want into the Console, but you can also type (and save them!) in the R script editor above.

Most of what you will type into the editor will be R code, but you can also type in text, for example as a reminder of what you are trying to code. These bits of text are called comments, and in R you need to place a # before them so R does not try to execute them as R code. Try it out.
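As a rough idea of what a commented script might look like, here is a minimal sketch. The object names total_age and average_age are made up for this example:

```r
# Exercise: total and average age of the family
# (everything after a # is ignored by R)

ages <- c(40, 36, 10, 8, 1, 1, 88)   # a comment can also follow code

total_age <- sum(ages)               # add up all the values
average_age <- mean(ages)            # arithmetic mean

# total_age <- sum(ages   # kept as a reminder: the closing parenthesis is missing
```

Because the last line starts with a #, R skips it entirely, so the broken code inside it does no harm.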

You will see that there are several advantages of using this editor:

You can save the document (script), with all the code you have written. You can send this script to someone else and they can use it.
It performs automatic syntax highlighting. This means that comments or text will show in one color, numbers in another, as well as certain functions and symbols. This can make your code easier to follow and mistakes easier to find.
You are free to move around the text editor as you might expect, using the arrows or the mouse, and you can copy and paste text as usual.
Finally, it is connected to the Console in an interesting way: you can send any bit of code in the editor to the Console, where it will automatically be executed.

This last point is perhaps the most important. The line of code on which your cursor is currently placed can be sent to the Console with Ctrl+Enter on Windows/Linux or ⌘Return on Mac. This allows you to type code directly in the editor and, with a simple keyboard combination, test it out in the Console. If it works, great, you keep typing code. If there is a mistake, you will see an Error and you can modify the code until you get it right. You can also select several lines, or even all the code in the file, and send them to the Console in the same way.

This will be useful later on, but if you do want to execute all the code you have in a script, it is easier to use the Source button in the editor. The keyboard shortcut is Ctrl+Shift+S for Windows/Linux and ⇧⌘S for Mac.

Try using the editor, repeating some of the simple operations we have been doing. You are free to use any combination that suits you, perhaps trying some code out by typing directly into the Console, and then typing the final version into the editor. Do try to remember, though, that what you save in the editor should be code that works. There is no point saving all the trial and error along the way (the Console is better for that). If you do want to keep a bit of code with an error (perhaps as a reminder of what doesn’t work), you can place it in the editor but with a comment (#) at the start.

From now on, try using the editor more and more, so you get used to it.

Different types of objects and manipulating vectors

Until now, we have only been using numeric values. However, R can use other types of values. For example, we can use text values, as long as they are enclosed in quotes. We can save a series of text values in a new object, the same way as we did for ages.

nameObj <- c("Homer","Marge","Bart","Lisa","Maggie","Snowball","Abraham")

It can be practical in many situations to add a text label to each of our numbers (e.g. the name of the gene alongside the expression value). We can use the names() function to add text labels onto any other object:

names(ages) <- nameObj

and if we now ask to see the content of ages, we will see that each value has its corresponding label:

ages

##    Homer    Marge     Bart     Lisa   Maggie Snowball  Abraham 
##       40       36       10        8        1        1       88
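The gene expression case mentioned above might look like the following sketch. The gene names and values are invented for illustration, and expr_values is a hypothetical object name:

```r
# Hypothetical expression values for three made-up genes
expr_values <- c(252, 141, 165)
names(expr_values) <- c("geneA", "geneB", "geneC")

names(expr_values)    # used on its own, names() returns the labels
expr_values * 2       # the labels are kept during calculations

# Labels can also be given directly when creating the object
same_values <- c(geneA = 252, geneB = 141, geneC = 165)
```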

Our ages object has several values. In R, we call these kinds of objects vectors. We have already seen that we can perform mathematical operations on all the values of a vector at the same time. But sometimes, we want to manipulate only part of the vector. To extract part of a vector, we use square brackets. For example:

To get the first value from the vector:

ages[1]

## Homer 
##    40

and to get the Fifth Element:

ages[5]

## Maggie 
##      1

We can also get more than one value from a vector at once. For this, we need to tell R the positions or indexes of the values we are interested in. Remember that when we want to combine several values (in this case the indexes) we need to use the c() function. So, this is how it works:

c(1,5)

## [1] 1 5

ages[c(1,5)]

##  Homer Maggie 
##     40      1

If we want to get all the values from a series of consecutive positions, we can use the abbreviation first:last, as follows:

2:5

## [1] 2 3 4 5

ages[2:5]

##  Marge   Bart   Lisa Maggie 
##     36     10      8      1

Finally, if we use negative indexes R understands that we would like to remove these values from the vector, and only see what is left:

-c(1,5)

## [1] -1 -5

ages[-c(1,5)]

##    Marge     Bart     Lisa Snowball  Abraham 
##       36       10        8        1       88
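A few further variations on indexing, shown on a small made-up vector (scores is a hypothetical object, not part of the tutorial data) so that the ages object is left untouched:

```r
# Hypothetical vector of labelled values
scores <- c(first = 10, second = 20, third = 30, fourth = 40, fifth = 50)

scores[c(5, 1)]          # indexes can be given in any order: fifth, then first
scores[-(2:4)]           # remove a range: note the parentheses around 2:4
scores["third"]          # a label in quotes also works as an index
length(scores)           # how many values the vector holds
scores[length(scores)]   # so this is always the last value
```

The parentheses in -(2:4) matter: without them, -2:4 would be read as the sequence from -2 to 4, which mixes negative and positive indexes and produces an error.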

Exercises:

What is the average of the ages, but excluding Maggie?
Remove Snowball from the vector, order it, and save it as a new object.

Relational Operators

Sometimes we want to ask something about our numbers, for instance, which genes have higher expression than a certain value. This kind of question can be answered using relational operators (greater than, less than, equal to, etc). Let’s use our ages object to try this out.

Who is older than 10?

ages

##    Homer    Marge     Bart     Lisa   Maggie Snowball  Abraham 
##       40       36       10        8        1        1       88

ages > 10

##    Homer    Marge     Bart     Lisa   Maggie Snowball  Abraham 
##     TRUE     TRUE    FALSE    FALSE    FALSE    FALSE     TRUE

In the same way, we can ask which of our stored values are less than, greater than or equal to, equal to, or not equal to a given value:

ages < 10

##    Homer    Marge     Bart     Lisa   Maggie Snowball  Abraham 
##    FALSE    FALSE    FALSE     TRUE     TRUE     TRUE    FALSE

ages >= 10

##    Homer    Marge     Bart     Lisa   Maggie Snowball  Abraham 
##     TRUE     TRUE     TRUE    FALSE    FALSE    FALSE     TRUE

ages == 1    # CAREFUL! If you use a single = you will overwrite your object!

##    Homer    Marge     Bart     Lisa   Maggie Snowball  Abraham 
##    FALSE    FALSE    FALSE    FALSE     TRUE     TRUE    FALSE

ages != 1

##    Homer    Marge     Bart     Lisa   Maggie Snowball  Abraham 
##     TRUE     TRUE     TRUE     TRUE    FALSE    FALSE     TRUE

In all these cases R returns a series of TRUE/FALSE values, or logical values. These logical vectors can be quite practical. Let’s use a couple of examples:

sum(ages > 10)

## [1] 3

ages[ages > 10]

##   Homer   Marge Abraham 
##      40      36      88

Exercises:

What do these last instructions do?
How can you find the ages that are greater than the mean of all ages?

Plotting with R

R is very popular for the quality and variety of figures you can make. Once we have our data in an R object, it is relatively easy to make figures with them. For example, we can simply plot our ages object:

plot(ages)

and get something like this:

Some other basic plots:

barplot(ages)

pie(ages)

hist(ages)

All these functions have their own help page. Remember you can access them with ?hist for example. There is also a lot of information about what parameters you can change in ?par. It will take some time getting used to all of them. For now, let’s see another example, of medium complexity:

hist(ages, col="skyblue", main="My blue histogram", ylab="Number of individuals")
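The same kind of arguments work with the other plotting functions. As a sketch, here is a customized barplot; the colour, title and axis label are arbitrary choices, and the first two lines simply recreate the ages object so the example can be run on its own:

```r
ages <- c(40, 36, 10, 8, 1, 1, 88)
names(ages) <- c("Homer", "Marge", "Bart", "Lisa", "Maggie", "Snowball", "Abraham")

barplot(ages,
        col = "orange",                 # fill colour of the bars
        main = "Ages of the family",    # title above the plot
        ylab = "Age (years)",           # label for the vertical axis
        las = 2)                        # turn the labels so long names fit
```

The las argument is one of the graphical parameters described in ?par.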

Saving files into your Working Directory

When you are ready to save one of your plots, you can do so in several ways. For now, we will see how to do it using R code (instead of RStudio’s menus and buttons). For instance, to save the last plot we made, we would do the following:

pdf("histogram.pdf")

hist(ages, col="skyblue", main="My blue histogram", ylab="Number of individuals")

dev.off()

We need to start with the function pdf(), which opens the file in our “Working Directory”. From then on, every plotting function we use will send the plot to this file, instead of our screen. Once we have finished, we need to properly close this file, and for that we need to use the dev.off() function.
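The open, plot, close pattern can be sketched a little further. In this example the file name two_plots.pdf is made up, and the file is written to R's temporary folder (tempdir()) only so that nothing is left behind in your own folders:

```r
# Hypothetical file name, placed in R's temporary folder
out_file <- file.path(tempdir(), "two_plots.pdf")

pdf(out_file, width = 7, height = 5)                     # page size in inches
hist(c(40, 36, 10, 8, 1, 1, 88), main = "Histogram")     # first page
barplot(c(40, 36, 10, 8, 1, 1, 88), main = "Barplot")    # second page
dev.off()                                                # close the file

file.exists(out_file)   # TRUE if the file was written
```

With pdf(), each new plot made before dev.off() goes onto a new page of the same file.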

You may be wondering what exactly this “Working Directory” might be. R always has a current working directory in which it stores files and from which it can also read files. You can change this working directory whenever you want, but it is useful to first find out what it is, using the getwd() function (get working directory):

getwd()

You should see something like:

## [1] "/Users/cei/Documents"

This means that R is working inside that directory, and should have stored your histogram.pdf file there. Make sure you are aware which directory YOUR R is using right now. Can you use your normal operating system file browser to go to that directory and see if you can find and open the histogram you just saved?
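Changing the working directory from code is done with setwd() (set working directory). The sketch below moves to R's temporary folder purely as a safe example, then returns to where it started. In practice you would give the path to one of your own folders, and old_dir is just an illustrative name:

```r
old_dir <- getwd()    # remember where we started

setwd(tempdir())      # move to R's temporary folder
getwd()               # now reports the temporary folder

setwd(old_dir)        # go back to the original directory
getwd()               # same as at the start
```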

You can also take advantage of the RStudio interface, in the Files tab of the lower right hand section. There you will be able to click to “Go To Working Directory”. You can also navigate your directories using this interface, create “New Folders” and then select one to “Set As Working Directory”. Some of these options are shown in the following screen capture:

When you try these options out, you will see that RStudio is simply producing R code according to your wishes, and sending it to the Console for execution. This is similar to what you do when you type your code into a script, and can be a nice reminder of the real R code.

Finally, getting back to using pure R code, it can be quite useful to list files that are in your working directory:

list.files()

Can you find your histogram.pdf file in these results?

Reading data from files and manipulating tables

We really do NOT want to be putting our values into R by hand. If we already have a series of numbers in a text file, we can import them into R quite easily. For this exercise, download the following file num.txt. You can do this by right-clicking on the link in your web browser, and selecting the option to download the file. Depending on your web browser this could be “Save Link As…”, “Save Target As…”, “Download Linked File As…”. Make sure to save the file in your R working directory (see above).

After downloading a file, make sure you can see the num.txt file among the files in your current directory. You can use the RStudio interface to explore the contents of your working directory, or you can use the list.files() function. If you cannot see the downloaded file there, the next function will NOT work, since R will not be able to find the file you are asking it to open.

If for whatever reason the file you downloaded has a slightly different name, use that exact name in the following R code (remember R is case-sensitive).

For importing the contents of the file, we can use the scan() function.

num <- scan("num.txt")

num

##  [1] 252 141 165 174 192 225 176 191 229 170 125 229 176 190 189 239 170 233 180 219 229 223 203 161 247 199 224 260 225
## [30] 327 171 244 170 115 272 131 240 173 279  85 175 231 198 214 219 235 171 315 215 206

Try out some of the functions you already know about with this new object.
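If you would like to see scan() at work without depending on the download, the following sketch creates a small text file from within R and reads it back. The file name and the five numbers are made up, and writeLines() is used here only to create the example file:

```r
# Write five made-up numbers to a temporary text file, spread over two lines
tmp_file <- file.path(tempdir(), "example_numbers.txt")
writeLines(c("12 7 30", "5 21"), tmp_file)

example_num <- scan(tmp_file)   # reads numbers separated by spaces or line breaks
example_num                     # 12  7 30  5 21
length(example_num)             # 5
mean(example_num)               # 15
```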

To finish this tutorial, let’s look at an example that is a bit more realistic. The file tab.txt contains a table of expression values for 50 genes (rows) across 3 conditions (columns). Download and save this file in your working directory.

Files that are properly formatted as tables (same number of columns for every row) are easy to read into R. We can use the read.table() function:

tab <- read.table("tab.txt")

tab

##        cond1 cond2 cond3
## gene1    252   233   182
## gene2    141   179   216
## gene3    165   195   175
## gene4    174   190   188
## gene5    192   231   194
## gene6    225   142   197
## gene7    176   190   218
## gene8    191   210   175
## gene9    229   162   191
## gene10   170   112   192
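To get a feel for how read.table() decides what the column and row labels are, here is a sketch on a tiny made-up table written from within R. The layout is assumed to resemble that of tab.txt, judging by the output above, but the file name, gene names and values are all invented:

```r
# Create a small example file: the first line holds only the column names
tmp_table <- file.path(tempdir(), "example_table.txt")
writeLines(c("cond1 cond2",
             "geneA 10 20",
             "geneB 30 40",
             "geneC 50 60"), tmp_table)

# The first line has one field fewer than the others, so read.table()
# treats it as column names and uses the first column as row names
example_tab <- read.table(tmp_table)

dim(example_tab)        # 3 rows, 2 columns
head(example_tab, 2)    # only the first two rows, handy for large tables
colnames(example_tab)   # "cond1" "cond2"
rownames(example_tab)   # "geneA" "geneB" "geneC"
```

If a file has a header line with the same number of fields as the data lines, read.table() will not detect it automatically, and the argument header=TRUE is needed.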

One way of visualizing data in a table is with a boxplot(). This type of plot gives us an overview of the distribution of numerical values in each column of our table.

boxplot(tab, col=c("red","green","blue"))

We had already used square brackets to extract (with positive indexes) or remove (with negative indexes) part of a vector. We can do the same with tables. But, a table has two dimensions: rows and columns. In these cases, the square brackets will accept two series of indexes, separated by a comma:

tab[c(1,6,10), c(1,3)]

##        cond1 cond3
## gene1    252   182
## gene6    225   197
## gene10   170   192

tab[1:5, 1:3]

##       cond1 cond2 cond3
## gene1   252   233   182
## gene2   141   179   216
## gene3   165   195   175
## gene4   174   190   188
## gene5   192   231   194

We can also specify indexes in just one of the dimensions. Just remember to always use the comma so that it becomes obvious which dimension you are restricting.

tab[c(1,6,10), ]      # CAREFUL, always put the comma!

tab[, c(1,3)]
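Negative indexes and labels work on tables too. The sketch below uses a small made-up table built with data.frame() (the kind of object that read.table() returns), so the names toy, geneA and so on are only for illustration:

```r
# A tiny hypothetical table: 3 genes (rows) by 2 conditions (columns)
toy <- data.frame(cond1 = c(10, 30, 50),
                  cond2 = c(20, 40, 60),
                  row.names = c("geneA", "geneB", "geneC"))

toy[-1, ]                # all rows except the first, all columns
toy[2, ]                 # a single row: still a small table
toy[, 2]                 # a single column: returned as a plain vector
toy["geneC", "cond1"]    # labels also work as indexes: 50
```

Because a single column comes back as a vector, everything you have learnt about vectors can be applied to it directly.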

Exercises:

What are the expression values for the 3rd gene? (simple)
What is the mean expression across all genes in the 2nd condition? (intermediate)
How many genes have higher expression in the 1st condition compared to the 3rd? (harder)

More Exercises

The following exercises are provided to give you more practice with the R skills you have learnt in this practical. Take your time, and make sure you are not simply asking Google or an AI for the solution. Every question can be answered with what has been covered in this practical.

Start a new R script for each exercise. Add as many comments as you want (lines beginning with a #) to make your code easy to follow. At the very least, add commented lines at the start describing the exercise.

You can test as much code as you want in the Console, but make sure that the code that is saved in your script is error-free.

Use the data from tab.txt to answer each of the questions/exercises. Remember that each of your scripts will need to read these data first!

Obtain the minimum and the maximum values of gene expression for each of the three conditions.
Make a PDF file with three histograms (one for each of the three conditions). Make sure that you can tell which histogram corresponds to which condition.
Create two smaller tables, one with only genes 10-12 and another with genes 30-40. Save these two tables into new files (hint: check out ?write.table)
In each of the three conditions, how many genes have expression values greater than 250?

Independent learning:

In many cases you will come across a new function that you think might be useful, but that you do not know how to use. The following exercise is an example of this. To complete it, feel free to use Google, ask an AI or someone who you know already uses R. But try to make sure that at the end you understand how to use the function.

The function apply() can be used to apply any function (such as sum, mean, max, etc) to all the rows or all the columns of a table, using just one line of code. Try to figure out how you can use this function to calculate the mean expression for every gene in the table we have been working with. Can you then figure out how to obtain the mean expression for each condition? The information is available in the help page for apply() (type ?apply), although it takes some practice to understand how to interpret these help pages.