Genome annotation and repeating code

Introduction

In most cases, we can only interpret and make sense of our sequenced exRNA reads by understanding from which type of molecule they are derived from.

Are they microRNAs?
Do they map to protein-coding mRNAs?
Are they parts of rRNA or tRNA?
What about transposons?
How many are coming from as yet un-annotated parts of the genome?

Nowadays almost all projects will have a reference genome of some sort (although of varying quality). Together with the genome sequence we should also be able to obtain annotation file(s) which tell us the coordinates of protein-coding genes, and other predicted or known elements encoded on that genome. We will need to use this genome and its annotation as a reference for understanding our sequencing results.

Exploring the genome annotation of our data

In later practicals we will be looking more closely at real genome annotation files. For now, let us start by imagining we performed an experiment where we used 3 different methods to isolate extracellular vesicles (EVs). We sequenced small RNAs from the input material, as well as the small RNAs after enriching for EVs using the 3 methods. We then mapped the reads for each sample to our reference genome, and counted how many reads map to each type of annotation.

The results can be stored in a simple text file, containing one column for each of the four samples (input and 3 EV samples) and one row for each type of annotation (we sometimes refer to these as biotypes).

You can download this file using this link: EV_experiment1.txt

As always, you have a few options:

You can right-click and save the file into your working directory, in which case you need to use read.table and specify the filename.
You can try to load it directly from this website, by right-clicking and copying the link or link address and then pasting this into the read.table function.

Before proceeding make sure you did load it correctly, by checking the contents of the object.

countTab <- read.table("name_of_the_file_I_downloaded.txt")

countTab

##              input    EV1     EV2    EV3
## exons       450341  58502  159202 138592
## introns     150119  19620   52981  46445
## repeats     300784  38420  105985 643975
## miRNA      1348818 174731  476342 413169
## rRNA        450470  62151 1588819 137927
## intergenic  299468  46576  116671 119892

Similar to what I showed you at the end of the previous practical, you might decide to use a barplot to display these data. A figure is usually easier to explore!

Remember to be careful since read.table produces a data.frame while barplot wants your data to be stored as a matrix. So we will need to convert our object into the correct class.

countMat <- as.matrix(countTab)
class(countTab)

## [1] "data.frame"

class(countMat)

## [1] "matrix" "array"

colors <- c(4,5,3,7,2,8)

barplot(as.matrix(countTab), col=colors, space=0.5, las=2,
        main="Sequencing counts by annotation", legend.text=TRUE)

Let’s pause here a bit, and think about what we might interpret from this figure.

Are these results consistent with what we know about our experiment?

The first sample corresponds to the input, so we want to compare the others against this one.
Is the EV1 sample consistent with a successful enrichment of EVs showing a distinct population of RNA molecules?
What about the EV2 sample? What differences can you see, and what might these mean?
And what about the EV3 sample?

Scaling transformations

In many cases we will need to transform our data in some way. Some options that you have probably heard of or used are counts per million (CPM), reads per kilo-base per million mapped reads (RPKM). These can also be combined with a simple logarithm base 2. For now, let’s do a very simple one, let’s turn our counts into percentages, so that the values in each sample column sum to 100 (CPM are the same, except the columns will sum to 1 million).

There are many ways of doing this in R, but the one I’m using below is with the sweep function, which returns a new array modified according to a function (we are dividing here, so “/”). In this case we are dividing by the sum of each column, so colSums).

I admit this code is a bit complex, but don’t worry too much about it for now. You can use it as a recipe, and later modify it by trial and error or by asking for additional help (e.g. from ChatGPT or similar).

percMat <- sweep(countMat, MARGIN=2, STATS=colSums(countMat), FUN="/") * 100

percMat

##                input      EV1      EV2       EV3
## exons      15.011367 14.62550  6.36808  9.239467
## introns     5.003967  4.90500  2.11924  3.096333
## repeats    10.026133  9.60500  4.23940 42.931667
## miRNA      44.960600 43.68275 19.05368 27.544600
## rRNA       15.015667 15.53775 63.55276  9.195133
## intergenic  9.982267 11.64400  4.66684  7.992800

colSums(percMat)

## input   EV1   EV2   EV3 
##   100   100   100   100

barplot(percMat, col=colors, space=0.5, las=2,
        main="Sequencing counts by annotation (as percent)", legend.text=TRUE)

A trick to get enough space to fit the label on the same plot, is to add one or more empty columns (filled with NA values). The function cbind is for column bind.

barplot(cbind(percMat,NA,NA), col=colors, space=0.5, las=2,
        main="Sequencing counts by annotation (as percent)", legend.text=TRUE)

Now let’s revisit some of our questions:

What do you think about the differences between input and EV1?

Differences in sequencing depth are not to be trusted, since they can arise for all sorts of technical reasons and do not usually reflect a difference in the starting biological material.

How about EV2, does it look more similar, or still quite distinct?

What if EV2 was a bit more degraded, leading to more random rRNA fragments within the size range that we selected for sequencing small RNAs?

It the above is a possibility, we could try to correct this by removing all reads that map to rRNA, and then re-scaling the results. It would not be perfect (since if we do have different levels of degradation there are likely to be other types of RNA similarly affected), but removing rRNA is a typical step in trying to remove biases and making samples more comparable.

So, let’s look at the data and see how we might remove rRNA.

countMat

##              input    EV1     EV2    EV3
## exons       450341  58502  159202 138592
## introns     150119  19620   52981  46445
## repeats     300784  38420  105985 643975
## miRNA      1348818 174731  476342 413169
## rRNA        450470  62151 1588819 137927
## intergenic  299468  46576  116671 119892

What we need to do is remove the 5th row, corresponding to “rRNA”. We can do this in several different ways. I’ll show you some of these to see which ones you feel more comfortable with. Just remember that at the end you have to save the resulting object to proceed.

countMat[-5, ]

##              input    EV1    EV2    EV3
## exons       450341  58502 159202 138592
## introns     150119  19620  52981  46445
## repeats     300784  38420 105985 643975
## miRNA      1348818 174731 476342 413169
## intergenic  299468  46576 116671 119892

countMat[rownames(countMat) != "rRNA", ]

##              input    EV1    EV2    EV3
## exons       450341  58502 159202 138592
## introns     150119  19620  52981  46445
## repeats     300784  38420 105985 643975
## miRNA      1348818 174731 476342 413169
## intergenic  299468  46576 116671 119892

toRem <- c("rRNA")
countMat[!rownames(countMat) %in% toRem, ]

##              input    EV1    EV2    EV3
## exons       450341  58502 159202 138592
## introns     150119  19620  52981  46445
## repeats     300784  38420 105985 643975
## miRNA      1348818 174731 476342 413169
## intergenic  299468  46576 116671 119892

They all give the same result. So pick one, and let’s try plotting again. If you want the colors to be the same, you will need to also remove the color value for rRNA.

countMat_no_rRNA <- countMat[-5, ]
colors_no_rRNA   <- colors[-5]

barplot(cbind(countMat_no_rRNA,NA,NA), col=colors_no_rRNA, space=0.5, las=2,
        main="Sequencing counts by annotation (without rRNA)", legend.text=TRUE)

Ah, but we need to scale them again as percentages.

Can you do this yourselves?

We can now see that the first three samples, visualised in this manner, are not very different. The fourth sample is quite different still. So one way of interpreting these results is that the first two EV-enrichment protocols did not really work, perhaps still containing too much input material. The EV3 sample on the other hand does seem to be distinct, and might be more representative of our EV population.

Of course, this has just been a simulated example, and although real datasets will have similar issues, you need to interpret your results not just based on what the numbers or images are telling you, but also based on what you know about the biology of your samples.

Repeating your code

Now that you’ve made a script to perform your analysis once, what happens when you get results from a new experiment, and the results are in a new file in the same format?

Of course you want to perform the same analysis. How should you go about it?

At the very least you do not have to start writing your code from scratch! Some obvious solutions come to mind:

Edit your script to be able to process the new data correctly: change the name of the input file, maybe change one of the figure’s main titles so you know which figure corresponds to which input file, etc.
Or to not lose your original code, you make a copy of your script, and make the changes only within the new copy.

Both options would work, but they are not great: in the first case you’ve lost your original script and in the second case you now have two copies of the same script. If you decide to add some new functionality, you will need to be careful of adding it to both scripts, testing them both. If then you realise you have not two, but even more input files, neither of the solutions are feasible.

An alternative would be to modify your original script in such a way that the same script is able to process two (or more) different input files. Similarly it should produce two (or more) sets of result plots.

One of the simplest ways to accomplish this in coding is to use a for loop. This is a command that allow us to repeat part of our code a certain number of times.

Let’s first look at a couple of very simple examples:

inputFiles <- c("file1", "file2", "file3")
for (inputFile in inputFiles) {
  print(inputFile)
}

## [1] "file1"
## [1] "file2"
## [1] "file3"

for (i in 1:5) {
  print(i)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Can you see what the code is doing? It is repeating the internal lines of code (in this case just a print() command) a certain number of times. How many times? It depends on how many values are given to the loop (the inputFiles or 1:5 bit). Importantly, there is also something that changes each time the internal lines of code are executed. The print() command doesn’t print out the same value each time. This is because the other part of the loop (the inputFile or i object) changes each time, taking on, one-by-one, the values that you specify.

Have a try, make your own for loop. Add more commands than just a print().

Exercise

Please work in pairs on this exercise, and help each other out. Try to first talk about what you might need to do, and then try to code it. If you get stuck please ask any of the trainers.

Download the three count input files from this folder. Each file is similar to the data you have already worked with, but with some differences.
Prepare a script that will use a for loop to cycle through each file and process them according to this practical.
- Start by making a clean copy of your code, that works and produces the correct barplot.
- Decide which part of this code needs to be repeated for each file, try to place a for() loop around this.
- To make your code easier to read, all lines inside the for loop should have an indent. Hint: if you copy code into a for loop Rstudio will do this automatically. If you write a for loop around existing code, you can select this code and click Tab
At the end the script should produce a single PDF file with a barplot (or series of barplots) for each input file. We explained how to make a PDF file for plots in the first practical. Hint: think about where in your script you should open the pdf() device and where you should close it using dev.off()

Advanced/independent Exercise

You might be thinking that it can be more meaningful to compare each EV enrichment result against the input. You would be right! Although ideally we would be doing this in a more sophisticated manner, considering replicates and performing a statistical test, initially we can make a very simple calculation of dividing each EV result against the corresponding value in the input.

Can you try to make these calculations?

Each EV column would need to be divided by the same input column (start doing only one, you can implement a for loop later).
How will you store these new results?
How will you plot these new results?
There is an additional aspect to consider: ratios are not symmetrical.
- If the numerator is smaller than the denominator (depletion), the range of the possible values will be 0-1.
- If the numerator is larger than the denominator (enrichment), the range of possible values will be 1-Infinity.
- One trick to avoid this is to use a logarithm transformation, typically log2. Can you try this?