Genome annotation and counting reads from BAM alignments

Introduction

The purpose of this practical is to use our genome’s annotation to catalogue our sequenced reads. At the end of this practical you will produce a figure showing the relative amount of reads that can be assigned to each type of annotation (coding regions, miRNAs, rRNA, snoRNA, etc). We will do all of this inside R and using some Bioconductor packages.

Exactly the same strategy can be used even when no genome is available, as long as you have some sort of annotation regarding the reference sequences you are using (which could be contigs, ESTs, cDNAs, etc).

During this practical you will need the following Bioconductor packages: IRanges, rtracklayer, GenomicRanges, Rsamtools, GenomicAlignments

If you haven’t installed them before, you should do so now:

BiocManager::install(c("IRanges", "rtracklayer", "GenomicRanges", "Rsamtools", "GenomicAlignments", "Rsubread"))

Using IRanges

In this practical we will be manipulating different types of information that can be represented as ranges (gene coordinates, positions of mapped sequences, etc). One of the most flexible ways of working with ranges in R is to use the IRanges package.

library(IRanges)

To begin, let’s create an IRanges object, by defining the start and end positions of 3 ranges. Don’t worry about what these ranges mean right now, we will get to it.

x <- IRanges(start=c(11,35,40), end=c(20,50,63))

x

## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        11        20        10
##   [2]        35        50        16
##   [3]        40        63        24

An image might make it easier to understand what the values in our IRanges object represent.

There are several functions that we can use to extract information from an IRanges object:

start(x) # All the start positions

## [1] 11 35 40

end(x)   # All the end positions

## [1] 20 50 63

Again, a visual representation of these results

width(x) # The width of each range

## [1] 10 16 24

range(x) # The combined range (smallest start to largest end)

## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        11        63        53

Coverage calculations

One very useful function allows us to calculate the coverage, which is the sum of how many ranges overlap each position.

coverage(x)

## integer-Rle of length 63 with 6 runs
##   Lengths: 10 10 14  5 11 13
##   Values :  0  1  0  1  2  1

A integer-Rle means integers stored as Run-length encoded, and is a compact way of storing runs of the same value. So first we have a Value of 0 that is repeated a Length of 10 times. Much easier to understand if we simply convert it to the actual integers:

as.integer(coverage(x))

##  [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1
## [59] 1 1 1 1 1

And easier still as an image:

Coverage values have many uses, such as for displaying the accumulation of mapped reads on genome browsers, for detecting peaks where reads are piling up above a certain threshold, etc.

Collapsing overlaps

Another very useful function is reduce, which fuses all overlapping ranges into one.

reduce(x)

## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        11        20        10
##   [2]        35        63        29

Counting overlaps

And now to give you a closer idea to what we will be doing with these objects:

exons <- reduce(x)

reads <- IRanges(start=c(1,23,31,54,78), width=20)

reads

## IRanges object with 5 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        20        20
##   [2]        23        42        20
##   [3]        31        50        20
##   [4]        54        73        20
##   [5]        78        97        20

countOverlaps(exons, reads)

## [1] 1 3

Using the figure, make sure you can understand what the previous function is telling us: how many reads overlap with each of the two exons.

For more details, you can have a look at the IRanges vignette (browseVignettes(package="IRanges")).

Real genome annotation

The first thing we need is information about the genome elements we are interested in studying. For this project, we are interested in the annotation of the mouse and H. bakeri genomes. For each region of a genome we are interested in, we need to know it’s coordinates (what chromosome, start and end positions) and the strand on which it lies. We also want to know what type of element is encoded there: a protein? a non-coding RNA?

There are many ways to obtain these types of information. One very useful resource is Ensembl, where we can download a GTF/GFF (Gene Transfer/Feature Format) file with all the annotation information of a genome. Unfortunately, Ensembl does not include all possible genomes, and in particular H. bakeri is missing. There are similar resources though, and WormBase ParaSite does include the genome of H. bakeri.

For this practical please download the following annotation file:

Hbakeri_small_annotation.gff.gz

We will now use a function from the rtracklayerpackage to import the data into R. We will also be using the GenomicRanges package, so let’s load them both.

library(rtracklayer)
library(GenomicRanges)

annot <- import("Hbakeri_small_annotation.gff.gz")

If for whatever reason you were not able to load the gff3 file as above, you can download the Rdata file I saved with the previous instruction annot.Rdata, and load it directly into your R session. If you did load the file correctly with the import function, then do NOT use the following command.

load("annot.Rdata")

In either case, you should now be able to view all the loaded annotation:

annot

## GRanges object with 91854 ranges and 8 metadata columns:
##           seqnames          ranges strand |       source      type     score     phase                ID
##              <Rle>       <IRanges>  <Rle> |     <factor>  <factor> <numeric> <integer>       <character>
##       [1]        I     20308-20494      + | RepeatMasker DNA/MITE       13.4      <NA>   rnd-1_family-53
##       [2]        I     24008-24205      + | RepeatMasker SINE           20.8      <NA>   rnd-1_family-17
##       [3]        I     24965-25055      + | RepeatMasker Satellite      19.8      <NA>   rnd-1_family-80
##       [4]        I     25211-25296      + | RepeatMasker Unknown        11.6      <NA> rnd-4_family-1996
##       [5]        I     25610-25840      + | RepeatMasker Unknown        25.3      <NA>  rnd-1_family-105
##       ...      ...             ...    ... .          ...       ...       ...       ...               ...
##   [91850]      III 9996966-9997251      - | RepeatMasker SINE           11.5      <NA>   rnd-1_family-17
##   [91851]      III 9997460-9997621      - | RepeatMasker PLE/Naiad      16.1      <NA> rnd-5_family-1125
##   [91852]      III 9998041-9998288      - | RepeatMasker LINE/CR1       25.2      <NA>  rnd-1_family-633
##   [91853]      III 9998333-9998479      - | RepeatMasker Unknown        25.2      <NA> rnd-5_family-1813
##   [91854]      III 9999160-9999618      - | RepeatMasker Unknown         5.2      <NA>  rnd-1_family-217
##                    Parent       class          rep_name
##           <CharacterList> <character>       <character>
##       [1]               .    DNA/MITE   rnd-1_family-53
##       [2]               .        SINE   rnd-1_family-17
##       [3]               .   Satellite   rnd-1_family-80
##       [4]               .     Unknown rnd-4_family-1996
##       [5]               .     Unknown  rnd-1_family-105
##       ...             ...         ...               ...
##   [91850]               .        SINE   rnd-1_family-17
##   [91851]               .   PLE/Naiad rnd-5_family-1125
##   [91852]               .    LINE/CR1  rnd-1_family-633
##   [91853]               .     Unknown rnd-5_family-1813
##   [91854]               .     Unknown  rnd-1_family-217
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

You will see that this object belongs to the GRanges class, which is very useful to store information about coordinates on genomes (so, most kinds of annotation)

We can access different parts of the object with the following functions:

seqnames(annot)

## factor-Rle of length 91854 with 3 runs
##   Lengths: 32131 32029 27694
##   Values :   I     II    III
## Levels(3): I II III

ranges(annot)

## IRanges object with 91854 ranges and 0 metadata columns:
##               start       end     width
##           <integer> <integer> <integer>
##       [1]     20308     20494       187
##       [2]     24008     24205       198
##       [3]     24965     25055        91
##       [4]     25211     25296        86
##       [5]     25610     25840       231
##       ...       ...       ...       ...
##   [91850]   9996966   9997251       286
##   [91851]   9997460   9997621       162
##   [91852]   9998041   9998288       248
##   [91853]   9998333   9998479       147
##   [91854]   9999160   9999618       459

strand(annot)

## factor-Rle of length 91854 with 6 runs
##   Lengths: 16512 15619 16528 15501 14609 13085
##   Values :     +     -     +     -     +     -
## Levels(3): + - *

mcols(annot)

## DataFrame with 91854 rows and 8 columns
##             source      type     score     phase                ID          Parent       class          rep_name
##           <factor>  <factor> <numeric> <integer>       <character> <CharacterList> <character>       <character>
## 1     RepeatMasker DNA/MITE       13.4        NA   rnd-1_family-53               .    DNA/MITE   rnd-1_family-53
## 2     RepeatMasker SINE           20.8        NA   rnd-1_family-17               .        SINE   rnd-1_family-17
## 3     RepeatMasker Satellite      19.8        NA   rnd-1_family-80               .   Satellite   rnd-1_family-80
## 4     RepeatMasker Unknown        11.6        NA rnd-4_family-1996               .     Unknown rnd-4_family-1996
## 5     RepeatMasker Unknown        25.3        NA  rnd-1_family-105               .     Unknown  rnd-1_family-105
## ...            ...       ...       ...       ...               ...             ...         ...               ...
## 91850 RepeatMasker SINE           11.5        NA   rnd-1_family-17               .        SINE   rnd-1_family-17
## 91851 RepeatMasker PLE/Naiad      16.1        NA rnd-5_family-1125               .   PLE/Naiad rnd-5_family-1125
## 91852 RepeatMasker LINE/CR1       25.2        NA  rnd-1_family-633               .    LINE/CR1  rnd-1_family-633
## 91853 RepeatMasker Unknown        25.2        NA rnd-5_family-1813               .     Unknown rnd-5_family-1813
## 91854 RepeatMasker Unknown         5.2        NA  rnd-1_family-217               .     Unknown  rnd-1_family-217

The last function, mcols(), returns a table of metadata, where we can store and add any kind of extra information for each coordinate or region.

We can easily manipulate information stored in this section, like this:

table(mcols(annot)$type)

## 
##          DNA/MITE              SINE         Satellite           Unknown     Simple_repeat          LINE/CR1 
##              4527              3175              1523             23345              4529              2598 
##        DNA/Merlin      LINE/RTE-RTE             ncRNA        DNA/hAT-Ac    Low_complexity         PLE/Naiad 
##               301              1539              1079              2506               483               450 
##     DNA/TcMar-Tc4 DNA/TcMar-Mariner     LINE/RTE-BovB     DNA/TcMar-Tc1               DNA     LINE/Penelope 
##               871              3250               556              1272                94              1466 
##              LINE               LTR     DNA/MULE-MuDR              exon               CDS              mRNA 
##               183                46               385             16008             15638              1400 
##              gene    five_prime_UTR   three_prime_UTR         DNA/MITE?           LTR/Pao         LTR/Gypsy 
##               996               815               855               474               586               437 
##       RC/Helitron  DNA/TcMar-ISRm11          LTR/DIRS    DNA/TcMar-Pogo              rRNA       PLE/Chlamys 
##               137               205                19                25                12                36 
##           LINE/R1     LINE/I-Jockey         LTR/Copia             snRNA      DNA/PiggyBac DNA/PIF-Harbinger 
##                 5                 8                11                 4                 4                 1

Clearly there are many more of these columns than we actually need. If you just want to select a few, and remove the rest, you can do something like this:

mcols(annot) <- mcols(annot)[,c("source","type","ID","class")]

mcols(annot)

## DataFrame with 91854 rows and 4 columns
##             source      type                ID       class
##           <factor>  <factor>       <character> <character>
## 1     RepeatMasker DNA/MITE    rnd-1_family-53    DNA/MITE
## 2     RepeatMasker SINE        rnd-1_family-17        SINE
## 3     RepeatMasker Satellite   rnd-1_family-80   Satellite
## 4     RepeatMasker Unknown   rnd-4_family-1996     Unknown
## 5     RepeatMasker Unknown    rnd-1_family-105     Unknown
## ...            ...       ...               ...         ...
## 91850 RepeatMasker SINE        rnd-1_family-17        SINE
## 91851 RepeatMasker PLE/Naiad rnd-5_family-1125   PLE/Naiad
## 91852 RepeatMasker LINE/CR1   rnd-1_family-633    LINE/CR1
## 91853 RepeatMasker Unknown   rnd-5_family-1813     Unknown
## 91854 RepeatMasker Unknown    rnd-1_family-217     Unknown

Exercises:

How can you save a new object where you only keep annotations where the type column equals LTR?
- Slightly more advanced: annotations where the type contains the word LTR.
How about a new object containing all annotations on the “-” strand?
Finally, only those on chromosome III?

Importing BAM alignments

The next step we need to do is to load all the information about the genome positions to which our reads aligned. For this we are going to use sequences that were trimmed and filtered as in the last practical, and then aligned against the H. bakeri genome. Please download the BAM file directly from here: Pure_EV_r1.trimmed.bam. You can use wget on your command line, or download it and move it to your working directory. Inside R we can create an object which contains the filename. If you decide to use a different filename, just make sure that you change the code to the same name.

bamFile <- "Pure_EV_r1.trimmed.bam"

We are now going to use the GenomicAlignments package to load all the aligments present in the BAM file into R. This package should allow us to load BAM files of any sort, including those resulting from aligning paired-end reads.

library(GenomicAlignments)

bamAlign <- readGAlignments(bamFile)

bamAlign

## GAlignments object with 376957 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width     njunc
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##        [1]        I      +         23M        23     24313     24335        23         0
##        [2]        I      +         22M        22     25909     25930        22         0
##        [3]        I      +         22M        22     26082     26103        22         0
##        [4]        I      +         22M        22     26113     26134        22         0
##        [5]        I      +         22M        22     26113     26134        22         0
##        ...      ...    ...         ...       ...       ...       ...       ...       ...
##   [376953]      III      -         23M        23   9999761   9999783        23         0
##   [376954]      III      -         21M        21   9999761   9999781        21         0
##   [376955]      III      -         22M        22   9999900   9999921        22         0
##   [376956]      III      -         21M        21   9999901   9999921        21         0
##   [376957]      III      -         19M        19   9999932   9999950        19         0
##   -------
##   seqinfo: 3 sequences from an unspecified genome

A slight problem is that the default object class is GAlignments. In our case we are better off converting to a GRanges object that (hopefully!) we are all more comfortable with.

readsGR <- as(bamAlign, "GRanges")

readsGR

## GRanges object with 376957 ranges and 0 metadata columns:
##            seqnames          ranges strand
##               <Rle>       <IRanges>  <Rle>
##        [1]        I     24313-24335      +
##        [2]        I     25909-25930      +
##        [3]        I     26082-26103      +
##        [4]        I     26113-26134      +
##        [5]        I     26113-26134      +
##        ...      ...             ...    ...
##   [376953]      III 9999761-9999783      -
##   [376954]      III 9999761-9999781      -
##   [376955]      III 9999900-9999921      -
##   [376956]      III 9999901-9999921      -
##   [376957]      III 9999932-9999950      -
##   -------
##   seqinfo: 3 sequences from an unspecified genome

Counting reads overlapping annotation

We can use the countOverlaps() function to find out how many of our reads fall within (or nearby) each of our annotated regions. We can then store this result directly as an extra column of our annotation.

WARNING: Although we are going to ignore this for now, many of our annotations could actually be overlapping each other. Any read that maps to these overlapping regions will be counted again and again for each overlapping region.

mcols(annot)$counts <- countOverlaps(annot, readsGR)

mcols(annot)

## DataFrame with 91854 rows and 5 columns
##             source      type                ID       class    counts
##           <factor>  <factor>       <character> <character> <integer>
## 1     RepeatMasker DNA/MITE    rnd-1_family-53    DNA/MITE         0
## 2     RepeatMasker SINE        rnd-1_family-17        SINE         0
## 3     RepeatMasker Satellite   rnd-1_family-80   Satellite         0
## 4     RepeatMasker Unknown   rnd-4_family-1996     Unknown         0
## 5     RepeatMasker Unknown    rnd-1_family-105     Unknown         0
## ...            ...       ...               ...         ...       ...
## 91850 RepeatMasker SINE        rnd-1_family-17        SINE         0
## 91851 RepeatMasker PLE/Naiad rnd-5_family-1125   PLE/Naiad         0
## 91852 RepeatMasker LINE/CR1   rnd-1_family-633    LINE/CR1         0
## 91853 RepeatMasker Unknown   rnd-5_family-1813     Unknown         0
## 91854 RepeatMasker Unknown    rnd-1_family-217     Unknown         0

If you receive a Warning with the above code you can ignore it for now. This warning arises because the “chromosomes” (reference sequences) are not identical between the annotation and the reads. R is warning us that we might be comparing different genomes, but we know that this is not so in our case.

One way of visualising these counts is to aggregate them in some way. We could do this by biotype, or even by individual gene.

If we only wanted to count reads in genes, we would first want to select the part of the object that corresponds to genes.

genes <- annot[annot$type == "gene"]

geneCounts <- aggregate(mcols(genes)$counts, by=list("id"=mcols(genes)$ID), sum)

head(geneCounts)

##                  id   x
## 1     nxHelBake1.g1   1
## 2    nxHelBake1.g10 234
## 3   nxHelBake1.g100   0
## 4 nxHelBake1.g10000  48
## 5 nxHelBake1.g10001   0
## 6 nxHelBake1.g10002  83

We could then use these geneCounts directly to perform differential gene expression analysis.

On the other hand, it can be quite useful to plot the distribution of our reads among different types of annotation.

typeCounts <- aggregate(mcols(annot)$counts, by=list("type"=mcols(annot)$type), sum)

typeCounts

##                 type      x
## 1           DNA/MITE    542
## 2               SINE    465
## 3          Satellite   4721
## 4            Unknown  18221
## 5      Simple_repeat    557
## 6           LINE/CR1    503
## 7         DNA/Merlin     11
## 8       LINE/RTE-RTE   4412
## 9              ncRNA   9484
## 10        DNA/hAT-Ac    380
## 11    Low_complexity      5
## 12         PLE/Naiad     17
## 13     DNA/TcMar-Tc4   1825
## 14 DNA/TcMar-Mariner   2771
## 15     LINE/RTE-BovB 122750
## 16     DNA/TcMar-Tc1    400
## 17               DNA      2
## 18     LINE/Penelope    283
## 19              LINE      2
## 20               LTR      3
## 21     DNA/MULE-MuDR     73
## 22              exon   1916
## 23               CDS   1078
## 24              mRNA 113982
## 25              gene  74640
## 26    five_prime_UTR    282
## 27   three_prime_UTR    562
## 28         DNA/MITE?    328
## 29           LTR/Pao    131
## 30         LTR/Gypsy    190
## 31       RC/Helitron    684
## 32  DNA/TcMar-ISRm11      1
## 33          LTR/DIRS      0
## 34    DNA/TcMar-Pogo      0
## 35              rRNA     20
## 36       PLE/Chlamys      1
## 37           LINE/R1      0
## 38     LINE/I-Jockey      0
## 39         LTR/Copia      0
## 40             snRNA      0
## 41      DNA/PiggyBac      0
## 42 DNA/PIF-Harbinger      0

Exercise:

Using the results in typeCounts, make a pie chart like the one below.
- You will need to sort your data from smallest to largest counts.
- Collapse the smallest categories (below a threshold that you define) into a single “other” category, and then remove them .

For discussion

The process of assigning mapped reads to a particular gene (to count it’s expression) is much more complex than what we have done in this practical. Think about the following questions:

What should we do if a read maps equally well to two genes?
What should we do if a read overlaps with a gene, but only by one base?
What should we do if a read maps to a gene, but on the opposite strand?
What should we do if a read maps to more than one location in the genome?
What should we do if a read maps to an intron of a gene?

There are various programs that specialise in obtaining gene expression values from mapped reads. These allow us to control many of the issues brought up by the above questions. One good option to do this inside R is the featureCounts() function from the Rsubread package which we will be using later.

Reducing the overlap problem

You might have realised this just by looking at the names of annotated elements in the type column, which include exon, CDS, mRNA, gene, five_prime_UTR and three_prime_UTR. A gene is transcribed into an mRNA, and an mRNA is composed of exons. Within the mRNA we can have a 5’ UTR, a CDS and a 3’UTR. So there is bound to be a lot of redundancy here.

To make this very clear, have a look at the following image, based on an actual segment of the genome annotation which you can explore here: IGV g199

Can you tell how many times the black read would be counted, even though it doesn’t overlap with any of the exons?

The previous pie chart also suggests this redundancy, with both gene and mRNA appearing as large slices. The example figure can help you see why mRNA is bigger than gene.

In many cases when we are exploring our own data in new ways, we will need to think about how to deal with redundancy. In this case involving protein-coding genes, the easiest way is to keep only the exon category, since these should encompass all the relevant bases in the genome (and will definitely include all the CDS, five_prime_UTR and three_prime_UTR).

The mRNA and gene categories are definitely unnecessary and can be quite dangerous since they include introns (not annotated as a separate category). In H. bakeri many transposons reside within introns. The slice of the pie labeled as gene will be inflated due to reads mapping to transposons that reside in intron.

On the other hand, we might be interested in studying introns as a category. There are many regulatory elements in introns. Other functional elements like miRNAs or transposons can reside in introns and some might be missed in our annotation.

Removing redundant annotations

Let’s first pre-define annotation categories that we want to remove, and count how many annotation elements we would remove.

toRem <- c("CDS", "mRNA", "gene", "five_prime_UTR", "three_prime_UTR")

head(annot$type %in% toRem)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

head(!annot$type %in% toRem)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

sum(annot$type %in% toRem)

## [1] 19704

sum(!annot$type %in% toRem)

## [1] 72150

The %in% function allows us to search for matching words between the left-hand object (annot$type in this case) and the right-hand object (toRem). When an item matches, we get TRUE, and when it doesn’t we get FALSE. It’s easy to invert these by using the ! operator (which we read as not).

Can you see what the above instructions are doing?

Let’s use this result to remove the redundant categories.

annotNR <- annot[!annot$type %in% toRem]

length(annot)

## [1] 91854

length(annotNR)

## [1] 72150

Finally, let’s count the reads again using this new annotation, and see how the pie changes. Exactly the same code:

mcols(annotNR)$counts <- countOverlaps(annotNR, readsGR)

typeCountsNR <- aggregate(mcols(annotNR)$counts, by=list("type"=mcols(annotNR)$type), sum)

minCount <- 1600
typeCountsNRHigh <- typeCountsNR[typeCountsNR$x > minCount,]
typeCountsNRHigh <- typeCountsNRHigh[order(typeCountsNRHigh$x),]
typeCountsNRHigh <- rbind(data.frame("type"="other",
                           "x"=sum(typeCountsNR$x[typeCountsNR$x <= minCount])),
                           typeCountsNRHigh)

pie(typeCountsNRHigh$x, labels=typeCountsNRHigh$type, col=rev(rainbow(nrow(typeCountsNRHigh))),
    main="Number of aligned reads per type (NR)", cex=0.9)

A drastically different picture, I’m sure you’ll agree.

But as mentioned above, by removing the gene/mRNA categories, we have removed introns. We can confirm just how many annotated bases we have removed, by a simple calculation taking advantage of the reduce function which collapses all overlaps.

nBasesAll <- sum(width(reduce(annot)))

nBasesNR  <- sum(width(reduce(annotNR)))

missingBases <- nBasesAll - nBasesNR
missingPerc <-  missingBases / nBasesAll * 100

print(paste("Percent of annotated bases now missing:",round(missingPerc,1)))

## [1] "Percent of annotated bases now missing: 34.6"

Adding more annotation: introns

This is definitely more advanced, but let’s look at how we would go about defining the regions in the genome that belong to introns. What is an intron? One way of thinking about this is that introns are the spaces between exons. So if we start with our mRNA positions, and remove the exons, we should be left with the introns. Can we do this? Yes!

mrnas <- reduce(annot[annot$type == "mRNA"])

exons <- reduce(annot[annot$type == "exon"])

introns <- setdiff(mrnas, exons)

introns

## GRanges object with 8930 ranges and 0 metadata columns:
##          seqnames          ranges strand
##             <Rle>       <IRanges>  <Rle>
##      [1]        I   250694-251696      +
##      [2]        I   251849-255490      +
##      [3]        I   255574-255651      +
##      [4]        I   255775-260767      +
##      [5]        I   260857-261021      +
##      ...      ...             ...    ...
##   [8926]      III 9977451-9977673      -
##   [8927]      III 9977805-9977866      -
##   [8928]      III 9977909-9980742      -
##   [8929]      III 9980881-9980934      -
##   [8930]      III 9981167-9983991      -
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

The setdiff function will first collapse both sets of ranges, and then define all the fragments of the first that are not present in the second. By necessity this kind of operation will discard the annotation present in the mcols.

Since we know that there can be other annotation categories within the introns (especially transposons) we might as well clean up these introns before we use them. This is pretty easy, since we just want to remove any part of the introns that overlaps with anything currently in our NR annotation. We can use the setdiff function again.

introns <- setdiff(introns, annotNR)

Now let’s confirm that the number of bases that we initially lost have been recovered by estimating the introns.

missingBases

## [1] 10710622

sum(width(introns))

## [1] 10710622

Pretty good!

Because we have lost all annotation, we need to create some values for the mcols before we can combine them all together. If we don’t do this now, we end up with all the values filled with NA (and this could cause problems later on)

mcols(introns)$source <- "manual"
mcols(introns)$type <- "introns"
mcols(introns)$ID <- "introns"
mcols(introns)$class <- "protein_coding"
mcols(introns)$counts <- 0

We can now add our new introns onto our general annotation, and we can repeat our analysis.

annotNR_introns <- c(annotNR, introns)

annotNR_introns

## GRanges object with 93363 ranges and 5 metadata columns:
##           seqnames          ranges strand |       source      type                ID          class    counts
##              <Rle>       <IRanges>  <Rle> |     <factor>  <factor>       <character>    <character> <numeric>
##       [1]        I     20308-20494      + | RepeatMasker DNA/MITE    rnd-1_family-53       DNA/MITE         0
##       [2]        I     24008-24205      + | RepeatMasker SINE        rnd-1_family-17           SINE         0
##       [3]        I     24965-25055      + | RepeatMasker Satellite   rnd-1_family-80      Satellite         0
##       [4]        I     25211-25296      + | RepeatMasker Unknown   rnd-4_family-1996        Unknown         0
##       [5]        I     25610-25840      + | RepeatMasker Unknown    rnd-1_family-105        Unknown         0
##       ...      ...             ...    ... .          ...       ...               ...            ...       ...
##   [93359]      III 9979957-9980742      - |       manual   introns           introns protein_coding         0
##   [93360]      III 9980881-9980934      - |       manual   introns           introns protein_coding         0
##   [93361]      III 9981167-9981317      - |       manual   introns           introns protein_coding         0
##   [93362]      III 9981691-9981737      - |       manual   introns           introns protein_coding         0
##   [93363]      III 9981914-9983991      - |       manual   introns           introns protein_coding         0
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

mcols(annotNR_introns)$counts <- countOverlaps(annotNR_introns, readsGR)

typeCountsNR <- aggregate(mcols(annotNR_introns)$counts, by=list("type"=mcols(annotNR_introns)$type), sum)

minCount <- 1600
typeCountsNRHigh <- typeCountsNR[typeCountsNR$x > minCount,]
typeCountsNRHigh <- typeCountsNRHigh[order(typeCountsNRHigh$x),]
typeCountsNRHigh <- rbind(data.frame("type"="other",
                           "x"=sum(typeCountsNR$x[typeCountsNR$x <= minCount])),
                           typeCountsNRHigh)

pie(typeCountsNRHigh$x, labels=typeCountsNRHigh$type, col=rev(rainbow(nrow(typeCountsNRHigh))),
    main="Number of aligned reads per type (NR with introns)", cex=0.9)

The picture changes once again! But we are getting closer to having a realistic view of how our sRNAs are spread across the genomic annotation.

As you can hopefully see, this is by no means trivial, and requires a lot of work to weed out problematic (redundant or overlapping) annotation, as well as adding important annotation that we might be missing.

More recommendations

We don’t have enough space or time to go into greater depth, but I’ll just remind everyone here of three problems that remain in our annotation:

There are still overlaps in our annotation. But careful use of reduce and setdiff functions can fix this if needed.
We are not counting reads that fall outside of our annotation, what we would call intergenic regions. If we just want to capture them, we can use the gaps function. Alternatively we might want intergenic regions that are upstream/downstream of genes (e.g. promoters). For this, we have a promoters function as well as a more generic flank function.
Many siRNAs are produced in anti-sense to their primary RNA. We have not counted these so far (by default only overlaps on the same strand are counted.

To illustrate this last problem, I’ll compare the previous results (counting overlaps on the same or sense strand) to the same calculation but now counting reads on the opposite or anti-sense strand of each annotation. For this I will take advantage of an invertStrand function.

mcols(annotNR_introns)$counts <- countOverlaps(annotNR_introns, readsGR, minoverlap=10)

mcols(annotNR_introns)$counts_AS <- countOverlaps(annotNR_introns, invertStrand(readsGR), minoverlap=10)

To simplify the results, I will also convert annotation types to a simpler version, where I am collapsing many kinds of transposable elements into more general categories.

mcols(annotNR_introns)$simple_type <- annotNR_introns$type
annotNR_introns$simple_type[grepl("LINE|PLE",annotNR_introns$type)] <- "LINE"
annotNR_introns$simple_type[grepl("LTR",annotNR_introns$type)] <- "LTR"
annotNR_introns$simple_type[grepl("DNA|RC",annotNR_introns$type)] <- "DNA"
annotNR_introns$simple_type <- factor(annotNR_introns$simple_type)

The rest of the code is the same as before:

typeCountsNR_S  <- aggregate(mcols(annotNR_introns)$counts, by=list("type"=mcols(annotNR_introns)$simple_type), sum)
typeCountsNR_AS <- aggregate(mcols(annotNR_introns)$counts_AS, by=list("type"=mcols(annotNR_introns)$simple_type), sum)

all(typeCountsNR_S$type == typeCountsNR_AS$type)

## [1] TRUE

allCounts <- cbind(typeCountsNR_S$x, typeCountsNR_AS$x)
rownames(allCounts) <- typeCountsNR_S$type
colnames(allCounts) <- c("sense","anti-sense")

cols <- rainbow(nrow(allCounts))
barplot(cbind(allCounts,NA), beside=TRUE, las=1, col=cols, legend.text=TRUE,
        main="Aligned reads on sense vs anti-sense strands")

Counting overlaps with featureCounts

As mentioned briefly above, there are more specialised programs (in R, but also on the Linux command line) for counting the overlaps between reads and genomic regions. The one that I would currently recommend is called featureCounts and is part of the Rsubread package that you should have already installed.

There are two main advantages over the countOverlaps function we have been using so far:

featureCounts gives us much more flexibility to configure the kind of overlaps we want to count. For example:

We can choose to ignore reads that overlap multiple annotations
We can choose to count the corresponding fraction of a read if it maps to multiple annotations
When multiple overlaps exist, we can prefer the longest overlap
We can decide if we want to count overlaps on one strand, the opposite strand, or both

featureCounts is way more computationally efficient: it is faster, and uses less memory.

This is particularly important if we are counting overlaps with many large BAM files.
It can also use multiple CPU threads

The main disadvantage is that we have to spend some time learning to use it, probably by reading the manual or help page: ?featureCounts. One key difference is that it does not use GRanges objects. It will read the BAM files directly from disk, and the annotation needs to be provided either as a GTF/GFF file (I do not recommend this, due to the overlap issues mentioned above) or a simple table format in R called SAF (Simplified Annotation Format).

Let’s show all this in action with the data we have been working with. As always, we first need to load any required libraries.

library(Rsubread)

Next we create the saf object with the annotation we want. Note that featureCounts will combine all the counts at the GeneID level. Here we take advantage of this to count by the simple_type of annotation we have previously defined. More traditionally featureCounts is used for counting gene expression in RNA-Seq experiments, and this column would be used to combine the counts of any exon belonging to the same GeneID.

saf <- data.frame("GeneID" = annotNR_introns$simple_type,
                  "Chr"    = seqnames(annotNR_introns),
                  "Start"  = start(annotNR_introns),
                  "End"    = end(annotNR_introns),
                  "Strand" = strand(annotNR_introns)
)

head(saf)

##          GeneID Chr Start   End Strand
## 1           DNA   I 20308 20494      +
## 2          SINE   I 24008 24205      +
## 3     Satellite   I 24965 25055      +
## 4       Unknown   I 25211 25296      +
## 5       Unknown   I 25610 25840      +
## 6 Simple_repeat   I 25843 25894      +

Now let’s run the function. Check the help page for details about the arguments I’m using.

fcRes <- featureCounts(bamFile, annot.ext=saf, allowMultiOverlap=TRUE, largestOverlap=TRUE,
                       fraction=TRUE, minOverlap=10,
                       strandSpecific=2, nthreads=2)

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.22.1
## 
## //========================== featureCounts setting ===========================\\
## ||                                                                            ||
## ||             Input files : 1 BAM file                                       ||
## ||                                                                            ||
## ||                           Pure_EV_r1.trimmed.bam                           ||
## ||                                                                            ||
## ||              Paired-end : no                                               ||
## ||        Count read pairs : no                                               ||
## ||              Annotation : R data.frame                                     ||
## ||      Dir for temp files : .                                                ||
## ||                 Threads : 2                                                ||
## ||                   Level : meta-feature level                               ||
## ||      Multimapping reads : counted (fractional)                             ||
## || Multi-overlapping reads : counted                                          ||
## ||   Min overlapping bases : 10                                               ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Load annotation file .Rsubread_UserProvidedAnnotation_pid59217 ...         ||
## ||    Features : 93363                                                        ||
## ||    Meta-features : 13                                                      ||
## ||    Chromosomes/contigs : 3                                                 ||
## ||                                                                            ||
## || Process BAM file Pure_EV_r1.trimmed.bam...                                 ||
## ||    Strand specific : reversely stranded                                    ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 376957                                               ||
## ||    Successfully assigned alignments : 225821 (59.9%)                       ||
## ||    Running time : 0.00 minutes                                             ||
## ||                                                                            ||
## || Write the final count table.                                               ||
## || Write the read assignment summary.                                         ||
## ||                                                                            ||
## \\============================================================================//

The result object contains several things.

names(fcRes)

## [1] "counts"     "annotation" "targets"    "stat"

fcRes$stat

##                           Status Pure_EV_r1.trimmed.bam
## 1                       Assigned                 225821
## 2            Unassigned_Unmapped                      0
## 3           Unassigned_Read_Type                      0
## 4           Unassigned_Singleton                      0
## 5      Unassigned_MappingQuality                      0
## 6             Unassigned_Chimera                      0
## 7      Unassigned_FragmentLength                      0
## 8           Unassigned_Duplicate                      0
## 9        Unassigned_MultiMapping                      0
## 10          Unassigned_Secondary                      0
## 11           Unassigned_NonSplit                      0
## 12         Unassigned_NoFeatures                 149503
## 13 Unassigned_Overlapping_Length                   1633
## 14          Unassigned_Ambiguity                      0

fcRes$counts

##                Pure_EV_r1.trimmed.bam
## DNA                          29799.33
## SINE                           282.50
## Satellite                     2197.17
## Unknown                      22447.67
## Simple_repeat                   69.00
## LINE                         35321.33
## ncRNA                           69.67
## Low_complexity                   0.00
## LTR                          45499.50
## exon                         38062.83
## rRNA                             4.00
## snRNA                            1.00
## introns                      52067.00

tmpCounts <- cbind(allCounts, fcRes$counts[rownames(allCounts),])
colnames(tmpCounts)[3] <- "featureCounts"
barplot(cbind(tmpCounts,NA), beside=TRUE, las=1, col=cols, legend.text=TRUE,
        main="featureCounts result on the anti-sense strand")

There are some small differences between the two approaches, caused by the level of overlaps between our annotation. Remember that featureCounts will not be counting the same read more than once, instead it will prefer to assign the count to the longest overlap, and if not it will assign a fractional count to multiple annotations. The countOverlaps function will count the same read as many times as it finds overlaps.

I would always recommend using featureCounts since it was designed exactly for these purposes.

Extras

To help illustrate the IRanges examples at the start of this practical, I used the following function from the manual of the IRanges package. We haven’t discussed creating our own functions in this workshop, but you can copy/paste it into your code if you want to use it.

plotRanges <- function(x, xlim=x, main=deparse(substitute(x)), col="black", sep=0.2, ...) {
    height <- 1
    if (is(xlim,"IntegerRanges")) {
      xlim <- c(min(start(xlim)), max(end(xlim)))
    }
    bins <- disjointBins(IRanges(start(x), end(x) + 1))
    plot.new()
    plot.window(xlim, c(0, max(bins)*(height + sep)))
    ybottom <- bins * (sep + height)- height
    rect(start(x)-0.1, ybottom, end(x)+0.1, ybottom + height, col=col, ...)
    title(main)
    axis(1)
}

And a basic example:

x <- IRanges(start=c(11,35,40), end=c(20,50,63))

plotRanges(x, col=4, xlim=c(0,80), main="IRanges example")