The purpose of this practical is to use our genome’s annotation to catalogue our sequenced reads. At the end of this practical you will produce a figure showing the relative amount of reads that can be assigned to each type of annotation (coding regions, miRNAs, rRNA, snoRNA, etc). We will do all of this inside R and using some Bioconductor packages.
Exactly the same strategy can be used even when no genome is available, as long as you have some sort of annotation regarding the reference sequences you are using (which could be contigs, ESTs, cDNAs, etc).
During this practical you will need the following Bioconductor
packages:
IRanges, rtracklayer, GenomicRanges, Rsamtools, GenomicAlignments
If you haven’t installed them before, you should do so now:
BiocManager::install(c("IRanges", "rtracklayer", "GenomicRanges", "Rsamtools", "GenomicAlignments", "Rsubread"))In this practical we will be manipulating different types of information that can be represented as ranges (gene coordinates, positions of mapped sequences, etc). One of the most flexible ways of working with ranges in R is to use the IRanges package.
To begin, let’s create an IRanges object, by defining the start and end positions of 3 ranges. Don’t worry about what these ranges mean right now, we will get to it.
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 11 20 10
## [2] 35 50 16
## [3] 40 63 24
An image might make it easier to understand what the values in our
IRanges object represent.
There are several functions that we can use to extract information from an IRanges object:
## [1] 11 35 40
## [1] 20 50 63
Again, a visual representation of these results
## [1] 10 16 24
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 11 63 53
One very useful function allows us to calculate the
coverage, which is the sum of how many ranges overlap each
position.
## integer-Rle of length 63 with 6 runs
## Lengths: 10 10 14 5 11 13
## Values : 0 1 0 1 2 1
A integer-Rle means integers stored as Run-length
encoded, and is a compact way of storing runs of the same value. So
first we have a Value of 0 that is repeated a
Length of 10 times. Much easier to understand if we simply
convert it to the actual integers:
## [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1
## [59] 1 1 1 1 1
And easier still as an image:
Coverage values have many uses, such as for displaying the accumulation of mapped reads on genome browsers, for detecting peaks where reads are piling up above a certain threshold, etc.
Another very useful function is reduce, which fuses all
overlapping ranges into one.
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 11 20 10
## [2] 35 63 29
And now to give you a closer idea to what we will be doing with these objects:
## IRanges object with 5 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 20 20
## [2] 23 42 20
## [3] 31 50 20
## [4] 54 73 20
## [5] 78 97 20
## [1] 1 3
Using the figure, make sure you can understand what the previous
function is telling us: how many reads overlap with each of
the two exons.
For more details, you can have a look at the IRanges vignette
(browseVignettes(package="IRanges")).
The first thing we need is information about the genome elements we are interested in studying. For this project, we are interested in the annotation of the mouse and H. bakeri genomes. For each region of a genome we are interested in, we need to know it’s coordinates (what chromosome, start and end positions) and the strand on which it lies. We also want to know what type of element is encoded there: a protein? a non-coding RNA?
There are many ways to obtain these types of information. One very useful resource is Ensembl, where we can download a GTF/GFF (Gene Transfer/Feature Format) file with all the annotation information of a genome. Unfortunately, Ensembl does not include all possible genomes, and in particular H. bakeri is missing. There are similar resources though, and WormBase ParaSite does include the genome of H. bakeri.
For this practical please download the following annotation file:
We will now use a function from the rtracklayerpackage
to import the data into R. We will also be using the
GenomicRanges package, so let’s load them both.
If for whatever reason you were not able to load the gff3 file as above, you can download the Rdata file I saved with the previous instruction annot.Rdata, and load it directly into your R session. If you did load the file correctly with the
importfunction, then do NOT use the following command.
In either case, you should now be able to view all the loaded annotation:
## GRanges object with 91854 ranges and 8 metadata columns:
## seqnames ranges strand | source type score phase ID
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer> <character>
## [1] I 20308-20494 + | RepeatMasker DNA/MITE 13.4 <NA> rnd-1_family-53
## [2] I 24008-24205 + | RepeatMasker SINE 20.8 <NA> rnd-1_family-17
## [3] I 24965-25055 + | RepeatMasker Satellite 19.8 <NA> rnd-1_family-80
## [4] I 25211-25296 + | RepeatMasker Unknown 11.6 <NA> rnd-4_family-1996
## [5] I 25610-25840 + | RepeatMasker Unknown 25.3 <NA> rnd-1_family-105
## ... ... ... ... . ... ... ... ... ...
## [91850] III 9996966-9997251 - | RepeatMasker SINE 11.5 <NA> rnd-1_family-17
## [91851] III 9997460-9997621 - | RepeatMasker PLE/Naiad 16.1 <NA> rnd-5_family-1125
## [91852] III 9998041-9998288 - | RepeatMasker LINE/CR1 25.2 <NA> rnd-1_family-633
## [91853] III 9998333-9998479 - | RepeatMasker Unknown 25.2 <NA> rnd-5_family-1813
## [91854] III 9999160-9999618 - | RepeatMasker Unknown 5.2 <NA> rnd-1_family-217
## Parent class rep_name
## <CharacterList> <character> <character>
## [1] . DNA/MITE rnd-1_family-53
## [2] . SINE rnd-1_family-17
## [3] . Satellite rnd-1_family-80
## [4] . Unknown rnd-4_family-1996
## [5] . Unknown rnd-1_family-105
## ... ... ... ...
## [91850] . SINE rnd-1_family-17
## [91851] . PLE/Naiad rnd-5_family-1125
## [91852] . LINE/CR1 rnd-1_family-633
## [91853] . Unknown rnd-5_family-1813
## [91854] . Unknown rnd-1_family-217
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
You will see that this object belongs to the GRanges class, which is very useful to store information about coordinates on genomes (so, most kinds of annotation)
We can access different parts of the object with the following functions:
## factor-Rle of length 91854 with 3 runs
## Lengths: 32131 32029 27694
## Values : I II III
## Levels(3): I II III
## IRanges object with 91854 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 20308 20494 187
## [2] 24008 24205 198
## [3] 24965 25055 91
## [4] 25211 25296 86
## [5] 25610 25840 231
## ... ... ... ...
## [91850] 9996966 9997251 286
## [91851] 9997460 9997621 162
## [91852] 9998041 9998288 248
## [91853] 9998333 9998479 147
## [91854] 9999160 9999618 459
## factor-Rle of length 91854 with 6 runs
## Lengths: 16512 15619 16528 15501 14609 13085
## Values : + - + - + -
## Levels(3): + - *
## DataFrame with 91854 rows and 8 columns
## source type score phase ID Parent class rep_name
## <factor> <factor> <numeric> <integer> <character> <CharacterList> <character> <character>
## 1 RepeatMasker DNA/MITE 13.4 NA rnd-1_family-53 . DNA/MITE rnd-1_family-53
## 2 RepeatMasker SINE 20.8 NA rnd-1_family-17 . SINE rnd-1_family-17
## 3 RepeatMasker Satellite 19.8 NA rnd-1_family-80 . Satellite rnd-1_family-80
## 4 RepeatMasker Unknown 11.6 NA rnd-4_family-1996 . Unknown rnd-4_family-1996
## 5 RepeatMasker Unknown 25.3 NA rnd-1_family-105 . Unknown rnd-1_family-105
## ... ... ... ... ... ... ... ... ...
## 91850 RepeatMasker SINE 11.5 NA rnd-1_family-17 . SINE rnd-1_family-17
## 91851 RepeatMasker PLE/Naiad 16.1 NA rnd-5_family-1125 . PLE/Naiad rnd-5_family-1125
## 91852 RepeatMasker LINE/CR1 25.2 NA rnd-1_family-633 . LINE/CR1 rnd-1_family-633
## 91853 RepeatMasker Unknown 25.2 NA rnd-5_family-1813 . Unknown rnd-5_family-1813
## 91854 RepeatMasker Unknown 5.2 NA rnd-1_family-217 . Unknown rnd-1_family-217
The last function, mcols(), returns a table of metadata,
where we can store and add any kind of extra information for each
coordinate or region.
We can easily manipulate information stored in this section, like this:
##
## DNA/MITE SINE Satellite Unknown Simple_repeat LINE/CR1
## 4527 3175 1523 23345 4529 2598
## DNA/Merlin LINE/RTE-RTE ncRNA DNA/hAT-Ac Low_complexity PLE/Naiad
## 301 1539 1079 2506 483 450
## DNA/TcMar-Tc4 DNA/TcMar-Mariner LINE/RTE-BovB DNA/TcMar-Tc1 DNA LINE/Penelope
## 871 3250 556 1272 94 1466
## LINE LTR DNA/MULE-MuDR exon CDS mRNA
## 183 46 385 16008 15638 1400
## gene five_prime_UTR three_prime_UTR DNA/MITE? LTR/Pao LTR/Gypsy
## 996 815 855 474 586 437
## RC/Helitron DNA/TcMar-ISRm11 LTR/DIRS DNA/TcMar-Pogo rRNA PLE/Chlamys
## 137 205 19 25 12 36
## LINE/R1 LINE/I-Jockey LTR/Copia snRNA DNA/PiggyBac DNA/PIF-Harbinger
## 5 8 11 4 4 1
Clearly there are many more of these columns than we actually need. If you just want to select a few, and remove the rest, you can do something like this:
## DataFrame with 91854 rows and 4 columns
## source type ID class
## <factor> <factor> <character> <character>
## 1 RepeatMasker DNA/MITE rnd-1_family-53 DNA/MITE
## 2 RepeatMasker SINE rnd-1_family-17 SINE
## 3 RepeatMasker Satellite rnd-1_family-80 Satellite
## 4 RepeatMasker Unknown rnd-4_family-1996 Unknown
## 5 RepeatMasker Unknown rnd-1_family-105 Unknown
## ... ... ... ... ...
## 91850 RepeatMasker SINE rnd-1_family-17 SINE
## 91851 RepeatMasker PLE/Naiad rnd-5_family-1125 PLE/Naiad
## 91852 RepeatMasker LINE/CR1 rnd-1_family-633 LINE/CR1
## 91853 RepeatMasker Unknown rnd-5_family-1813 Unknown
## 91854 RepeatMasker Unknown rnd-1_family-217 Unknown
type column equals LTR?
type contains the word LTR.III?The next step we need to do is to load all the information about the
genome positions to which our reads aligned. For this we are going to
use sequences that were trimmed and filtered as in the last practical,
and then aligned against the H. bakeri genome. Please download
the BAM file directly from here: Pure_EV_r1.trimmed.bam. You can
use wget on your command line, or download it and move it
to your working directory. Inside R we can create an object which
contains the filename. If you decide to use a different filename, just
make sure that you change the code to the same name.
We are now going to use the GenomicAlignments package to
load all the aligments present in the BAM file into R. This package
should allow us to load BAM files of any sort, including those resulting
from aligning paired-end reads.
## GAlignments object with 376957 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width njunc
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer>
## [1] I + 23M 23 24313 24335 23 0
## [2] I + 22M 22 25909 25930 22 0
## [3] I + 22M 22 26082 26103 22 0
## [4] I + 22M 22 26113 26134 22 0
## [5] I + 22M 22 26113 26134 22 0
## ... ... ... ... ... ... ... ... ...
## [376953] III - 23M 23 9999761 9999783 23 0
## [376954] III - 21M 21 9999761 9999781 21 0
## [376955] III - 22M 22 9999900 9999921 22 0
## [376956] III - 21M 21 9999901 9999921 21 0
## [376957] III - 19M 19 9999932 9999950 19 0
## -------
## seqinfo: 3 sequences from an unspecified genome
A slight problem is that the default object class is
GAlignments. In our case we are better off converting to a
GRanges object that (hopefully!) we are all more
comfortable with.
## GRanges object with 376957 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] I 24313-24335 +
## [2] I 25909-25930 +
## [3] I 26082-26103 +
## [4] I 26113-26134 +
## [5] I 26113-26134 +
## ... ... ... ...
## [376953] III 9999761-9999783 -
## [376954] III 9999761-9999781 -
## [376955] III 9999900-9999921 -
## [376956] III 9999901-9999921 -
## [376957] III 9999932-9999950 -
## -------
## seqinfo: 3 sequences from an unspecified genome
We can use the countOverlaps() function to find out how
many of our reads fall within (or nearby) each of our annotated regions.
We can then store this result directly as an extra column of our
annotation.
WARNING: Although we are going to ignore this for now, many of our annotations could actually be overlapping each other. Any read that maps to these overlapping regions will be counted again and again for each overlapping region.
## DataFrame with 91854 rows and 5 columns
## source type ID class counts
## <factor> <factor> <character> <character> <integer>
## 1 RepeatMasker DNA/MITE rnd-1_family-53 DNA/MITE 0
## 2 RepeatMasker SINE rnd-1_family-17 SINE 0
## 3 RepeatMasker Satellite rnd-1_family-80 Satellite 0
## 4 RepeatMasker Unknown rnd-4_family-1996 Unknown 0
## 5 RepeatMasker Unknown rnd-1_family-105 Unknown 0
## ... ... ... ... ... ...
## 91850 RepeatMasker SINE rnd-1_family-17 SINE 0
## 91851 RepeatMasker PLE/Naiad rnd-5_family-1125 PLE/Naiad 0
## 91852 RepeatMasker LINE/CR1 rnd-1_family-633 LINE/CR1 0
## 91853 RepeatMasker Unknown rnd-5_family-1813 Unknown 0
## 91854 RepeatMasker Unknown rnd-1_family-217 Unknown 0
If you receive a Warning with the above code you can ignore it for now. This warning arises because the “chromosomes” (reference sequences) are not identical between the annotation and the reads. R is warning us that we might be comparing different genomes, but we know that this is not so in our case.
One way of visualising these counts is to aggregate them in some way. We could do this by biotype, or even by individual gene.
If we only wanted to count reads in genes, we would first want to select the part of the object that corresponds to genes.
## id x
## 1 nxHelBake1.g1 1
## 2 nxHelBake1.g10 234
## 3 nxHelBake1.g100 0
## 4 nxHelBake1.g10000 48
## 5 nxHelBake1.g10001 0
## 6 nxHelBake1.g10002 83
We could then use these geneCounts directly to perform
differential gene expression analysis.
On the other hand, it can be quite useful to plot the distribution of our reads among different types of annotation.
## type x
## 1 DNA/MITE 542
## 2 SINE 465
## 3 Satellite 4721
## 4 Unknown 18221
## 5 Simple_repeat 557
## 6 LINE/CR1 503
## 7 DNA/Merlin 11
## 8 LINE/RTE-RTE 4412
## 9 ncRNA 9484
## 10 DNA/hAT-Ac 380
## 11 Low_complexity 5
## 12 PLE/Naiad 17
## 13 DNA/TcMar-Tc4 1825
## 14 DNA/TcMar-Mariner 2771
## 15 LINE/RTE-BovB 122750
## 16 DNA/TcMar-Tc1 400
## 17 DNA 2
## 18 LINE/Penelope 283
## 19 LINE 2
## 20 LTR 3
## 21 DNA/MULE-MuDR 73
## 22 exon 1916
## 23 CDS 1078
## 24 mRNA 113982
## 25 gene 74640
## 26 five_prime_UTR 282
## 27 three_prime_UTR 562
## 28 DNA/MITE? 328
## 29 LTR/Pao 131
## 30 LTR/Gypsy 190
## 31 RC/Helitron 684
## 32 DNA/TcMar-ISRm11 1
## 33 LTR/DIRS 0
## 34 DNA/TcMar-Pogo 0
## 35 rRNA 20
## 36 PLE/Chlamys 1
## 37 LINE/R1 0
## 38 LINE/I-Jockey 0
## 39 LTR/Copia 0
## 40 snRNA 0
## 41 DNA/PiggyBac 0
## 42 DNA/PIF-Harbinger 0
typeCounts, make a
pie chart like the one below.
The process of assigning mapped reads to a particular gene (to count it’s expression) is much more complex than what we have done in this practical. Think about the following questions:
There are various programs that specialise in obtaining gene
expression values from mapped reads. These allow us to control many of
the issues brought up by the above questions. One good option to do this
inside R is the featureCounts() function from the
Rsubread package which we will be using later.
You might have realised this just by looking at the names of
annotated elements in the type column, which include
exon, CDS, mRNA,
gene, five_prime_UTR and
three_prime_UTR. A gene is transcribed into an mRNA, and an
mRNA is composed of exons. Within the mRNA we can have a 5’ UTR, a CDS
and a 3’UTR. So there is bound to be a lot of redundancy here.
To make this very clear, have a look at the following image, based on an actual segment of the genome annotation which you can explore here: IGV g199
The previous pie chart also suggests this redundancy, with both
gene and mRNA appearing as large slices. The
example figure can help you see why mRNA is bigger than
gene.
In many cases when we are exploring our own data in new ways, we will
need to think about how to deal with redundancy. In this case involving
protein-coding genes, the easiest way is to keep only the
exon category, since these should encompass all the
relevant bases in the genome (and will definitely include all the
CDS, five_prime_UTR and
three_prime_UTR).
The mRNA and gene categories are definitely
unnecessary and can be quite dangerous since they include introns (not
annotated as a separate category). In H. bakeri many
transposons reside within introns. The slice of the pie labeled as
gene will be inflated due to reads mapping to transposons
that reside in intron.
On the other hand, we might be interested in studying introns as a category. There are many regulatory elements in introns. Other functional elements like miRNAs or transposons can reside in introns and some might be missed in our annotation.
Let’s first pre-define annotation categories that we want to remove, and count how many annotation elements we would remove.
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
## [1] 19704
## [1] 72150
The %in% function allows us to search for matching words
between the left-hand object (annot$type in this case) and
the right-hand object (toRem). When an item matches, we get
TRUE, and when it doesn’t we get FALSE. It’s
easy to invert these by using the ! operator (which we read
as not).
Let’s use this result to remove the redundant categories.
## [1] 91854
## [1] 72150
Finally, let’s count the reads again using this new annotation, and see how the pie changes. Exactly the same code:
mcols(annotNR)$counts <- countOverlaps(annotNR, readsGR)
typeCountsNR <- aggregate(mcols(annotNR)$counts, by=list("type"=mcols(annotNR)$type), sum)
minCount <- 1600
typeCountsNRHigh <- typeCountsNR[typeCountsNR$x > minCount,]
typeCountsNRHigh <- typeCountsNRHigh[order(typeCountsNRHigh$x),]
typeCountsNRHigh <- rbind(data.frame("type"="other",
"x"=sum(typeCountsNR$x[typeCountsNR$x <= minCount])),
typeCountsNRHigh)
pie(typeCountsNRHigh$x, labels=typeCountsNRHigh$type, col=rev(rainbow(nrow(typeCountsNRHigh))),
main="Number of aligned reads per type (NR)", cex=0.9)A drastically different picture, I’m sure you’ll agree.
But as mentioned above, by removing the gene/mRNA
categories, we have removed introns. We can confirm just how many
annotated bases we have removed, by a simple calculation taking
advantage of the reduce function which collapses all
overlaps.
nBasesAll <- sum(width(reduce(annot)))
nBasesNR <- sum(width(reduce(annotNR)))
missingBases <- nBasesAll - nBasesNR
missingPerc <- missingBases / nBasesAll * 100
print(paste("Percent of annotated bases now missing:",round(missingPerc,1)))## [1] "Percent of annotated bases now missing: 34.6"
This is definitely more advanced, but let’s look at how we would go about defining the regions in the genome that belong to introns. What is an intron? One way of thinking about this is that introns are the spaces between exons. So if we start with our mRNA positions, and remove the exons, we should be left with the introns. Can we do this? Yes!
mrnas <- reduce(annot[annot$type == "mRNA"])
exons <- reduce(annot[annot$type == "exon"])
introns <- setdiff(mrnas, exons)
introns## GRanges object with 8930 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] I 250694-251696 +
## [2] I 251849-255490 +
## [3] I 255574-255651 +
## [4] I 255775-260767 +
## [5] I 260857-261021 +
## ... ... ... ...
## [8926] III 9977451-9977673 -
## [8927] III 9977805-9977866 -
## [8928] III 9977909-9980742 -
## [8929] III 9980881-9980934 -
## [8930] III 9981167-9983991 -
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
The setdiff function will first collapse both sets of
ranges, and then define all the fragments of the first that are not
present in the second. By necessity this kind of operation will discard
the annotation present in the mcols.
Since we know that there can be other annotation categories within
the introns (especially transposons) we might as well clean up these
introns before we use them. This is pretty easy, since we just want to
remove any part of the introns that overlaps with anything currently in
our NR annotation. We can use the setdiff function
again.
Now let’s confirm that the number of bases that we initially lost have been recovered by estimating the introns.
## [1] 10710622
## [1] 10710622
Pretty good!
Because we have lost all annotation, we need to create some values
for the mcols before we can combine them all together. If
we don’t do this now, we end up with all the values filled with
NA (and this could cause problems later on)
mcols(introns)$source <- "manual"
mcols(introns)$type <- "introns"
mcols(introns)$ID <- "introns"
mcols(introns)$class <- "protein_coding"
mcols(introns)$counts <- 0We can now add our new introns onto our general annotation, and we can repeat our analysis.
## GRanges object with 93363 ranges and 5 metadata columns:
## seqnames ranges strand | source type ID class counts
## <Rle> <IRanges> <Rle> | <factor> <factor> <character> <character> <numeric>
## [1] I 20308-20494 + | RepeatMasker DNA/MITE rnd-1_family-53 DNA/MITE 0
## [2] I 24008-24205 + | RepeatMasker SINE rnd-1_family-17 SINE 0
## [3] I 24965-25055 + | RepeatMasker Satellite rnd-1_family-80 Satellite 0
## [4] I 25211-25296 + | RepeatMasker Unknown rnd-4_family-1996 Unknown 0
## [5] I 25610-25840 + | RepeatMasker Unknown rnd-1_family-105 Unknown 0
## ... ... ... ... . ... ... ... ... ...
## [93359] III 9979957-9980742 - | manual introns introns protein_coding 0
## [93360] III 9980881-9980934 - | manual introns introns protein_coding 0
## [93361] III 9981167-9981317 - | manual introns introns protein_coding 0
## [93362] III 9981691-9981737 - | manual introns introns protein_coding 0
## [93363] III 9981914-9983991 - | manual introns introns protein_coding 0
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
mcols(annotNR_introns)$counts <- countOverlaps(annotNR_introns, readsGR)
typeCountsNR <- aggregate(mcols(annotNR_introns)$counts, by=list("type"=mcols(annotNR_introns)$type), sum)
minCount <- 1600
typeCountsNRHigh <- typeCountsNR[typeCountsNR$x > minCount,]
typeCountsNRHigh <- typeCountsNRHigh[order(typeCountsNRHigh$x),]
typeCountsNRHigh <- rbind(data.frame("type"="other",
"x"=sum(typeCountsNR$x[typeCountsNR$x <= minCount])),
typeCountsNRHigh)
pie(typeCountsNRHigh$x, labels=typeCountsNRHigh$type, col=rev(rainbow(nrow(typeCountsNRHigh))),
main="Number of aligned reads per type (NR with introns)", cex=0.9)
The picture changes once again! But we are getting closer to having a realistic view of how our sRNAs are spread across the genomic annotation.
As you can hopefully see, this is by no means trivial, and requires a lot of work to weed out problematic (redundant or overlapping) annotation, as well as adding important annotation that we might be missing.
We don’t have enough space or time to go into greater depth, but I’ll just remind everyone here of three problems that remain in our annotation:
There are still overlaps in our annotation. But careful use of
reduce and setdiff functions can fix this if
needed.
We are not counting reads that fall outside of our annotation,
what we would call intergenic regions. If we just want to
capture them, we can use the gaps function. Alternatively
we might want intergenic regions that are upstream/downstream
of genes (e.g. promoters). For this, we have a promoters
function as well as a more generic flank function.
Many siRNAs are produced in anti-sense to their primary RNA. We have not counted these so far (by default only overlaps on the same strand are counted.
To illustrate this last problem, I’ll compare the previous results
(counting overlaps on the same or sense strand) to the same
calculation but now counting reads on the opposite or
anti-sense strand of each annotation. For this I will take
advantage of an invertStrand function.
mcols(annotNR_introns)$counts <- countOverlaps(annotNR_introns, readsGR, minoverlap=10)
mcols(annotNR_introns)$counts_AS <- countOverlaps(annotNR_introns, invertStrand(readsGR), minoverlap=10)To simplify the results, I will also convert annotation types to a simpler version, where I am collapsing many kinds of transposable elements into more general categories.
mcols(annotNR_introns)$simple_type <- annotNR_introns$type
annotNR_introns$simple_type[grepl("LINE|PLE",annotNR_introns$type)] <- "LINE"
annotNR_introns$simple_type[grepl("LTR",annotNR_introns$type)] <- "LTR"
annotNR_introns$simple_type[grepl("DNA|RC",annotNR_introns$type)] <- "DNA"
annotNR_introns$simple_type <- factor(annotNR_introns$simple_type)The rest of the code is the same as before:
typeCountsNR_S <- aggregate(mcols(annotNR_introns)$counts, by=list("type"=mcols(annotNR_introns)$simple_type), sum)
typeCountsNR_AS <- aggregate(mcols(annotNR_introns)$counts_AS, by=list("type"=mcols(annotNR_introns)$simple_type), sum)
all(typeCountsNR_S$type == typeCountsNR_AS$type)## [1] TRUE
allCounts <- cbind(typeCountsNR_S$x, typeCountsNR_AS$x)
rownames(allCounts) <- typeCountsNR_S$type
colnames(allCounts) <- c("sense","anti-sense")
cols <- rainbow(nrow(allCounts))
barplot(cbind(allCounts,NA), beside=TRUE, las=1, col=cols, legend.text=TRUE,
main="Aligned reads on sense vs anti-sense strands")As mentioned briefly above, there are more specialised programs (in
R, but also on the Linux command line) for counting the overlaps between
reads and genomic regions. The one that I would currently recommend is
called featureCounts and is part of the
Rsubread package that you should have already
installed.
There are two main advantages over the countOverlaps
function we have been using so far:
featureCounts gives us much more flexibility to
configure the kind of overlaps we want to count. For example:featureCounts is way more computationally efficient: it
is faster, and uses less memory.The main disadvantage is that we have to spend some time learning to
use it, probably by reading the manual or help page:
?featureCounts. One key difference is that it does not use
GRanges objects. It will read the BAM files directly from
disk, and the annotation needs to be provided either as a GTF/GFF file
(I do not recommend this, due to the overlap issues mentioned above) or
a simple table format in R called SAF (Simplified Annotation
Format).
Let’s show all this in action with the data we have been working with. As always, we first need to load any required libraries.
Next we create the saf object with the annotation we
want. Note that featureCounts will combine all the counts
at the GeneID level. Here we take advantage of this to
count by the simple_type of annotation we have previously
defined. More traditionally featureCounts is used for
counting gene expression in RNA-Seq experiments, and this column would
be used to combine the counts of any exon belonging to the same
GeneID.
saf <- data.frame("GeneID" = annotNR_introns$simple_type,
"Chr" = seqnames(annotNR_introns),
"Start" = start(annotNR_introns),
"End" = end(annotNR_introns),
"Strand" = strand(annotNR_introns)
)
head(saf)## GeneID Chr Start End Strand
## 1 DNA I 20308 20494 +
## 2 SINE I 24008 24205 +
## 3 Satellite I 24965 25055 +
## 4 Unknown I 25211 25296 +
## 5 Unknown I 25610 25840 +
## 6 Simple_repeat I 25843 25894 +
Now let’s run the function. Check the help page for details about the arguments I’m using.
fcRes <- featureCounts(bamFile, annot.ext=saf, allowMultiOverlap=TRUE, largestOverlap=TRUE,
fraction=TRUE, minOverlap=10,
strandSpecific=2, nthreads=2)##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.22.1
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 1 BAM file ||
## || ||
## || Pure_EV_r1.trimmed.bam ||
## || ||
## || Paired-end : no ||
## || Count read pairs : no ||
## || Annotation : R data.frame ||
## || Dir for temp files : . ||
## || Threads : 2 ||
## || Level : meta-feature level ||
## || Multimapping reads : counted (fractional) ||
## || Multi-overlapping reads : counted ||
## || Min overlapping bases : 10 ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file .Rsubread_UserProvidedAnnotation_pid59217 ... ||
## || Features : 93363 ||
## || Meta-features : 13 ||
## || Chromosomes/contigs : 3 ||
## || ||
## || Process BAM file Pure_EV_r1.trimmed.bam... ||
## || Strand specific : reversely stranded ||
## || Single-end reads are included. ||
## || Total alignments : 376957 ||
## || Successfully assigned alignments : 225821 (59.9%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Write the final count table. ||
## || Write the read assignment summary. ||
## || ||
## \\============================================================================//
The result object contains several things.
## [1] "counts" "annotation" "targets" "stat"
## Status Pure_EV_r1.trimmed.bam
## 1 Assigned 225821
## 2 Unassigned_Unmapped 0
## 3 Unassigned_Read_Type 0
## 4 Unassigned_Singleton 0
## 5 Unassigned_MappingQuality 0
## 6 Unassigned_Chimera 0
## 7 Unassigned_FragmentLength 0
## 8 Unassigned_Duplicate 0
## 9 Unassigned_MultiMapping 0
## 10 Unassigned_Secondary 0
## 11 Unassigned_NonSplit 0
## 12 Unassigned_NoFeatures 149503
## 13 Unassigned_Overlapping_Length 1633
## 14 Unassigned_Ambiguity 0
## Pure_EV_r1.trimmed.bam
## DNA 29799.33
## SINE 282.50
## Satellite 2197.17
## Unknown 22447.67
## Simple_repeat 69.00
## LINE 35321.33
## ncRNA 69.67
## Low_complexity 0.00
## LTR 45499.50
## exon 38062.83
## rRNA 4.00
## snRNA 1.00
## introns 52067.00
tmpCounts <- cbind(allCounts, fcRes$counts[rownames(allCounts),])
colnames(tmpCounts)[3] <- "featureCounts"
barplot(cbind(tmpCounts,NA), beside=TRUE, las=1, col=cols, legend.text=TRUE,
main="featureCounts result on the anti-sense strand")There are some small differences between the two approaches, caused
by the level of overlaps between our annotation. Remember that
featureCounts will not be counting the same read more than
once, instead it will prefer to assign the count to the longest overlap,
and if not it will assign a fractional count to multiple annotations.
The countOverlaps function will count the same read as many
times as it finds overlaps.
I would always recommend using featureCounts since it
was designed exactly for these purposes.
To help illustrate the IRanges examples at the start of
this practical, I used the following function from the manual of the
IRanges package. We haven’t discussed creating our own
functions in this workshop, but you can copy/paste it into your code if
you want to use it.
plotRanges <- function(x, xlim=x, main=deparse(substitute(x)), col="black", sep=0.2, ...) {
height <- 1
if (is(xlim,"IntegerRanges")) {
xlim <- c(min(start(xlim)), max(end(xlim)))
}
bins <- disjointBins(IRanges(start(x), end(x) + 1))
plot.new()
plot.window(xlim, c(0, max(bins)*(height + sep)))
ybottom <- bins * (sep + height)- height
rect(start(x)-0.1, ybottom, end(x)+0.1, ybottom + height, col=col, ...)
title(main)
axis(1)
}And a basic example:
x <- IRanges(start=c(11,35,40), end=c(20,50,63))
plotRanges(x, col=4, xlim=c(0,80), main="IRanges example")