The purpose of this practical is to use our genome’s annotation to catalogue our sequenced reads. At the end of this practical you will produce a figure showing the relative amount of reads that can be assigned to each type of annotation (coding regions, miRNAs, rRNA, snoRNA, etc). We will do all of this inside R and using some Bioconductor packages.
Exactly the same strategy can be used even when no genome is available, as long as you have some sort of annotation regarding the reference sequences you are using (which could be contigs, ESTs, cDNAs, etc).
During this practical you will need the following Bioconductor packages: IRanges, rtracklayer, GenomicRanges, Rsamtools, GenomicAlignments
If you haven’t installed them before, you should do so now:
BiocManager::install(c("IRanges", "rtracklayer", "GenomicRanges", "Rsamtools", "GenomicAlignments"))In this practical we will be manipulating different types of information that can be represented as ranges (gene coordinates, positions of mapped sequences, etc). One of the most flexible ways of working with ranges in R is to use the IRanges package.
library(IRanges)To begin, let’s create an IRanges object, by defining the start and end positions of 3 ranges. Don’t worry about what these ranges mean right now, we will get to it.
x <- IRanges(start=c(11,35,40), end=c(20,50,63))
x## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 11 20 10
## [2] 35 50 16
## [3] 40 63 24
There are several functions that we can use to extract information from an IRanges object:
start(x) # All the start positions## [1] 11 35 40
end(x) # All the end positions## [1] 20 50 63
width(x) # The width of each range## [1] 10 16 24
range(x) # The combined range (smallest start to largest end)## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 11 63 53
There are also some functions that can do some interesting operations:
coverage(x) # The sum of how many ranges overlap each position## integer-Rle of length 63 with 6 runs
## Lengths: 10 10 14 5 11 13
## Values : 0 1 0 1 2 1
reduce(x) # Fuses all overlapping ranges## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 11 20 10
## [2] 35 63 29
And now to give you a closer idea to what we will be doing with these objects:
exons <- reduce(x)
reads <- IRanges(start=c(1,21,30,50,80), width=20)
reads## IRanges object with 5 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 20 20
## [2] 21 40 20
## [3] 30 49 20
## [4] 50 69 20
## [5] 80 99 20
countOverlaps(exons, reads)## [1] 1 3
For more details, you can have a look at the IRanges vignette (browseVignettes(package="IRanges")).
The first thing we need is information about the genome elements we are interested in studying. For this project, we are interested in the annotation of the mouse and H. bakeri genomes. For each region of a genome we are interested in, we need to know it’s coordinates (what chromosome, start and end positions) and the strand on which it lies. We also want to know what type of element is encoded there: a protein? a non-coding RNA?
There are many ways to obtain these types of information. One very useful resource is Ensembl, where we can download a GTF/GFF (Gene Transfer/Feature Format) file with all the annotation information of a genome. Unfortunately, Ensembl does not include all possible genomes, and in particular H. bakeri is missing. There are similar resources though, and WormBase ParaSite does include the genome of H. bakeri.
For this practical, I am including here links to the files for our two genomes:
We will now use a function from the rtracklayerpackage to import the data into R. We will also be using the GenomicRanges package, so let’s load them both.
library(rtracklayer)
library(GenomicRanges)hbak <- import("Heligmosomoides_bakeri.gff3.gz")save(hbak, file="hbak.Rdata")If for whatever reason you were not able to load the gff3 file as above, you can download the Rdata file I saved with the previous instruction hbak.Rdata, and load it directly into your R session. If you did load the file correctly with the
importfunction, then do NOT use the following command.
load("hbak.Rdata")In either case, you should now be able to view all the loaded annotation:
hbak## GRanges object with 25747 ranges and 13 metadata columns:
## seqnames ranges strand | source type score phase ID
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer> <character>
## [1] nHp.2.0.scaf00728 9194-9469 + | maker gene NA <NA> HPOL_0000914501
## [2] nHp.2.0.scaf00728 9194-9469 + | maker mRNA NA <NA> HPOL_0000914501-mRNA-1
## [3] nHp.2.0.scaf00728 9194-9302 + | maker exon NA <NA> HPOL_0000914501-mRNA..
## [4] nHp.2.0.scaf00728 9194-9302 + | maker CDS NA 0 HPOL_0000914501-mRNA..
## [5] nHp.2.0.scaf00728 9360-9469 + | maker exon NA <NA> HPOL_0000914501-mRNA..
## ... ... ... ... . ... ... ... ... ...
## [25743] nHp.2.0.scaf02062 2356-2448 - | rtracklayer mapmiRNA NA <NA> <NA>
## [25744] nHp.2.0.scaf02588 33853-33908 - | rtracklayer miRNA NA <NA> <NA>
## [25745] nHp.2.0.scaf03626 1418-1488 + | rtracklayer tRNA NA <NA> <NA>
## [25746] nHp.2.0.scaf04380 11594-11691 + | rtracklayer miRNA NA <NA> <NA>
## [25747] nHp.2.0.scaf04380 15930-15998 + | rtracklayer piRNA NA <NA> <NA>
## Name Parent _AED _eAED _QI Target
## <character> <CharacterList> <character> <character> <character> <character>
## [1] HPOL_0000914501 <NA> <NA> <NA> <NA>
## [2] HPOL_0000914501-mRNA-1 HPOL_0000914501 1.00 1.00 0|0|0|0|0|0|2|0|72 <NA>
## [3] <NA> HPOL_0000914501-mRNA-1 <NA> <NA> <NA> <NA>
## [4] <NA> HPOL_0000914501-mRNA-1 <NA> <NA> <NA> <NA>
## [5] <NA> HPOL_0000914501-mRNA-1 <NA> <NA> <NA> <NA>
## ... ... ... ... ... ... ...
## [25743] <NA> . <NA> <NA> <NA> <NA>
## [25744] <NA> . <NA> <NA> <NA> <NA>
## [25745] <NA> . <NA> <NA> <NA> <NA>
## [25746] <NA> . <NA> <NA> <NA> <NA>
## [25747] <NA> . <NA> <NA> <NA> <NA>
## rep_name class
## <character> <character>
## [1] <NA> <NA>
## [2] <NA> <NA>
## [3] <NA> <NA>
## [4] <NA> <NA>
## [5] <NA> <NA>
## ... ... ...
## [25743] <NA> <NA>
## [25744] <NA> <NA>
## [25745] <NA> <NA>
## [25746] <NA> <NA>
## [25747] <NA> <NA>
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
You will see that this object belongs to the GRanges class, which is very useful to store information about coordinates on genomes (so, most kinds of annotation)
We can access different parts of the object with the following functions:
seqnames(hbak)## factor-Rle of length 25747 with 58 runs
## Lengths: 230 188 111 ... 1 2
## Values : nHp.2.0.scaf00728 nHp.2.0.scaf00765 nHp.2.0.scaf01123 ... nHp.2.0.scaf03626 nHp.2.0.scaf04380
## Levels(20): nHp.2.0.scaf00001 nHp.2.0.scaf00002 nHp.2.0.scaf00033 ... nHp.2.0.scaf03626 nHp.2.0.scaf04380
ranges(hbak)## IRanges object with 25747 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 9194 9469 276
## [2] 9194 9469 276
## [3] 9194 9302 109
## [4] 9194 9302 109
## [5] 9360 9469 110
## ... ... ... ...
## [25743] 2356 2448 93
## [25744] 33853 33908 56
## [25745] 1418 1488 71
## [25746] 11594 11691 98
## [25747] 15930 15998 69
strand(hbak)## factor-Rle of length 25747 with 11266 runs
## Lengths: 6 74 18 25 46 61 38 46 71 29 8 40 6 ... 3 2 7 7 16 2 11 14 2 1 2 2 3
## Values : + - + - + - + - + - + - + ... + - + - + - + - + - + - +
## Levels(3): + - *
mcols(hbak)## DataFrame with 25747 rows and 13 columns
## source type score phase ID Name Parent
## <factor> <factor> <numeric> <integer> <character> <character> <CharacterList>
## 1 maker gene NA NA HPOL_0000914501 HPOL_0000914501
## 2 maker mRNA NA NA HPOL_0000914501-mRNA-1 HPOL_0000914501-mRNA-1 HPOL_0000914501
## 3 maker exon NA NA HPOL_0000914501-mRNA.. NA HPOL_0000914501-mRNA-1
## 4 maker CDS NA 0 HPOL_0000914501-mRNA.. NA HPOL_0000914501-mRNA-1
## 5 maker exon NA NA HPOL_0000914501-mRNA.. NA HPOL_0000914501-mRNA-1
## ... ... ... ... ... ... ... ...
## 25743 rtracklayer mapmiRNA NA NA NA NA .
## 25744 rtracklayer miRNA NA NA NA NA .
## 25745 rtracklayer tRNA NA NA NA NA .
## 25746 rtracklayer miRNA NA NA NA NA .
## 25747 rtracklayer piRNA NA NA NA NA .
## _AED _eAED _QI Target rep_name class
## <character> <character> <character> <character> <character> <character>
## 1 NA NA NA NA NA NA
## 2 1.00 1.00 0|0|0|0|0|0|2|0|72 NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## ... ... ... ... ... ... ...
## 25743 NA NA NA NA NA NA
## 25744 NA NA NA NA NA NA
## 25745 NA NA NA NA NA NA
## 25746 NA NA NA NA NA NA
## 25747 NA NA NA NA NA NA
The last function, mcols(), returns a table of metadata, where we can store and add any kind of extra information for each coordinate or region.
We can easily manipulate information stored in this section, like this:
table(mcols(hbak)$type)##
## gene mRNA exon CDS three_prime_UTR five_prime_UTR rep_unk_S
## 350 378 2676 2892 246 138 12580
## rep_dna_S rep_rna_S rep_simple rep_sat_S tRNA piRNA miRNA
## 2807 2116 1316 9 166 38 28
## snRNA rRNA other_ncRNA mapmiRNA
## 1 3 1 2
Clearly there are many more of these columns than we actually need. If you just want to select a few, and remove the rest, you can do something like this:
mcols(hbak) <- mcols(hbak)[,c("source","type","ID","Name","rep_name","class")]
mcols(hbak)## DataFrame with 25747 rows and 6 columns
## source type ID Name rep_name class
## <factor> <factor> <character> <character> <character> <character>
## 1 maker gene HPOL_0000914501 HPOL_0000914501 NA NA
## 2 maker mRNA HPOL_0000914501-mRNA-1 HPOL_0000914501-mRNA-1 NA NA
## 3 maker exon HPOL_0000914501-mRNA.. NA NA NA
## 4 maker CDS HPOL_0000914501-mRNA.. NA NA NA
## 5 maker exon HPOL_0000914501-mRNA.. NA NA NA
## ... ... ... ... ... ... ...
## 25743 rtracklayer mapmiRNA NA NA NA NA
## 25744 rtracklayer miRNA NA NA NA NA
## 25745 rtracklayer tRNA NA NA NA NA
## 25746 rtracklayer miRNA NA NA NA NA
## 25747 rtracklayer piRNA NA NA NA NA
The next step we need to do is to load all the information about the genome positions to which our reads aligned. For this we are going to use sequences that were trimmed and filtered as in the last practical, and then aligned against the mouse genome. Please download the BAM file directly from here: Pure_EV_r1_100k.bam. You can use wget on your command line, or download it and move it to your working directory. Inside R we can create an object which contains the filename. If you decide to use a different filename, just make sure that you change the code to the same name.
bamFile <- "Pure_EV_r1_100k.bam"We are going to use the Rsamtools package for importing data from SAM/BAM files. As we saw in the previous practical, these files can store a lot of information, and we will not need all of it in R. We can choose what parts of the file we want to import. In the end, we would like to create a GRanges object with the coordinates to which our short reads align, so as a minimum we will need:
Since the BAM file can also contain unmapped reads (which will not have any of the above information), we also need to tell R that we do not want to try to load these.
library(Rsamtools)
what <- c("rname", "strand", "pos", "qwidth")
param <- ScanBamParam(what=what, flag=scanBamFlag(isUnmappedQuery=FALSE))
bam <- scanBam(bamFile, param=param)
class(bam)## [1] "list"
lapply(bam, names)## [[1]]
## [1] "rname" "strand" "pos" "qwidth"
The scanBam() function gives us a very crude, but quite flexible, interface to the data in a BAM file. The result in R is a list of lists, since the function allows us to retrieve only data from certain ranges (which would be the first level of the list) and then only data from certain fields (the second level of the list). In the above case we are only specifying this second level.
We can now create a new GRanges object with these data, which will allow us to more easily manipulate them.
mapGR <- GRanges(
seqnames = bam[[1]]$rname,
ranges = IRanges(start=bam[[1]]$pos, width=bam[[1]]$qwidth),
strand = bam[[1]]$strand
)
mapGR## GRanges object with 98938 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] nHp.2.0.scaf00001 1779-1799 +
## [2] nHp.2.0.scaf00001 1780-1802 -
## [3] nHp.2.0.scaf00001 2085-2105 +
## [4] nHp.2.0.scaf00001 2275-2298 -
## [5] nHp.2.0.scaf00001 2275-2298 -
## ... ... ... ...
## [98934] nHp.2.0.scaf04380 19333-19357 +
## [98935] nHp.2.0.scaf04380 19333-19354 +
## [98936] nHp.2.0.scaf04380 19335-19357 +
## [98937] nHp.2.0.scaf04380 19339-19360 +
## [98938] nHp.2.0.scaf04380 19339-19361 +
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
We are using [[1]] which indicates the first position of the first level of the list. Since we did not restrict
scanBamto specific ranges (e.g. chromosomes), all data will be saved within this [[1]] slot.
The previous instructions allow us to control exactly what is going into our mapGR object.
There is an easier way to obtain the same result, which is using the GenomicAlignments package (which internally uses scanBam()). This is the recommended way, for simplicity, but also because it deals with different situations to our example, such as paired-end reads, or alignments with gaps.
library(GenomicAlignments)
bamAlign <- readGAlignments(bamFile)
bamAlign## GAlignments object with 98938 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width njunc
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer>
## [1] nHp.2.0.scaf00001 + 21M 21 1779 1799 21 0
## [2] nHp.2.0.scaf00001 - 23M 23 1780 1802 23 0
## [3] nHp.2.0.scaf00001 + 21M 21 2085 2105 21 0
## [4] nHp.2.0.scaf00001 - 24M 24 2275 2298 24 0
## [5] nHp.2.0.scaf00001 - 24M 24 2275 2298 24 0
## ... ... ... ... ... ... ... ... ...
## [98934] nHp.2.0.scaf04380 + 25M 25 19333 19357 25 0
## [98935] nHp.2.0.scaf04380 + 22M 22 19333 19354 22 0
## [98936] nHp.2.0.scaf04380 + 23M 23 19335 19357 23 0
## [98937] nHp.2.0.scaf04380 + 22M 22 19339 19360 22 0
## [98938] nHp.2.0.scaf04380 + 23M 23 19339 19361 23 0
## -------
## seqinfo: 20 sequences from an unspecified genome
The slight problem is that the default object class is GAlignments, but this can be converted to a GRanges object:
mapGR2 <- as(bamAlign, "GRanges")
mapGR2## GRanges object with 98938 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] nHp.2.0.scaf00001 1779-1799 +
## [2] nHp.2.0.scaf00001 1780-1802 -
## [3] nHp.2.0.scaf00001 2085-2105 +
## [4] nHp.2.0.scaf00001 2275-2298 -
## [5] nHp.2.0.scaf00001 2275-2298 -
## ... ... ... ...
## [98934] nHp.2.0.scaf04380 19333-19357 +
## [98935] nHp.2.0.scaf04380 19333-19354 +
## [98936] nHp.2.0.scaf04380 19335-19357 +
## [98937] nHp.2.0.scaf04380 19339-19360 +
## [98938] nHp.2.0.scaf04380 19339-19361 +
## -------
## seqinfo: 20 sequences from an unspecified genome
sum(mapGR == mapGR2)## [1] 98938
all(mapGR == mapGR2)## [1] TRUE
Which you can see produces the same result.
We can use the countOverlaps() function to find out how many of our reads fall within (or nearby) each of our annotated regions. We can then store this result directly as an extra column of our annotation.
WARNING: Although we are going to ignore this for now, many of our annotations could actually be overlapping each other. Any read that maps to these overlapping regions will be counted again and again for each overlapping region.
mcols(hbak)$counts <- countOverlaps(hbak, mapGR)
mcols(hbak)## DataFrame with 25747 rows and 7 columns
## source type ID Name rep_name class counts
## <factor> <factor> <character> <character> <character> <character> <integer>
## 1 maker gene HPOL_0000914501 HPOL_0000914501 NA NA 0
## 2 maker mRNA HPOL_0000914501-mRNA-1 HPOL_0000914501-mRNA-1 NA NA 0
## 3 maker exon HPOL_0000914501-mRNA.. NA NA NA 0
## 4 maker CDS HPOL_0000914501-mRNA.. NA NA NA 0
## 5 maker exon HPOL_0000914501-mRNA.. NA NA NA 0
## ... ... ... ... ... ... ... ...
## 25743 rtracklayer mapmiRNA NA NA NA NA 0
## 25744 rtracklayer miRNA NA NA NA NA 0
## 25745 rtracklayer tRNA NA NA NA NA 0
## 25746 rtracklayer miRNA NA NA NA NA 3
## 25747 rtracklayer piRNA NA NA NA NA 10
If you receive a Warning with the above code you can ignore it for now. This warning arises because the “chromosomes” (reference sequences) are not identical between the annotation and the reads. R is warning us that we might be comparing different genomes, but we know that this is not so in our case.
One way of visualising these counts is to aggregate them in some way. We could do this by biotype, or even by individual gene.
typeCounts <- aggregate(mcols(hbak)$counts, by=list("type"=mcols(hbak)$type), sum)
typeCounts## type x
## 1 gene 15403
## 2 mRNA 16380
## 3 exon 295
## 4 CDS 295
## 5 three_prime_UTR 0
## 6 five_prime_UTR 0
## 7 rep_unk_S 10191
## 8 rep_dna_S 1246
## 9 rep_rna_S 13252
## 10 rep_simple 432
## 11 rep_sat_S 1
## 12 tRNA 711
## 13 piRNA 198
## 14 miRNA 2656
## 15 snRNA 0
## 16 rRNA 100
## 17 other_ncRNA 0
## 18 mapmiRNA 0
If we only wanted to count reads in genes, we would first want to select the part of the object that corresponds to genes.
genes <- hbak[hbak$type == "gene"]geneCounts <- aggregate(mcols(genes)$counts, by=list("id"=mcols(genes)$ID), sum)
head(geneCounts)## id x
## 1 HPOL_0000000001 67
## 2 HPOL_0000000101 0
## 3 HPOL_0000000201 0
## 4 HPOL_0000000301 9
## 5 HPOL_0000000401 0
## 6 HPOL_0000000501 5
We could then use these geneCounts directly to perform differential gene expression analysis (next practical).
On the other hand, it can be quite useful to plot the distribution of our reads amongst different biotypes.
The process of assigning mapped reads to a particular gene (to count it’s expression) is much more complex than what we have done in this practical. Think about the following questions:
There are various programs that specialise in obtaining gene expression values from mapped reads. These allow us to control many of the issues brought up by the above questions. One good option to do this inside R is the featureCounts() function from the Rsubread package.
hbak object.