Genome annotation and SAM/BAM alignments

Introduction

The purpose of this practical is to use our genome’s annotation to catalogue our sequenced reads. At the end of this practical you will produce a figure showing the relative amount of reads that can be assigned to each type of annotation (coding regions, miRNAs, rRNA, snoRNA, etc). We will do all of this inside R and using some Bioconductor packages.

Exactly the same strategy can be used even when no genome is available, as long as you have some sort of annotation regarding the reference sequences you are using (which could be contigs, ESTs, cDNAs, etc).

During this practical you will need the following Bioconductor packages: IRanges, rtracklayer, GenomicRanges, Rsamtools, GenomicAlignments

If you haven’t installed them before, you should do so now:

BiocManager::install(c("IRanges", "rtracklayer", "GenomicRanges", "Rsamtools", "GenomicAlignments"))

Brief introduction to IRanges

In this practical we will be manipulating different types of information that can be represented as ranges (gene coordinates, positions of mapped sequences, etc). One of the most flexible ways of working with ranges in R is to use the IRanges package.

library(IRanges)

To begin, let’s create an IRanges object, by defining the start and end positions of 3 ranges. Don’t worry about what these ranges mean right now, we will get to it.

x <- IRanges(start=c(11,35,40), end=c(20,50,63))

x

## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        11        20        10
##   [2]        35        50        16
##   [3]        40        63        24

There are several functions that we can use to extract information from an IRanges object:

start(x) # All the start positions

## [1] 11 35 40

end(x)   # All the end positions

## [1] 20 50 63

width(x) # The width of each range

## [1] 10 16 24

range(x) # The combined range (smallest start to largest end)

## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        11        63        53

There are also some functions that can do some interesting operations:

coverage(x) # The sum of how many ranges overlap each position

## integer-Rle of length 63 with 6 runs
##   Lengths: 10 10 14  5 11 13
##   Values :  0  1  0  1  2  1

reduce(x)   # Fuses all overlapping ranges

## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        11        20        10
##   [2]        35        63        29

And now to give you a closer idea to what we will be doing with these objects:

exons <- reduce(x)

reads <- IRanges(start=c(1,21,30,50,80), width=20)

reads

## IRanges object with 5 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        20        20
##   [2]        21        40        20
##   [3]        30        49        20
##   [4]        50        69        20
##   [5]        80        99        20

countOverlaps(exons, reads)

## [1] 1 3

For more details, you can have a look at the IRanges vignette (browseVignettes(package="IRanges")).

Working with a genome’s annotation

The first thing we need is information about the genome elements we are interested in studying. For this project, we are interested in the annotation of the mouse and H. bakeri genomes. For each region of a genome we are interested in, we need to know it’s coordinates (what chromosome, start and end positions) and the strand on which it lies. We also want to know what type of element is encoded there: a protein? a non-coding RNA?

There are many ways to obtain these types of information. One very useful resource is Ensembl, where we can download a GTF/GFF (Gene Transfer/Feature Format) file with all the annotation information of a genome. Unfortunately, Ensembl does not include all possible genomes, and in particular H. bakeri is missing. There are similar resources though, and WormBase ParaSite does include the genome of H. bakeri.

For this practical, I am including here links to the files for our two genomes:

Heligmosomoides_bakeri.gff3.gz (for now, only download this file)
Mus_musculus.15_16_17.gff3.gz (for now, do NOT download this file)

We will now use a function from the rtracklayerpackage to import the data into R. We will also be using the GenomicRanges package, so let’s load them both.

library(rtracklayer)
library(GenomicRanges)

hbak <- import("Heligmosomoides_bakeri.gff3.gz")

save(hbak, file="hbak.Rdata")

If for whatever reason you were not able to load the gff3 file as above, you can download the Rdata file I saved with the previous instruction hbak.Rdata, and load it directly into your R session. If you did load the file correctly with the import function, then do NOT use the following command.

load("hbak.Rdata")

In either case, you should now be able to view all the loaded annotation:

hbak

## GRanges object with 25747 ranges and 13 metadata columns:
##                    seqnames      ranges strand |      source     type     score     phase                     ID
##                       <Rle>   <IRanges>  <Rle> |    <factor> <factor> <numeric> <integer>            <character>
##       [1] nHp.2.0.scaf00728   9194-9469      + |       maker     gene        NA      <NA>        HPOL_0000914501
##       [2] nHp.2.0.scaf00728   9194-9469      + |       maker     mRNA        NA      <NA> HPOL_0000914501-mRNA-1
##       [3] nHp.2.0.scaf00728   9194-9302      + |       maker     exon        NA      <NA> HPOL_0000914501-mRNA..
##       [4] nHp.2.0.scaf00728   9194-9302      + |       maker     CDS         NA         0 HPOL_0000914501-mRNA..
##       [5] nHp.2.0.scaf00728   9360-9469      + |       maker     exon        NA      <NA> HPOL_0000914501-mRNA..
##       ...               ...         ...    ... .         ...      ...       ...       ...                    ...
##   [25743] nHp.2.0.scaf02062   2356-2448      - | rtracklayer mapmiRNA        NA      <NA>                   <NA>
##   [25744] nHp.2.0.scaf02588 33853-33908      - | rtracklayer miRNA           NA      <NA>                   <NA>
##   [25745] nHp.2.0.scaf03626   1418-1488      + | rtracklayer tRNA            NA      <NA>                   <NA>
##   [25746] nHp.2.0.scaf04380 11594-11691      + | rtracklayer miRNA           NA      <NA>                   <NA>
##   [25747] nHp.2.0.scaf04380 15930-15998      + | rtracklayer piRNA           NA      <NA>                   <NA>
##                             Name                 Parent        _AED       _eAED                _QI      Target
##                      <character>        <CharacterList> <character> <character>        <character> <character>
##       [1]        HPOL_0000914501                               <NA>        <NA>               <NA>        <NA>
##       [2] HPOL_0000914501-mRNA-1        HPOL_0000914501        1.00        1.00 0|0|0|0|0|0|2|0|72        <NA>
##       [3]                   <NA> HPOL_0000914501-mRNA-1        <NA>        <NA>               <NA>        <NA>
##       [4]                   <NA> HPOL_0000914501-mRNA-1        <NA>        <NA>               <NA>        <NA>
##       [5]                   <NA> HPOL_0000914501-mRNA-1        <NA>        <NA>               <NA>        <NA>
##       ...                    ...                    ...         ...         ...                ...         ...
##   [25743]                   <NA>                      .        <NA>        <NA>               <NA>        <NA>
##   [25744]                   <NA>                      .        <NA>        <NA>               <NA>        <NA>
##   [25745]                   <NA>                      .        <NA>        <NA>               <NA>        <NA>
##   [25746]                   <NA>                      .        <NA>        <NA>               <NA>        <NA>
##   [25747]                   <NA>                      .        <NA>        <NA>               <NA>        <NA>
##              rep_name       class
##           <character> <character>
##       [1]        <NA>        <NA>
##       [2]        <NA>        <NA>
##       [3]        <NA>        <NA>
##       [4]        <NA>        <NA>
##       [5]        <NA>        <NA>
##       ...         ...         ...
##   [25743]        <NA>        <NA>
##   [25744]        <NA>        <NA>
##   [25745]        <NA>        <NA>
##   [25746]        <NA>        <NA>
##   [25747]        <NA>        <NA>
##   -------
##   seqinfo: 20 sequences from an unspecified genome; no seqlengths

You will see that this object belongs to the GRanges class, which is very useful to store information about coordinates on genomes (so, most kinds of annotation)

We can access different parts of the object with the following functions:

seqnames(hbak)

## factor-Rle of length 25747 with 58 runs
##   Lengths:               230               188               111 ...                 1                 2
##   Values : nHp.2.0.scaf00728 nHp.2.0.scaf00765 nHp.2.0.scaf01123 ... nHp.2.0.scaf03626 nHp.2.0.scaf04380
## Levels(20): nHp.2.0.scaf00001 nHp.2.0.scaf00002 nHp.2.0.scaf00033 ... nHp.2.0.scaf03626 nHp.2.0.scaf04380

ranges(hbak)

## IRanges object with 25747 ranges and 0 metadata columns:
##               start       end     width
##           <integer> <integer> <integer>
##       [1]      9194      9469       276
##       [2]      9194      9469       276
##       [3]      9194      9302       109
##       [4]      9194      9302       109
##       [5]      9360      9469       110
##       ...       ...       ...       ...
##   [25743]      2356      2448        93
##   [25744]     33853     33908        56
##   [25745]      1418      1488        71
##   [25746]     11594     11691        98
##   [25747]     15930     15998        69

strand(hbak)

## factor-Rle of length 25747 with 11266 runs
##   Lengths:   6  74  18  25  46  61  38  46  71  29   8  40   6 ...   3   2   7   7  16   2  11  14   2   1   2   2   3
##   Values :   +   -   +   -   +   -   +   -   +   -   +   -   + ...   +   -   +   -   +   -   +   -   +   -   +   -   +
## Levels(3): + - *

mcols(hbak)

## DataFrame with 25747 rows and 13 columns
##            source     type     score     phase                     ID                   Name                 Parent
##          <factor> <factor> <numeric> <integer>            <character>            <character>        <CharacterList>
## 1           maker     gene        NA        NA        HPOL_0000914501        HPOL_0000914501                       
## 2           maker     mRNA        NA        NA HPOL_0000914501-mRNA-1 HPOL_0000914501-mRNA-1        HPOL_0000914501
## 3           maker     exon        NA        NA HPOL_0000914501-mRNA..                     NA HPOL_0000914501-mRNA-1
## 4           maker     CDS         NA         0 HPOL_0000914501-mRNA..                     NA HPOL_0000914501-mRNA-1
## 5           maker     exon        NA        NA HPOL_0000914501-mRNA..                     NA HPOL_0000914501-mRNA-1
## ...           ...      ...       ...       ...                    ...                    ...                    ...
## 25743 rtracklayer mapmiRNA        NA        NA                     NA                     NA                      .
## 25744 rtracklayer miRNA           NA        NA                     NA                     NA                      .
## 25745 rtracklayer tRNA            NA        NA                     NA                     NA                      .
## 25746 rtracklayer miRNA           NA        NA                     NA                     NA                      .
## 25747 rtracklayer piRNA           NA        NA                     NA                     NA                      .
##              _AED       _eAED                _QI      Target    rep_name       class
##       <character> <character>        <character> <character> <character> <character>
## 1              NA          NA                 NA          NA          NA          NA
## 2            1.00        1.00 0|0|0|0|0|0|2|0|72          NA          NA          NA
## 3              NA          NA                 NA          NA          NA          NA
## 4              NA          NA                 NA          NA          NA          NA
## 5              NA          NA                 NA          NA          NA          NA
## ...           ...         ...                ...         ...         ...         ...
## 25743          NA          NA                 NA          NA          NA          NA
## 25744          NA          NA                 NA          NA          NA          NA
## 25745          NA          NA                 NA          NA          NA          NA
## 25746          NA          NA                 NA          NA          NA          NA
## 25747          NA          NA                 NA          NA          NA          NA

The last function, mcols(), returns a table of metadata, where we can store and add any kind of extra information for each coordinate or region.

We can easily manipulate information stored in this section, like this:

table(mcols(hbak)$type)

## 
##            gene            mRNA            exon             CDS three_prime_UTR  five_prime_UTR       rep_unk_S 
##             350             378            2676            2892             246             138           12580 
##       rep_dna_S       rep_rna_S      rep_simple       rep_sat_S            tRNA           piRNA           miRNA 
##            2807            2116            1316               9             166              38              28 
##           snRNA            rRNA     other_ncRNA        mapmiRNA 
##               1               3               1               2

Clearly there are many more of these columns than we actually need. If you just want to select a few, and remove the rest, you can do something like this:

mcols(hbak) <- mcols(hbak)[,c("source","type","ID","Name","rep_name","class")]

mcols(hbak)

## DataFrame with 25747 rows and 6 columns
##            source     type                     ID                   Name    rep_name       class
##          <factor> <factor>            <character>            <character> <character> <character>
## 1           maker     gene        HPOL_0000914501        HPOL_0000914501          NA          NA
## 2           maker     mRNA HPOL_0000914501-mRNA-1 HPOL_0000914501-mRNA-1          NA          NA
## 3           maker     exon HPOL_0000914501-mRNA..                     NA          NA          NA
## 4           maker     CDS  HPOL_0000914501-mRNA..                     NA          NA          NA
## 5           maker     exon HPOL_0000914501-mRNA..                     NA          NA          NA
## ...           ...      ...                    ...                    ...         ...         ...
## 25743 rtracklayer mapmiRNA                     NA                     NA          NA          NA
## 25744 rtracklayer miRNA                        NA                     NA          NA          NA
## 25745 rtracklayer tRNA                         NA                     NA          NA          NA
## 25746 rtracklayer miRNA                        NA                     NA          NA          NA
## 25747 rtracklayer piRNA                        NA                     NA          NA          NA

Exercises:

How can you save a new object where you only keep annotation corresponding to microRNAs?
How about only all annotations on the “-” strand?

Importing alignment results from a BAM file

The next step we need to do is to load all the information about the genome positions to which our reads aligned. For this we are going to use sequences that were trimmed and filtered as in the last practical, and then aligned against the mouse genome. Please download the BAM file directly from here: Pure_EV_r1_100k.bam. You can use wget on your command line, or download it and move it to your working directory. Inside R we can create an object which contains the filename. If you decide to use a different filename, just make sure that you change the code to the same name.

bamFile <- "Pure_EV_r1_100k.bam"

We are going to use the Rsamtools package for importing data from SAM/BAM files. As we saw in the previous practical, these files can store a lot of information, and we will not need all of it in R. We can choose what parts of the file we want to import. In the end, we would like to create a GRanges object with the coordinates to which our short reads align, so as a minimum we will need:

the name of the chromosome (or reference sequence)
the strand
the start of the alignment
the end, or in this case the length of the aligned sequence

Since the BAM file can also contain unmapped reads (which will not have any of the above information), we also need to tell R that we do not want to try to load these.

library(Rsamtools)

what <- c("rname", "strand", "pos", "qwidth")

param <- ScanBamParam(what=what, flag=scanBamFlag(isUnmappedQuery=FALSE))

bam <- scanBam(bamFile, param=param)

class(bam)

## [1] "list"

lapply(bam, names)

## [[1]]
## [1] "rname"  "strand" "pos"    "qwidth"

The scanBam() function gives us a very crude, but quite flexible, interface to the data in a BAM file. The result in R is a list of lists, since the function allows us to retrieve only data from certain ranges (which would be the first level of the list) and then only data from certain fields (the second level of the list). In the above case we are only specifying this second level.

We can now create a new GRanges object with these data, which will allow us to more easily manipulate them.

mapGR <- GRanges(
   seqnames = bam[[1]]$rname,
   ranges   = IRanges(start=bam[[1]]$pos, width=bam[[1]]$qwidth),
   strand   = bam[[1]]$strand
)

mapGR

## GRanges object with 98938 ranges and 0 metadata columns:
##                    seqnames      ranges strand
##                       <Rle>   <IRanges>  <Rle>
##       [1] nHp.2.0.scaf00001   1779-1799      +
##       [2] nHp.2.0.scaf00001   1780-1802      -
##       [3] nHp.2.0.scaf00001   2085-2105      +
##       [4] nHp.2.0.scaf00001   2275-2298      -
##       [5] nHp.2.0.scaf00001   2275-2298      -
##       ...               ...         ...    ...
##   [98934] nHp.2.0.scaf04380 19333-19357      +
##   [98935] nHp.2.0.scaf04380 19333-19354      +
##   [98936] nHp.2.0.scaf04380 19335-19357      +
##   [98937] nHp.2.0.scaf04380 19339-19360      +
##   [98938] nHp.2.0.scaf04380 19339-19361      +
##   -------
##   seqinfo: 20 sequences from an unspecified genome; no seqlengths

We are using [[1]] which indicates the first position of the first level of the list. Since we did not restrict scanBam to specific ranges (e.g. chromosomes), all data will be saved within this [[1]] slot.

The previous instructions allow us to control exactly what is going into our mapGR object.

There is an easier way to obtain the same result, which is using the GenomicAlignments package (which internally uses scanBam()). This is the recommended way, for simplicity, but also because it deals with different situations to our example, such as paired-end reads, or alignments with gaps.

library(GenomicAlignments)

bamAlign <- readGAlignments(bamFile)

bamAlign

## GAlignments object with 98938 alignments and 0 metadata columns:
##                    seqnames strand       cigar    qwidth     start       end     width     njunc
##                       <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##       [1] nHp.2.0.scaf00001      +         21M        21      1779      1799        21         0
##       [2] nHp.2.0.scaf00001      -         23M        23      1780      1802        23         0
##       [3] nHp.2.0.scaf00001      +         21M        21      2085      2105        21         0
##       [4] nHp.2.0.scaf00001      -         24M        24      2275      2298        24         0
##       [5] nHp.2.0.scaf00001      -         24M        24      2275      2298        24         0
##       ...               ...    ...         ...       ...       ...       ...       ...       ...
##   [98934] nHp.2.0.scaf04380      +         25M        25     19333     19357        25         0
##   [98935] nHp.2.0.scaf04380      +         22M        22     19333     19354        22         0
##   [98936] nHp.2.0.scaf04380      +         23M        23     19335     19357        23         0
##   [98937] nHp.2.0.scaf04380      +         22M        22     19339     19360        22         0
##   [98938] nHp.2.0.scaf04380      +         23M        23     19339     19361        23         0
##   -------
##   seqinfo: 20 sequences from an unspecified genome

The slight problem is that the default object class is GAlignments, but this can be converted to a GRanges object:

mapGR2 <- as(bamAlign, "GRanges")

mapGR2

## GRanges object with 98938 ranges and 0 metadata columns:
##                    seqnames      ranges strand
##                       <Rle>   <IRanges>  <Rle>
##       [1] nHp.2.0.scaf00001   1779-1799      +
##       [2] nHp.2.0.scaf00001   1780-1802      -
##       [3] nHp.2.0.scaf00001   2085-2105      +
##       [4] nHp.2.0.scaf00001   2275-2298      -
##       [5] nHp.2.0.scaf00001   2275-2298      -
##       ...               ...         ...    ...
##   [98934] nHp.2.0.scaf04380 19333-19357      +
##   [98935] nHp.2.0.scaf04380 19333-19354      +
##   [98936] nHp.2.0.scaf04380 19335-19357      +
##   [98937] nHp.2.0.scaf04380 19339-19360      +
##   [98938] nHp.2.0.scaf04380 19339-19361      +
##   -------
##   seqinfo: 20 sequences from an unspecified genome

sum(mapGR == mapGR2)

## [1] 98938

all(mapGR == mapGR2)

## [1] TRUE

Which you can see produces the same result.

Counting reads mapping within annotations

We can use the countOverlaps() function to find out how many of our reads fall within (or nearby) each of our annotated regions. We can then store this result directly as an extra column of our annotation.

WARNING: Although we are going to ignore this for now, many of our annotations could actually be overlapping each other. Any read that maps to these overlapping regions will be counted again and again for each overlapping region.

mcols(hbak)$counts <- countOverlaps(hbak, mapGR)

mcols(hbak)

## DataFrame with 25747 rows and 7 columns
##            source     type                     ID                   Name    rep_name       class    counts
##          <factor> <factor>            <character>            <character> <character> <character> <integer>
## 1           maker     gene        HPOL_0000914501        HPOL_0000914501          NA          NA         0
## 2           maker     mRNA HPOL_0000914501-mRNA-1 HPOL_0000914501-mRNA-1          NA          NA         0
## 3           maker     exon HPOL_0000914501-mRNA..                     NA          NA          NA         0
## 4           maker     CDS  HPOL_0000914501-mRNA..                     NA          NA          NA         0
## 5           maker     exon HPOL_0000914501-mRNA..                     NA          NA          NA         0
## ...           ...      ...                    ...                    ...         ...         ...       ...
## 25743 rtracklayer mapmiRNA                     NA                     NA          NA          NA         0
## 25744 rtracklayer miRNA                        NA                     NA          NA          NA         0
## 25745 rtracklayer tRNA                         NA                     NA          NA          NA         0
## 25746 rtracklayer miRNA                        NA                     NA          NA          NA         3
## 25747 rtracklayer piRNA                        NA                     NA          NA          NA        10

If you receive a Warning with the above code you can ignore it for now. This warning arises because the “chromosomes” (reference sequences) are not identical between the annotation and the reads. R is warning us that we might be comparing different genomes, but we know that this is not so in our case.

One way of visualising these counts is to aggregate them in some way. We could do this by biotype, or even by individual gene.

typeCounts <- aggregate(mcols(hbak)$counts, by=list("type"=mcols(hbak)$type), sum)

typeCounts

##               type     x
## 1             gene 15403
## 2             mRNA 16380
## 3             exon   295
## 4              CDS   295
## 5  three_prime_UTR     0
## 6   five_prime_UTR     0
## 7        rep_unk_S 10191
## 8        rep_dna_S  1246
## 9        rep_rna_S 13252
## 10      rep_simple   432
## 11       rep_sat_S     1
## 12            tRNA   711
## 13           piRNA   198
## 14           miRNA  2656
## 15           snRNA     0
## 16            rRNA   100
## 17     other_ncRNA     0
## 18        mapmiRNA     0

If we only wanted to count reads in genes, we would first want to select the part of the object that corresponds to genes.

genes <- hbak[hbak$type == "gene"]

geneCounts <- aggregate(mcols(genes)$counts, by=list("id"=mcols(genes)$ID), sum)

head(geneCounts)

##                id  x
## 1 HPOL_0000000001 67
## 2 HPOL_0000000101  0
## 3 HPOL_0000000201  0
## 4 HPOL_0000000301  9
## 5 HPOL_0000000401  0
## 6 HPOL_0000000501  5

We could then use these geneCounts directly to perform differential gene expression analysis (next practical).

On the other hand, it can be quite useful to plot the distribution of our reads amongst different biotypes.

Exercise:

Make a plot like the one below.
Sort your data from smallest to largest counts.
Collapse the smallest categories (below a threshold that you define) into a single “other” category, and then remove the small ones.

Food for thought

The process of assigning mapped reads to a particular gene (to count it’s expression) is much more complex than what we have done in this practical. Think about the following questions:

What should we do if a read maps equally well to two genes?
What should we do if a read overlaps with a gene, but only by one base?
What should we do if a read maps to a gene, but on the opposite strand?
What should we do if a read maps to more than one location in the genome?
What should we do if a read maps to an intron of a gene?

There are various programs that specialise in obtaining gene expression values from mapped reads. These allow us to control many of the issues brought up by the above questions. One good option to do this inside R is the featureCounts() function from the Rsubread package.

Advanced exercises:

Calculate the absolute number of bases (and percent) where there is any kind of overlapping annotation according to your hbak object.
Propose a strategy (you don’t have to program it) to avoid counting more than once the reads that fall within overlapping annotations.