Extra R/BioC stuff

In this final practical I will just describe some useful things that you can try out with R and Bioconductor.

Empirical demonstration of the multiple testing problem

This is a big problem for biologists in the era of genomics. Wiki page in case you want to read some more: Multiple_comparisons_problem. You will have heard that you cannot trust raw p-values, that p < 0.05 is no longer useful, etc. But perhaps you don’t fully grasp the problem.

One of my favorite explanations is by xkcd:

But, sometimes it’s nice to actually see the problem occurring in front of you.

So: you’ve all heard about the normal distribution, right? And you’ve heard about Student’s t-test for deciding if two samples come from the same normal distribution? This is how it works in R:

Let’s first define two truly different vectors.

x <- c(10.1,9.8,10.4)
y <- c(5.5,6.1,6.3)

Now let’s perform a t-test, which should have no problem deciding that the probability of them coming from distributions with the same mean is very small. (So we reject the null hypothesis, and accept that they are different.)

t.test(x, y, var.equal=TRUE) # var.equal is for the standard t-test

## 
##  Two Sample t-test
## 
## data:  x and y
## t = 13.951, df = 4, p-value = 0.0001531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.310747 4.955919
## sample estimates:
## mean of x mean of y 
## 10.100000  5.966667

We can now define two vectors that DO come from the same distribution (Normal with mean=0 and SD=1).

x <- rnorm(3, mean=0, sd=1)
y <- rnorm(3, mean=0, sd=1)

t.test(x, y, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  x and y
## t = 0.1988, df = 4, p-value = 0.8521
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.415671  1.634037
## sample estimates:
## mean of x mean of y 
## 0.4360368 0.3268540

Even though the values are random (that’s what rnorm produces) the t-test will probably give a high p-value, so we cannot confidently reject the null hypothesis.

Everything OK until now. But, what happens if we perform many t-tests? (of course on different numbers.) Let’s try…

x <- matrix(rnorm(30000, mean=0, sd=1), nrow=10000, ncol=3)
y <- matrix(rnorm(30000, mean=0, sd=1), nrow=10000, ncol=3)

We can even visualize that they seem to be pretty random:

plot(x[,1], y[,1], xlab="one set of random numbers", ylab="another set of random numbers")

So, these data are as if we had gene expression measurements for 10000 genes, with 3 replicates and two conditions. But, the numbers are completely random. There should be no significant differences between the values in the rows of x, compared to the rows of y. Let’s try that out.

pvals <- vector(mode="numeric", length=nrow(x))

for (i in 1:nrow(x)) {
  pvals[i] <- t.test(x[i,], y[i,], var.equal=TRUE)$p.value
}

It can be quite useful looking at a histogram of the p-values that we obtained.

hist(pvals, breaks=10, main=paste(length(pvals),"t-tests on random numbers"), xlab="p-value")

We now have all the evidence to prove our intuition wrong. Assuming our intuition was that there should not be any tiny p-value in our results, or that because the numbers were random, most p-values should be quite high (I’ve heard people say “close to 0.5” or even “close to 1”). If the previous histogram does not prove these ideas wrong immediately, how about these calculations:

sum(pvals < 0.001)

## [1] 11

sum(pvals < 0.01)

## [1] 101

sum(pvals < 0.05)

## [1] 497

sum(pvals < 0.1)

## [1] 988

So, what do you think now?

(take home message: if the null hypothesis is always true - e.g. all samples have the same mean - the p-values should have a uniform distribution)

Oh, and we can even visualize some of the most “differential” random genes.

library(pheatmap)
norm <- cbind(x, y)
rownames(norm) <- paste0("gene",1:nrow(norm))
colnames(norm) <- c(paste0("x",1:3), paste0("y",1:3))
anno <- data.frame(row.names=colnames(norm),"type"=c(rep("x",3),rep("y",3)))
pheatmap(norm[pvals < 0.005,], annotation_col=anno)

A quick example using biomaRt

Say we wanted to get our 3’UTR sequences to perform miRNA target site enrichment analysis. And we want to keep all our analyses reproducible, so that our friends can also download the same 3’UTR sequences. And so that we can download new sequences when the database is updated. And so we can download equivalent sequences for another species, with little effort.

Let’s do this in biomaRt.

library(biomaRt)

## Warning: package 'biomaRt' was built under R version 4.1.1

Remember it’s always useful to check out the vignettes. This one has a lot of worked examples that are quite didactic.

browseVignettes(package="biomaRt")

Similar to what we saw on the Ensembl BioMart website, we first need to connect to a BioMart database. We can find out which ones are available:

listMarts(host="http://uswest.ensembl.org")

##                biomart                version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 104
## 2   ENSEMBL_MART_MOUSE      Mouse strains 104
## 3     ENSEMBL_MART_SNP  Ensembl Variation 104
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 104

We want to connect to Ensembl, so:

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host="http://uswest.ensembl.org")

Each BioMart database will contain many Datasets. We can list them and connect to the one we want.

datasets <- listDatasets(ensembl)
head(datasets)

##                        dataset                           description     version
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1) ASM259213v1
## 2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.2)  fAstCal1.2
## 3   acarolinensis_gene_ensembl       Green anole genes (AnoCar2.0v2) AnoCar2.0v2
## 4    acchrysaetos_gene_ensembl       Golden eagle genes (bAquChr1.2)  bAquChr1.2
## 5    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)    Midas_v5
## 6    amelanoleuca_gene_ensembl       Giant panda genes (ASM200744v2) ASM200744v2

ensembl <- useDataset("dmelanogaster_gene_ensembl", mart=ensembl)

We now need to decide what we want to download.

Filters (listFilters) are used to restrict the results we want by some condition. We are not going to use them because we want all possible 3’UTRs.

Attributes (listAttributes) define the actual fields of the database that we want to download. Let’s see if we can find an attribute related to 3’UTRs.

searchAttributes(ensembl, pattern="UTR")

##                   name                                description         page
## 14   transcript_length Transcript length (including UTRs and CDS) feature_page
## 139  transcript_length Transcript length (including UTRs and CDS)    structure
## 143        5_utr_start                               5' UTR start    structure
## 144          5_utr_end                                 5' UTR end    structure
## 145        3_utr_start                               3' UTR start    structure
## 146          3_utr_end                                 3' UTR end    structure
## 2570              5utr                                     5' UTR    sequences
## 2571              3utr                                     3' UTR    sequences
## 2591       5_utr_start                               5' UTR start    sequences
## 2592         5_utr_end                                 5' UTR end    sequences
## 2593       3_utr_start                               3' UTR start    sequences
## 2594         3_utr_end                                 3' UTR end    sequences
## 2602 transcript_length Transcript length (including UTRs and CDS)    sequences

The usual function for getting data is getBM, but there is a convenient function for specifically getting sequences. But, we need to specify either chromosome positions or ids for which we want sequences.

Let’s just get all the Ensembl Gene and Transcript ids:

head(searchAttributes(ensembl, pattern="ensembl_transcript"))

##                       name          description         page
## 2    ensembl_transcript_id Transcript stable ID feature_page
## 131  ensembl_transcript_id Transcript stable ID    structure
## 165  ensembl_transcript_id Transcript stable ID     homologs
## 2595 ensembl_transcript_id Transcript stable ID    sequences

head(searchAttributes(ensembl, pattern="ensembl_gene"))

##                                 name                                   description         page
## 1                    ensembl_gene_id                                Gene stable ID feature_page
## 130                  ensembl_gene_id                                Gene stable ID    structure
## 164                  ensembl_gene_id                                Gene stable ID     homologs
## 177 cabingdonii_homolog_ensembl_gene Abingdon island giant tortoise gene stable ID     homologs
## 189 scaustralis_homolog_ensembl_gene                African ostrich gene stable ID     homologs
## 201    mspretus_homolog_ensembl_gene                 Algerian mouse gene stable ID     homologs

Might as well only do for protein coding, since there won’t be a 3’UTR otherwise.

searchFilters(ensembl, pattern="biotype")

##                  name     description
## 91            biotype            Type
## 92 transcript_biotype Transcript Type

bmRes <- getBM(attributes=c("ensembl_transcript_id","ensembl_gene_id","transcript_biotype"),
              filters="transcript_biotype", values="protein_coding", mart=ensembl)

Let’s see if we got what we expected.

head(bmRes)

##   ensembl_transcript_id ensembl_gene_id transcript_biotype
## 1           FBtr0330209     FBgn0015380     protein_coding
## 2           FBtr0081195     FBgn0015380     protein_coding
## 3           FBtr0088240     FBgn0010356     protein_coding
## 4           FBtr0091512     FBgn0250732     protein_coding
## 5           FBtr0334671     FBgn0250732     protein_coding
## 6           FBtr0114531     FBgn0250732     protein_coding

dim(bmRes)

## [1] 30719     3

table(bmRes$transcript_biotype)

## 
## protein_coding 
##          30719

Now we can get the sequences we wanted. To make it a bit faster, let’s only do it for now for the first 100 ids.

bmSmall <- bmRes[1:100,]
seq3utrs <- getSequence(id=bmSmall$ensembl_transcript_id, type="ensembl_transcript_id",
                       seqType="3utr", mart=ensembl)

dim(seq3utrs)

## [1] 100   2

seq3utrs[1,]

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                3utr
## 1 TATGATGGACATTATGGGACGAAATTGCATTTTCGTTGTGTTATATTCCTCTTTTTTTATCTGTTGTCTCTTACCAGCTCATCAAAATCCAAGAGTTCGCCAAGATTTCGGCAATGTGATTGAAAACCAAGCGAATAAAAACATTAATTAGTTTAAAGTTTAAAGCATTTTATGTTTAGTGTTATTGTGGACATAAAACATAGATAACATTGATAACATAGCCAAACATCGTTATTGCTAAGCATTGATGTCGCTAAACATTGATATTGCTAAACATTAATGTTGCTAAACATTAATGTTGCTAAACATTAATGTTGCCAAACAAGCGATGTGGCCGCTTAGAAAAGGAACAACTTTTCCAGGAAGAAAAGGTCAAGTTGACGCAGGAGGAGTGGGATTTAGGTTATTTTATTAACCAATAAAAAATGCTGTTCGTTTGTAATTAATTT
##   ensembl_transcript_id
## 1           FBtr0076078

There is a possibility though that a 3’UTR might not be available for a particular gene. So, we can look out for these cases an remove them.

seq3utrs[seq3utrs[,1] == "Sequence unavailable",]

##                    3utr ensembl_transcript_id
## 27 Sequence unavailable           FBtr0076997

seq3utrs <- seq3utrs[seq3utrs[,1] != "Sequence unavailable",]

We could even set up a DNAStringSet:

library(Biostrings)

utrs3 <- DNAStringSet(seq3utrs$`3utr`)
names(utrs3) <- seq3utrs$ensembl_transcript_id

utrs3

## DNAStringSet object of length 99:
##      width seq                                                                                      names               
##  [1]   449 TATGATGGACATTATGGGACGAAATTGCATTTTCGTTGTGTTA...TTTATTAACCAATAAAAAATGCTGTTCGTTTGTAATTAATTT FBtr0076078
##  [2]   177 CCCAATGCGGGTCATGTCCATATGTTGATTTTTGACTTCTGTT...CTAAGAATCCTTTTTTTTTCTGACAAATAAAAAACCGATTTG FBtr0074062
##  [3]    38 AATTTTAATGAATCTATAATATAGAGTATTTGGAAAGC                                                   FBtr0071783
##  [4]    92 AGATGTGCCCTCAACCTAATCGGCACTGACCTTTTATCTGCTG...AAACTGCTGTCTAATAAAACTATTATCATTCCTGCACGACCC FBtr0076160
##  [5]  1287 GTAGGGGCTTTGGTGGGAGTGGTTGCCGAATCGGAACCGGAAT...GTCGTAAATCTCAAATAAAATAAATGCAAAAATCAATGGTGT FBtr0070945
##  ...   ... ...
## [95]   963 GTAGGGGCTTTGGTGGGAGTGGTTGCCGAATCGGAACCGGAAT...ATATATATATTTCTTTGCGTTCGCATTAAAAAAAAAAAGAAG FBtr0331599
## [96]   359 TACAGGAGGCTTGGGGATTCGCCAAGTGAACGGACGGGGAGCT...AACTGATGAGCATCGAGCATTTTCTTGGCTCTTCTCGCCACC FBtr0339555
## [97]   188 TAATTTCGGTATTCCGGTCGTAACAAAGGTATATGTGATTCGC...GTTTATCTGGCAATAAACAAATATACATTAATTAGACAATTC FBtr0332758
## [98]    59 ACGTTTCGAATTACAGAGCAATGTACCATATGGGGAATTATATTAAAAATCCGAGAAAT                              FBtr0306712
## [99]   320 TGCTCGACGGAACCAATCGACTACCAAACGTTTGATAGCCCTG...TTGTACAAAGCCCAAAAAACAATAAATTAATACTGTAAATTG FBtr0452078

and save them as a fasta file

writeXStringSet(utrs3, "biomart_3utrs.fa")