In this final practical I will just describe some useful things that you can try out with R and Bioconductor.
This is a big problem for biologists in the era of genomics. Wiki page in case you want to read some more: Multiple_comparisons_problem. You will have heard that you cannot trust raw p-values, that p < 0.05 is no longer useful, etc. But perhaps you don’t fully grasp the problem.
One of my favorite explanations is by xkcd:
But, sometimes it’s nice to actually see the problem occurring in front of you.
So: you’ve all heard about the normal distribution, right? And you’ve heard about Student’s t-test for deciding if two samples come from the same normal distribution? This is how it works in R:
Let’s first define two truly different vectors.
x <- c(10.1,9.8,10.4)
y <- c(5.5,6.1,6.3)Now let’s perform a t-test, which should have no problem deciding that the probability of them coming from distributions with the same mean is very small. (So we reject the null hypothesis, and accept that they are different.)
t.test(x, y, var.equal=TRUE) # var.equal is for the standard t-test##
## Two Sample t-test
##
## data: x and y
## t = 13.951, df = 4, p-value = 0.0001531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.310747 4.955919
## sample estimates:
## mean of x mean of y
## 10.100000 5.966667
We can now define two vectors that DO come from the same distribution (Normal with mean=0 and SD=1).
x <- rnorm(3, mean=0, sd=1)
y <- rnorm(3, mean=0, sd=1)
t.test(x, y, var.equal=TRUE)##
## Two Sample t-test
##
## data: x and y
## t = 0.1988, df = 4, p-value = 0.8521
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.415671 1.634037
## sample estimates:
## mean of x mean of y
## 0.4360368 0.3268540
Even though the values are random (that’s what rnorm produces) the t-test will probably give a high p-value, so we cannot confidently reject the null hypothesis.
Everything OK until now. But, what happens if we perform many t-tests? (of course on different numbers.) Let’s try…
x <- matrix(rnorm(30000, mean=0, sd=1), nrow=10000, ncol=3)
y <- matrix(rnorm(30000, mean=0, sd=1), nrow=10000, ncol=3)We can even visualize that they seem to be pretty random:
plot(x[,1], y[,1], xlab="one set of random numbers", ylab="another set of random numbers")So, these data are as if we had gene expression measurements for 10000 genes, with 3 replicates and two conditions. But, the numbers are completely random. There should be no significant differences between the values in the rows of x, compared to the rows of y. Let’s try that out.
pvals <- vector(mode="numeric", length=nrow(x))
for (i in 1:nrow(x)) {
pvals[i] <- t.test(x[i,], y[i,], var.equal=TRUE)$p.value
}It can be quite useful looking at a histogram of the p-values that we obtained.
hist(pvals, breaks=10, main=paste(length(pvals),"t-tests on random numbers"), xlab="p-value")We now have all the evidence to prove our intuition wrong. Assuming our intuition was that there should not be any tiny p-value in our results, or that because the numbers were random, most p-values should be quite high (I’ve heard people say “close to 0.5” or even “close to 1”). If the previous histogram does not prove these ideas wrong immediately, how about these calculations:
sum(pvals < 0.001)## [1] 11
sum(pvals < 0.01)## [1] 101
sum(pvals < 0.05)## [1] 497
sum(pvals < 0.1)## [1] 988
So, what do you think now?
(take home message: if the null hypothesis is always true - e.g. all samples have the same mean - the p-values should have a uniform distribution)
Oh, and we can even visualize some of the most “differential” random genes.
library(pheatmap)
norm <- cbind(x, y)
rownames(norm) <- paste0("gene",1:nrow(norm))
colnames(norm) <- c(paste0("x",1:3), paste0("y",1:3))
anno <- data.frame(row.names=colnames(norm),"type"=c(rep("x",3),rep("y",3)))
pheatmap(norm[pvals < 0.005,], annotation_col=anno)Say we wanted to get our 3’UTR sequences to perform miRNA target site enrichment analysis. And we want to keep all our analyses reproducible, so that our friends can also download the same 3’UTR sequences. And so that we can download new sequences when the database is updated. And so we can download equivalent sequences for another species, with little effort.
Let’s do this in biomaRt.
library(biomaRt)## Warning: package 'biomaRt' was built under R version 4.1.1
Remember it’s always useful to check out the vignettes. This one has a lot of worked examples that are quite didactic.
browseVignettes(package="biomaRt")Similar to what we saw on the Ensembl BioMart website, we first need to connect to a BioMart database. We can find out which ones are available:
listMarts(host="http://uswest.ensembl.org")## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 104
## 2 ENSEMBL_MART_MOUSE Mouse strains 104
## 3 ENSEMBL_MART_SNP Ensembl Variation 104
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 104
We want to connect to Ensembl, so:
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host="http://uswest.ensembl.org")Each BioMart database will contain many Datasets. We can list them and connect to the one we want.
datasets <- listDatasets(ensembl)
head(datasets)## dataset description version
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1) ASM259213v1
## 2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2) fAstCal1.2
## 3 acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2) AnoCar2.0v2
## 4 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2) bAquChr1.2
## 5 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5) Midas_v5
## 6 amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2) ASM200744v2
ensembl <- useDataset("dmelanogaster_gene_ensembl", mart=ensembl)We now need to decide what we want to download.
Filters (listFilters) are used to restrict the results we want by some condition. We are not going to use them because we want all possible 3’UTRs.
Attributes (listAttributes) define the actual fields of the database that we want to download. Let’s see if we can find an attribute related to 3’UTRs.
searchAttributes(ensembl, pattern="UTR")## name description page
## 14 transcript_length Transcript length (including UTRs and CDS) feature_page
## 139 transcript_length Transcript length (including UTRs and CDS) structure
## 143 5_utr_start 5' UTR start structure
## 144 5_utr_end 5' UTR end structure
## 145 3_utr_start 3' UTR start structure
## 146 3_utr_end 3' UTR end structure
## 2570 5utr 5' UTR sequences
## 2571 3utr 3' UTR sequences
## 2591 5_utr_start 5' UTR start sequences
## 2592 5_utr_end 5' UTR end sequences
## 2593 3_utr_start 3' UTR start sequences
## 2594 3_utr_end 3' UTR end sequences
## 2602 transcript_length Transcript length (including UTRs and CDS) sequences
The usual function for getting data is getBM, but there is a convenient function for specifically getting sequences. But, we need to specify either chromosome positions or ids for which we want sequences.
Let’s just get all the Ensembl Gene and Transcript ids:
head(searchAttributes(ensembl, pattern="ensembl_transcript"))## name description page
## 2 ensembl_transcript_id Transcript stable ID feature_page
## 131 ensembl_transcript_id Transcript stable ID structure
## 165 ensembl_transcript_id Transcript stable ID homologs
## 2595 ensembl_transcript_id Transcript stable ID sequences
head(searchAttributes(ensembl, pattern="ensembl_gene"))## name description page
## 1 ensembl_gene_id Gene stable ID feature_page
## 130 ensembl_gene_id Gene stable ID structure
## 164 ensembl_gene_id Gene stable ID homologs
## 177 cabingdonii_homolog_ensembl_gene Abingdon island giant tortoise gene stable ID homologs
## 189 scaustralis_homolog_ensembl_gene African ostrich gene stable ID homologs
## 201 mspretus_homolog_ensembl_gene Algerian mouse gene stable ID homologs
Might as well only do for protein coding, since there won’t be a 3’UTR otherwise.
searchFilters(ensembl, pattern="biotype")## name description
## 91 biotype Type
## 92 transcript_biotype Transcript Type
bmRes <- getBM(attributes=c("ensembl_transcript_id","ensembl_gene_id","transcript_biotype"),
filters="transcript_biotype", values="protein_coding", mart=ensembl)Let’s see if we got what we expected.
head(bmRes)## ensembl_transcript_id ensembl_gene_id transcript_biotype
## 1 FBtr0330209 FBgn0015380 protein_coding
## 2 FBtr0081195 FBgn0015380 protein_coding
## 3 FBtr0088240 FBgn0010356 protein_coding
## 4 FBtr0091512 FBgn0250732 protein_coding
## 5 FBtr0334671 FBgn0250732 protein_coding
## 6 FBtr0114531 FBgn0250732 protein_coding
dim(bmRes)## [1] 30719 3
table(bmRes$transcript_biotype)##
## protein_coding
## 30719
Now we can get the sequences we wanted. To make it a bit faster, let’s only do it for now for the first 100 ids.
bmSmall <- bmRes[1:100,]
seq3utrs <- getSequence(id=bmSmall$ensembl_transcript_id, type="ensembl_transcript_id",
seqType="3utr", mart=ensembl)
dim(seq3utrs)## [1] 100 2
seq3utrs[1,]## 3utr
## 1 TATGATGGACATTATGGGACGAAATTGCATTTTCGTTGTGTTATATTCCTCTTTTTTTATCTGTTGTCTCTTACCAGCTCATCAAAATCCAAGAGTTCGCCAAGATTTCGGCAATGTGATTGAAAACCAAGCGAATAAAAACATTAATTAGTTTAAAGTTTAAAGCATTTTATGTTTAGTGTTATTGTGGACATAAAACATAGATAACATTGATAACATAGCCAAACATCGTTATTGCTAAGCATTGATGTCGCTAAACATTGATATTGCTAAACATTAATGTTGCTAAACATTAATGTTGCTAAACATTAATGTTGCCAAACAAGCGATGTGGCCGCTTAGAAAAGGAACAACTTTTCCAGGAAGAAAAGGTCAAGTTGACGCAGGAGGAGTGGGATTTAGGTTATTTTATTAACCAATAAAAAATGCTGTTCGTTTGTAATTAATTT
## ensembl_transcript_id
## 1 FBtr0076078
There is a possibility though that a 3’UTR might not be available for a particular gene. So, we can look out for these cases an remove them.
seq3utrs[seq3utrs[,1] == "Sequence unavailable",]## 3utr ensembl_transcript_id
## 27 Sequence unavailable FBtr0076997
seq3utrs <- seq3utrs[seq3utrs[,1] != "Sequence unavailable",]We could even set up a DNAStringSet:
library(Biostrings)
utrs3 <- DNAStringSet(seq3utrs$`3utr`)
names(utrs3) <- seq3utrs$ensembl_transcript_id
utrs3## DNAStringSet object of length 99:
## width seq names
## [1] 449 TATGATGGACATTATGGGACGAAATTGCATTTTCGTTGTGTTA...TTTATTAACCAATAAAAAATGCTGTTCGTTTGTAATTAATTT FBtr0076078
## [2] 177 CCCAATGCGGGTCATGTCCATATGTTGATTTTTGACTTCTGTT...CTAAGAATCCTTTTTTTTTCTGACAAATAAAAAACCGATTTG FBtr0074062
## [3] 38 AATTTTAATGAATCTATAATATAGAGTATTTGGAAAGC FBtr0071783
## [4] 92 AGATGTGCCCTCAACCTAATCGGCACTGACCTTTTATCTGCTG...AAACTGCTGTCTAATAAAACTATTATCATTCCTGCACGACCC FBtr0076160
## [5] 1287 GTAGGGGCTTTGGTGGGAGTGGTTGCCGAATCGGAACCGGAAT...GTCGTAAATCTCAAATAAAATAAATGCAAAAATCAATGGTGT FBtr0070945
## ... ... ...
## [95] 963 GTAGGGGCTTTGGTGGGAGTGGTTGCCGAATCGGAACCGGAAT...ATATATATATTTCTTTGCGTTCGCATTAAAAAAAAAAAGAAG FBtr0331599
## [96] 359 TACAGGAGGCTTGGGGATTCGCCAAGTGAACGGACGGGGAGCT...AACTGATGAGCATCGAGCATTTTCTTGGCTCTTCTCGCCACC FBtr0339555
## [97] 188 TAATTTCGGTATTCCGGTCGTAACAAAGGTATATGTGATTCGC...GTTTATCTGGCAATAAACAAATATACATTAATTAGACAATTC FBtr0332758
## [98] 59 ACGTTTCGAATTACAGAGCAATGTACCATATGGGGAATTATATTAAAAATCCGAGAAAT FBtr0306712
## [99] 320 TGCTCGACGGAACCAATCGACTACCAAACGTTTGATAGCCCTG...TTGTACAAAGCCCAAAAAACAATAAATTAATACTGTAAATTG FBtr0452078
and save them as a fasta file
writeXStringSet(utrs3, "biomart_3utrs.fa")