Discovering enriched sequence motifs

Introduction

This practical is not related to the previous one. Here, we will work a bit more with the Bioconductor package Biostrings. In this case we will be working with real data of a transcriptomic experiment that was performed using mouse Th1 cells that play important roles in the immune system. The experiment was designed to understand the function of a certain microRNA that is highly and specifically expressed in certain immune cells. A mouse miRNA KO model was used, and cells were obtained from KO and WT mice with biological replicates. At the end of the transcriptomic analysis a set of transcripts was defined that were repressed only in the WT. Since we know miRNAs are repressors of transcript levels, these repressed transcripts are expected to contain the targets for the miRNA of interest. One way of proving this, is to show that the set of repressed targets contain significantly more target sites than expected, i.e. more targets than in the non-repressed transcripts measured by the experiment.

We can also approach this problem in a more general way. Let’s look at target sites for all possible miRNAs, and see which one is the most enriched in the repressed transcripts. We can then check if this target site coincides with the miRNA that we were trying to predict, demonstrating against all possible controls that the repression is caused in part by the direct effect of this particular miRNA.

So, what does a miRNA target look like? It’s easier to explain with a simple diagram.

     cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p

            1 2 3 4 5 6 7 8 9
       (5') U G A G G U A G U A G G U U G U A U A G U U (3')
              | | | | | |
              C U C C A U
 (3') N N N N             N N N N N N N N N N N N N N N N N N N (5')

Positions 2-7 of the miRNA are known as the seed region, and these 6 nucleotides are the most important for target specificity in animals. The nice thing about this observation is that our motif discovery problem can be reduced to looking for enriched oligonucleotides of length 6, which are not that many (4^6 = 4096).

Where do miRNA targets occur? Mostly in the 3’UTR (3’ Un-Translated Region) of protein-coding transcripts. This also allows us to more cleanly define the potential target space. (Transcription Factors, on the other hand, bind to promoter regions or enhancers, that are MUCH more difficult to define.)

The data

For this problem, we need the following:

The 3’UTR sequences for all the transcripts that were measured by the experiment.

all_utrs.fa.gz (for Linux or Mac)
all_utrs.fa.zip (for Windows)

The ids for the transcripts that were determined to be repressed.

repressed_ids.txt

That’s it! Please download the files you need.

Working out the solution

1. You need to load the Biostrings library.

library(Biostrings)

2. The next step will be to read the data into R. You will need the function readDNAStringSet for reading the fasta file. You can use scan to import the text ids (but you need to use the what argument to specify that you are scanning text instead of numbers).

ids <- scan("data/repressed_ids.txt", what="a")

Make sure you have a look to see if the data look as expected.

utrs

## DNAStringSet object of length 14431:
##         width seq                                                                                   names               
##     [1]    55 ATTTGGCATCCTTACCTTGAAGTCAAAAAGAGAAGGCGAATAAAACATGGAAACC                               NM_029373
##     [2]    55 AAATTCCCATTCCCTCTGCAGTCACCATACCAATAAAGTGAAATTTTCTTCAGCA                               NM_011645
##     [3]    55 GAGACTGTCACCTGTGTGAGCTCTTGTCCTTGGAGCAACAATAAAATGCTTCCTG                               NM_138597
##     [4]    55 GATGAGACCCAGAAGTCTGGTTCCATTGTTCTTAAAGCCTGTGTCCTGACGTGAA                               NM_029017
##     [5]    55 CCCCGGAATTGGAGAGGCTGGGAGGGAGGGGGTCTCAATAAATTATTGTTCAACA                               NM_134469
##     ...   ... ...
## [14427]  8338 ATGTTGTGTGTGCCGCAGTGCAGCCTGTGCTTGCCCTGCCC...ATATTTTATAATGAAAGCAATAAATATGAATGTTGAATTTA NM_025965
## [14428]  9772 GGCGGGGCGGAGCAAGTGTGTGTGAGAGGTGTGGTAGACGT...CCTGGCTTTGTTGTAATAGAGTTGTTTATGTAGAAATACGA NM_011983
## [14429]  9832 AATGCTTGCTGTCTTCCAGTGTGTCCAATGANNNNNNNNNN...TGAACTATGACACTCAATAAAGACCTGTTTAACTTATTGAA NM_025292
## [14430] 10278 TCTTGTCTAAGGTGAAATGTGTGCTGTTGAAAAGCAGGTTT...TTGGAAATTTTTAATAAAACTAAATAACTTGTTCTCTTGTC NM_021458
## [14431] 10476 CTTTGGAAACTACCAGCATGTGCTTGCTATCAGACTGTAAA...GTATTGACATAATCATTAAATATGGAATAAAGTTTTGACAG NM_025335

head(ids)

## [1] "ENSMUST00000002678" "ENSMUST00000034074" "ENSMUST00000038750" "ENSMUST00000051210" "ENSMUST00000052366"
## [6] "ENSMUST00000061111"

One important control is to check that all the ids are actually within the utrs sequences.

all(ids %in% names(utrs))

## [1] TRUE

Break apart that code to make sure you understand it.

I would go so far as to try out a toy example:

x <- 1:5
y <- 4:8
x

## [1] 1 2 3 4 5

## [1] 4 5 6 7 8

x %in% y

## [1] FALSE FALSE FALSE  TRUE  TRUE

3. Count all possible 6-nt oligonucleotides in all sequences. We can use the oligonucleotideFrequency function for this (check the help page).

If calculated correctly, your result should look like this:

o6all[1:10,1:6]

##       AAAAAA AAAAAC AAAAAG AAAAAT AAAACA AAAACC
##  [1,]      0      0      1      0      1      0
##  [2,]      0      0      0      0      0      0
##  [3,]      0      0      0      0      0      0
##  [4,]      0      0      0      0      0      0
##  [5,]      0      0      0      0      0      0
##  [6,]      0      0      0      0      0      0
##  [7,]      0      0      0      0      0      0
##  [8,]      0      0      0      0      0      0
##  [9,]      0      0      0      1      0      0
## [10,]      0      0      0      0      0      0

dim(o6all)

## [1] 14431  4096

Do these numbers look familiar? They should: length(utrs)=14431 and 4^6 = 4096.

4. We should now obtain from this result, the counts corresponding to the repressed transcripts.

Since what we have are the ids, we can compare them to the names of the utrs, and see which ones match. We can then use this result to subset the count matrix.

wanted <- names(utrs) %in% ids

How about another check? The total number of matching names should be exactly the same as the number of ids.

sum(wanted) == length(ids)

## [1] TRUE

We can now use this logical vector to get the part of the count matrix that we want.

o6rep <- o6all[wanted,]

5. The problem stated in the introduction requires us to compare two sets. We need to find the total number of targets in the repressed transcripts, to compare against the total expected number of targets given all the transcripts.

So, we do not really need counts for each individual transcript, but the totals for each of the two sets. Since every possible target site (6mer) is represented in the columns of our count matrix, we can use the convenient colSums function.

You should now be able to produce two vectors, with the total number of all possible 6mers in two sets of sequences. The result should look like this:

head(o6allSum)

## AAAAAA AAAAAC AAAAAG AAAAAT AAAACA AAAACC 
##   3461   5212   6147   8395   8922   5197

head(o6repSum)

## AAAAAA AAAAAC AAAAAG AAAAAT AAAACA AAAACC 
##    237    328    443    529    610    405

How about comparing these with a plot?

plot(o6allSum, o6repSum, xlab="all 3'UTRs", ylab="repressed 3'UTRs")

Do you notice anything interesting?

6. Let’s think about how to compare our result against what is expected.

What would have happened if the experiment did not work as intended? Perhaps the KO mice was not a full KO, or the WT cells didn’t really express the miRNA. Maybe the miRNA KO was fine, but the KO cells were so drastically affected that the changes in gene expression were the result of cascading effects and not direct targets of the miRNA. There are soooo many ways an experiment might not work.

One way of thinking about this is to imagine what would happen if the repressed 3’UTRs were simply a random sample of all 3’UTRs? The advantage of this thought is that we can really test it out in R: we can sample a random set of rows from the count matrix, of the same length as our repressed transcripts, and repeat some of the steps we have done.

The sample function allows us to randomly sample part of an object, with or without replacement. Let’s use an easy example first:

x <- 1:10

sample(x, size=5, replace=FALSE)

## [1] 5 3 1 6 4

Now we can use this function to obtain our random or control set of “repressed” utrs.

o6rand <- o6all[sample(1:nrow(o6all), size=length(ids), replace=FALSE),]

o6randSum <- colSums(o6rand)

head(o6randSum)

## AAAAAA AAAAAC AAAAAG AAAAAT AAAACA AAAACC 
##    207    330    396    499    551    283

Let’s make the same plot as before, and alongside a new one using the random sequences.

par(mfrow=c(1,2))

plot(o6allSum, o6repSum, xlab="all 3'UTRs", ylab="real repressed 3'UTRs")
plot(o6allSum, o6randSum, xlab="all 3'UTRs", ylab="random 'repressed' 3'UTRs")

Now do you notice anything interesting?

7. Find a statistical test to evaluate 6mer over-representation.

Hopefully you will have noticed in the plots, one particular dot that clearly separates from the diagonal. Think of the diagonal as the expected (the random sample follows the diagonal more closely), so separation from the diagonal is what we are looking for. A 6mer that separates above the diagonal is more abundant in the y-axis (repressed utrs) than the x-axis (all utrs). A 6mer that separated below the diagonal is less abundant than expected (we are not interested in that case for now).

A statistical model that suits our problem quite well is the hypergeometric distribution. The easiest way to understand this model, is thinking of an urn with two kinds of marbles, e.g. blue and black.

We take a sample without replacement, and count the number of blue and black marbles. We could calculate the probability of obtaining that exact combination of blue and black marbles. More interestingly, we can calculate the probability of having extracted at least X blue marbles. This is what we use when searching for enrichment. If the sample contains significantly more blue balls than expected, we declare that it is not a random sample. Instead, there must be something special about it that led to the enrichment.

R has functions to deal with many popular distributions (e.g. rnorm, pnorm, dnorm for the normal distribution). The hypergeomtric distribution is also available. We are interested in two functions: dhyper for calculating exact probabilities and phyper for calculating cumulative probabilies. Let’s use a simple example first.

Imagine an urn with 100 marbles, 10 of which are white (the rest are black). If you randomly draw out 20, what would you expect? The most likely scenario would be to find 2 white balls (since randomly you expect 10/100 or 10% to be white). Let’s calculate this in R.

urn <- 100
urnWhite <- 10
draw <- 20
drawWhite <- 2

# the R functions ask specifically for the number of black balls in the urn, so let's calculate:
urnBlack <- urn - urnWhite

dhyper(drawWhite, urnWhite, urnBlack, draw)

## [1] 0.3181706

To put that into context, let’s calculate the probabilities of the other possibilities, i.e. getting 0, 1, 2 .. 10 white marbles.

drawWhite <- 0:10

pvalsExact <- dhyper(drawWhite, urnWhite, urnBlack, draw)

plot(drawWhite, pvalsExact, type="b", xlab="number of drawn white", ylab="p-value",
     main="Probability of an exact number of whites")

You can now see more easily that, yes, in fact the most likely result is to get 2 white marbles (p ~0.3), but getting 1 or 3 is quite likely as well.

Now, how about asking: what is the probability of getting at least 0, 1, 2 .. 10 white marbles?

pvalsAtLeast <- phyper(drawWhite-1, urnWhite, urnBlack, draw, lower.tail=FALSE)

plot(drawWhite, pvalsAtLeast, type="b", xlab="number of drawn white", ylab="p-value",
     main="Probability of at least a number of whites")

Note: we had to subtract 1 to the drawWhite vector because the phyper function starts integrating probabilities at +1 after each value. You will see that the result makes sense, because what is the probability of getting at least 0 whites? It has to be 1 since there is no other possibility!

Now, how do we interpret this kind of result, regarding the concept of enrichment. Well, if we are interested in cases in which we get a surprisingly large number of white marbles, and we were to observe 2 white marbles amongst our 20 it should not be at all surprising. The same goes for 0, 1, 3. Getting 4 would be in the grey area, and only if we get 5 or more should we start to get excited. You can actually play around with sample to simulate these scenarios.

8. Apply the hypergeometric test to evaluate which 6mer is significantly over-represented in the repressed 3’UTRs compared to the rest.

Now let’s apply this knowledge to our real problem. The only difference is that we need to perform 4^6=4**6 calculations, one for each 6mer that we have counted. Each 6mer shall become, in turn, the “white marbles”. All other 6mers shall be the “black marbles”. Let’s first solve for one 6mer.

kmer <- names(o6allSum)[1]

# Calculate all the numbers
urn <- sum(o6allSum)
urnWhite <- o6allSum[kmer]
draw <- sum(o6repSum)
drawWhite <- o6repSum[kmer]
urnBlack <- urn - urnWhite

# Calculate the probability of getting at least the observed number of this kmer
phyper(drawWhite-1, urnWhite, urnBlack, draw, lower.tail=FALSE)

##    AAAAAA 
## 0.8314682

Does this correspond to an enrichmer 6mer in our repressed sequences?

Since some of these calculations are redundant, let’s prepare a data.frame to store them and the results.

pvalRes <- data.frame(row.names=names(o6allSum))
pvalRes$drawWhite <- o6repSum-1
pvalRes$urnWhite  <- o6allSum
pvalRes$urnBlack  <- sum(o6allSum) - o6allSum
pvalRes$draw      <- sum(o6repSum)

head(pvalRes)

##        drawWhite urnWhite urnBlack    draw
## AAAAAA       236     3461 15482404 1123784
## AAAAAC       327     5212 15480653 1123784
## AAAAAG       442     6147 15479718 1123784
## AAAAAT       528     8395 15477470 1123784
## AAAACA       609     8922 15476943 1123784
## AAAACC       404     5197 15480668 1123784

Let’s make sure everything is calculate properly. We can use the numbers in our data.frame to re-calculate the p-value for rownames(pvalRes)[1].

phyper(pvalRes[1,1], pvalRes[1,2], pvalRes[1,3], pvalRes[1,4], lower.tail=FALSE)

## [1] 0.8314682

Now, we could perfectly well set up a for loop to go down all the rows making these calculations. But it turns out, that the phyper function accepts vectors in each of it’s arguments. So, we can perform the calculation with one instruction:

pvalRes$phyper <- phyper(pvalRes$drawWhite, pvalRes$urnWhite, pvalRes$urnBlack, pvalRes$draw, lower.tail=FALSE)

head(pvalRes)

##        drawWhite urnWhite urnBlack    draw     phyper
## AAAAAA       236     3461 15482404 1123784 0.83146820
## AAAAAC       327     5212 15480653 1123784 0.99711265
## AAAAAG       442     6147 15479718 1123784 0.56715640
## AAAAAT       528     8395 15477470 1123784 0.99973353
## AAAACA       609     8922 15476943 1123784 0.94035643
## AAAACC       404     5197 15480668 1123784 0.07283608

We would now like to order the table, according to the phyper column, in ascending order (smaller p-values first, please!)

pvalRes <- pvalRes[order(pvalRes$phyper),]

head(pvalRes)

##        drawWhite urnWhite urnBlack    draw       phyper
## GCATTA       384     2656 15483209 1123784 1.160235e-37
## CATTAA       452     4817 15481048 1123784 1.954931e-08
## AGCATT       526     5749 15480116 1123784 4.092038e-08
## CCTCCC       712     8233 15477632 1123784 9.647014e-07
## CCCTCC       637     7399 15478466 1123784 5.573590e-06
## CTGCCA       663     7760 15478105 1123784 9.048327e-06

Homework: spend some time learning how to use the order function.

We can even visualize our top predictions in some way.

par(mfrow=c(1,2))

# First all results
barplot(-log10(pvalRes[,"phyper"]), names.arg=NULL, ylab="-log10(p-value)",
        main=paste("Enrichment values for all 6mers"))

# Now only top results
topN <- 5
barplot(-log10(pvalRes[1:topN,"phyper"]), names.arg=rownames(pvalRes)[1:topN], ylab="-log10(p-value)",
        main=paste("Top",topN,"most enriched 6mers"))

9. Homework: use miRBase to get miRNA sequences and decide which miRNA can bind to our enriched 6mer.

Due to the duration of this course, there will be no time to go into detail of this part of the practical. Nevertheless, if you have digested the last two practicals, you should be able to do the following on your own.

Go to miRBase
Download the sequences of all mature miRNA sequences
Read them into R
Subset to only keep those that are mouse miRNAs
Extract the seed region (nucleotides 2-7)
Calculate the target site, i.e. the reverse complement.
Find out which miRNA can bind to your predicted target site!