Instructions for this practical

This is an R Markdown document. It has some explanations and some hints. Before doing anything else, make sure that you can generate the HTML (web) page that this document is designed to produce. To do so, you click on the Knit button that should be near the top left of this script window. If this is the first time you are working with R Markdown, RStudio might need to install some packages, so make sure you are able to do so. This only happens once, and from then on, knitting your R Markdown documents will only take about as long as it takes to run your R code inside the document.

Feel free to modify this document however you want. You can add or modify any of the free text (like this one), and you can also add or modify R code (we will get to that further down). Just don’t go too fast until you are used to the format. Change something small, and click Knit again to make sure that the document still works. If it doesn’t, try to figure out what you did wrong. Perhaps use Undo to remove your changes and try something different. If you really mess up, and no one seems to be able to help you, simply download the original document again.

Start out by adding your own name on the third line of this document (the author field of the header). Knit again and make sure it worked. You can then delete this paragraph of instructions.

Introduction to R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for generating HTML, PDF, and documents in a few other formats. For more details on using R Markdown see http://rmarkdown.rstudio.com.

You will already have noticed that simple formatting can be achieved with certain text characters. For instance, you use single star characters to make text italic and double stars to make it bold. For section headers we use one or more hash characters (try adding or removing a ‘#’ from the current header). You should be able to access a Markdown Quick Reference in the Help menu of RStudio.

As you have hopefully already seen, when you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

x <- 1:10
y <- c(1,-1)
x * y
##  [1]   1  -2   3  -4   5  -6   7  -8   9 -10

You can add any valid R code in the lines between those marked with the triple backticks. To insert another code chunk there is a button at the top of this window, or you can try the keyboard shortcut (Windows: Ctrl-Alt-i, Mac: Option-Command-i). Try it after this paragraph, and add some simple R code. Then make sure that everything still works when you knit.
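For example, a minimal chunk might look like this in the source (the chunk label is just a made-up example):

```{r my-first-chunk}
2 + 2
```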

Including Plots

You can also embed plots, for example:

plot(x, x**2, main="A simple plot")

Notice that as your code progresses (in different chunks) throughout the document, the objects are not removed, so in the last chunk the x object still contains the values you assigned in one of the previous chunks.
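For instance, a chunk placed here can still use the x object created in the very first chunk:

mean(x)
## [1] 5.5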

If you are very comfortable with all the above, feel free to remove the introductory section of this document and focus on the remainder of the practical. If not, just leave everything as it is.


The problem of multiple testing

To revisit the problem of multiple testing, we are going to simulate gene expression data for which we know the null hypothesis is true, perform a t test on every gene, and count how many significant results we obtain.

Simulating gene expression data

We will start out by generating data according to the null hypothesis, that is, with a constant mean. We will use the normal distribution, which will allow us to use the t test for hypothesis testing.

I have set the code up with some values; the whole point of the exercise is for you to play around with these values and make sure you understand the results.

mu    <- 100 # a single expression average for every simulated value
stdev <- 5   # the standard deviation for the simulated values
genes <- 100  # for how many genes shall I simulate expression values?
reps  <- 3   # how many replicate measurements am I simulating for each gene?

Now we can simulate two sets of data: a control and a treated group of genes. If we use the same parameters for both groups, we know that the null hypothesis is true: the true means are the same. Nevertheless, since we are sampling random values, the numbers will of course be different, but the differences are only the sampling fluctuations expected under the null hypothesis.

We will format the results into a matrix with genes rows and reps columns, so that each row contains the expression values of one gene.

control <- rnorm(genes*reps, mean=mu, sd=stdev)
treated <- rnorm(genes*reps, mean=mu, sd=stdev)

control <- matrix(data=control, nrow=genes, ncol=reps)
treated <- matrix(data=treated, nrow=genes, ncol=reps)

head(control)
##           [,1]      [,2]     [,3]
## [1,] 103.16907 111.36585 93.46029
## [2,]  96.78435 107.15380 96.45599
## [3,]  98.74684  97.15717 99.97421
## [4,]  99.14940  92.13701 98.60704
## [5,]  88.15788  96.60742 98.28889
## [6,] 101.01973  88.60008 98.32328
head(treated)
##           [,1]      [,2]     [,3]
## [1,]  98.77231 110.02785 94.59938
## [2,]  99.76973 110.07194 98.09608
## [3,] 100.06786  98.97226 95.89846
## [4,] 100.93735 102.08234 91.16080
## [5,] 109.43291 101.35317 95.84549
## [6,]  97.36513 109.81741 95.28534

We can calculate and plot the mean expression values of the two groups. We expect to see the values forming a cloud around a common value: the true mean, 100 (yes, we can insert bits of R code inline with our text, using a single backtick, the letter r and the expression to evaluate).

plot(rowMeans(control), rowMeans(treated), xlab="control", ylab="treated", main="Average gene expression")
abline(h=mu, v=mu, lty=2)

Notice that I changed some of the chunk options of the last code chunk to get a square figure.
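For instance, knitr chunk options such as fig.width and fig.height control the size of the figure; a chunk header along these lines (the exact numbers are just an example) produces a square plot:

```{r square-plot, fig.width=5, fig.height=5}
plot(rowMeans(control), rowMeans(treated), xlab="control", ylab="treated")
```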

Hypothesis testing on the simulated data

We would now like to perform a t test on each gene (the rows of our two matrices). Let’s assume that we are not working with paired data, so we want to perform a regular two-sample t test. Let’s do this manually for the first gene:

t.test(control[1,], treated[1,], var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  control[1, ] and treated[1, ]
## t = 0.22108, df = 4, p-value = 0.8359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -17.70606  20.76983
## sample estimates:
## mean of x mean of y 
##  102.6651  101.1332

By default this performed a two-sided test, and the null hypothesis is that the difference in means is 0. We did need to set the var.equal argument to TRUE to get a classical Student’s t test.
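If you leave var.equal at its default value of FALSE, R performs Welch’s t test instead, which does not assume equal variances in the two groups:

t.test(control[1,], treated[1,])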

If we only want to extract the p-value (perhaps to save it somewhere), we can do so:

t.test(control[1,], treated[1,], var.equal=TRUE)$p.value
## [1] 0.835854

Since the data are completely random, I would guess that the resulting p-value will not be below 0.05. But you now know that this will not be the case for every single gene you test, even though all the values were sampled from the same distribution.

Now to perform the t test on all the genes. One way to do so is using a for loop, which you have come across a couple of times, although you might not yet be comfortable using it. So I am providing hints in the following code:

pvalues <- rep(1, times=genes) # This generates an object filled with 1s
for (i in 1:genes) {
  # any code here will be executed once for each gene
  # the only thing that changes is that the 'i' object increases by 1 on each pass
  # if you want your code to do something slightly different each time, it has to use the 'i' object somehow
  # the following code does NOT do this yet
  t.test(control[1,], treated[1,], var.equal=TRUE)$p.value
  # you will also want to store each result you get, but you don't want to overwrite previous ones!
  pvalues[i] <- 0.5
}
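If the loop pattern is still unclear, here is an analogous (but different) example that computes the mean of each row of the control matrix and stores it in the matching position of a result vector. It is not the solution, but the structure you need is the same; the object name is just illustrative:

control_means <- rep(0, times=genes)     # one slot per gene
for (i in 1:genes) {
  control_means[i] <- mean(control[i, ]) # row i goes into position i
}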

Were you able to fix the previous code to calculate what you wanted? If so, you should now have a collection of different p-values, one for each gene.

range(pvalues)
## [1] 0.5 0.5

Let’s have a look at the distribution of the p-values. Remember what to expect? How close you get to that expected distribution will depend on the total number of genes you are simulating.

hist(pvalues)

You might want to have a look at the breaks argument of hist(), to change the number of bars and smooth out the histogram.
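For example, something like this (20 is just an arbitrary choice):

hist(pvalues, breaks=20)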

Counting the number of significant results

A significant result is one in which we were able to reject the null hypothesis. We will follow the convention of choosing a threshold of 0.05, called \(\alpha\) (yes, we can add Greek symbols and, more generally, mathematical equations). Any p-value below this threshold we consider unlikely under the null hypothesis, leading us to reject it and claim a significant result. Unfortunately, here we know that the null hypothesis is true for all genes, so any rejection of the null is an error on our part. This is usually called a type I error or a False Positive, and our \(\alpha\) value is called the False Positive Rate.

So let’s start by counting how many significant results we obtained. This part will only work if you have first corrected the previous section.

alpha <- 0.05

any(pvalues < alpha)
## [1] FALSE
sum(pvalues < alpha)
## [1] 0

We could also use this number to estimate our empirical False Positive Rate. It should be pretty close to \(\alpha\) but it will depend a lot on the number of genes you simulated.

sum(pvalues < alpha) / genes
## [1] 0

Finally, how about locating these genes in a plot? We can repeat the average expression plot we used above, but we would want to colour in the significant genes. There are many ways of doing this in R; I will give you a hint on how to do so with the simple plot() function.

Just remember you are trying to plot the average simulated expression values, but colouring them according to your p-values.

minips <- c(0.9, 0.04, 0.7, 0.3, 0.01, 0.6)

ifelse(minips < 0.05, "red", "black")
## [1] "black" "red"   "black" "black" "red"   "black"
plot(minips, col=ifelse(minips < 0.05, "red", "black"))
abline(h=0.05, lty=2)

If you’ve got this far and have everything working fine, now is the time to go back and start playing around with some values. At the very least, try increasing the number of genes in your simulation. See what happens when you change your alpha value. How has your histogram of p-values behaved?


Exploring the power of statistical tests

Although we didn’t explicitly cover power during SDA, it is a very important concept in statistics and was mentioned briefly. In the first part of this practical, by simulating data under the null hypothesis, we were able to calculate the number, or the rate, of false positives. But what about true positive results? These are the cases where the null hypothesis is correctly rejected.

In our case, it would mean that the two groups truly are samples from distributions with different means, and that our hypothesis test returned \(p < \alpha\), allowing us to claim that the difference in means was significant. The True Positive Rate is also known as the power of the hypothesis test. One thing to note is that the power depends on the actual value of the difference in means. This should be easy to see intuitively: if the true difference is very large, it is highly unlikely that we would see such a difference under the null hypothesis, and the test should give us a very small p-value. On the other hand, if the difference is much smaller (although greater than zero), it is much more likely to see such a difference under the null, and the test will return a larger p-value, making it less likely for us to reject the null. When the difference is large, the test has more power (the True Positive Rate is higher); when the difference is small, the same test has less power (the TPR is lower).
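If you want a theoretical reference point, base R’s power.t.test() function computes the power of a two-sample t test for a given sample size per group, true difference in means (delta) and standard deviation; comparing a large and a small delta illustrates the idea (the numbers below are just examples):

power.t.test(n=3, delta=10, sd=5, sig.level=0.05) # large difference: high power
power.t.test(n=3, delta=2,  sd=5, sig.level=0.05) # small difference: low power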

As before, we can generate simulated data to explore these ideas. But in this case, we need to simulate data where we know that the alternative hypothesis (true difference in means > 0) is true.

I will start you out with a small bit of code, but you should be able to re-use a lot of code from the previous sections and adapt it to this new exploration.

mu1   <- 100 # expression average for the control group
mu2   <- 110 # expression average for the treated group

stdev <- 5   # the standard deviation for the simulated values
genes <- 100  # for how many genes shall I simulate expression values?
reps  <- 3   # how many replicate measurements am I simulating for each gene?

Do you remember how to interpret the spread of a normal distribution with respect to the standard deviation? It could help you choose your values for mu1, mu2 and stdev.
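As a reminder, roughly 95% of the values drawn from a normal distribution fall within two standard deviations of the mean; you can check this with pnorm():

pnorm(mu1 + 2*stdev, mean=mu1, sd=stdev) - pnorm(mu1 - 2*stdev, mean=mu1, sd=stdev)
## [1] 0.9544997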

At this point, you get to work on your own. But as guidelines consider the following goals:

Slightly more advanced goals:


Delete below this before giving to students

mu1   <- 100 # expression average for the control group
mu2   <- 110 # expression average for the treated group

stdev <- 5   # the standard deviation for the simulated values
reps  <- 50   # how many replicate measurements am I simulating for each gene?
genes <- 1000

#control <- rnorm(genes*reps, mean=mu1, sd=stdev)
#treated <- rnorm(genes*reps, mean=mu2, sd=stdev)
#control <- runif(genes*reps, 1, 10)
#treated <- runif(genes*reps, 2, 11)
# heavy-tailed t distribution (1 degree of freedom), with the treated group shifted by 1,
# to compare the behaviour of the t test and the Wilcoxon test below
control <- rt(genes*reps, 1)
treated <- rt(genes*reps, 1)+1


control <- matrix(data=control, nrow=genes, ncol=reps)
treated <- matrix(data=treated, nrow=genes, ncol=reps)
pvalues <- rep(1, times=genes)  # t test p-values
pvalues2 <- rep(1, times=genes) # Wilcoxon test p-values
for (i in 1:genes) {
  pvalues[i] <- t.test(control[i,], treated[i,], var.equal=TRUE)$p.value
  pvalues2[i] <- wilcox.test(control[i,], treated[i,])$p.value
}
range(pvalues)
## [1] 9.178068e-05 9.942912e-01
hist(pvalues)

hist(pvalues2)

sum(pvalues < 0.2) / genes
## [1] 0.268
sum(pvalues < 0.1) / genes
## [1] 0.13
sum(pvalues < 0.05) / genes
## [1] 0.064
sum(pvalues2 < 0.2) / genes
## [1] 0.905
sum(pvalues2 < 0.1) / genes
## [1] 0.827
sum(pvalues2 < 0.05) / genes
## [1] 0.735