In this practical we will be analising our sRNA-Seq data, using the edgeR Bioconductor package. This package contains one of the most popular methods for analysing differential gene expression from sequencing data. Our data needs to be in the form of “counts” (integers), similar to what we produced at the end of the previous practical, and these counts will be compared to a Negative Binomial distribution. This type of distribution is often used then the data being modelled has higher variability than what a Poisson would predict. This is the case with sequencing data due to what we refer to as biological variability (any differences between samples beyond what is expected by sampling noise).
During this practical you will need two Bioconductor packages, so make sure you have them correctly installed:
If you do not have either of them, you can install the correct
version using the Bioconductor install function:
To simplify this practical a bit, I already prepared a count table
for our sRNA-Seq data. I basically used the tally Linux
command to collapse all the short sequences detected across any of the
libraries, thus defining a “universe” of all sRNAs in our experiment. I
then kept any sequence that had at least 5 counts in total across all
conditions. I then used tally on each individual library to obtain
counts for each one, and combined these into a count matrix.
Please download the file that is the result of this process: all_tally_counts.tab.gz
The first thing you need to do is load the edgeR package
in your R session:
I should point out that this is not a very complete tutorial, but a brief overview of some of the possibilities. If you need to work more with these types of analyses, I highly recommend you have a look at the edgeR manual, which you can open like this:
Now, where do you have your file with the expression counts?
It is a simple table, so we can use read.table() to
import the data into R:
## MODEK_ctrl_r1 MODEK_ctrl_r2 MODEK_ctrl_r3 MODEK_EV_r1_alt MODEK_EV_r1 MODEK_EV_r2 Pure_EV_r1
## TGAGGTAGTAGGTTGTATGGTT 731227 660554 790987 594588 47863 556576 22
## TGAGGTAGTAGGTTGTGTGGTT 282718 251073 305412 221319 10725 212029 23
## TGAGGTAGTAGGTTGTATAGTT 142811 120775 145612 117475 36641 111974 2312
## TGAGGTAGTAGGTTGTATGGT 62532 56285 70073 47725 3400 43606 8
## TGAGGTAGTAGGTTGTGTGGT 54367 47793 61973 33023 1750 31405 10
## TGAGGTAGGAGGTTGTATAGTT 45504 38557 47125 39197 1095 36642 0
## TGAGGTAGTAGGTTGTGTGGTTT 41253 36301 42755 24802 1887 23941 0
## TGAGGTAGTAGGTTGTATGGTTT 36561 32563 37021 27443 2708 26046 0
## CGTAGAACCGACCTTGCG 19939 30328 38217 29544 3827 34961 0
## TAGGTAGTTTCATGTTGTTGGG 30373 24789 27680 29014 447 28141 0
## GATGACCAACCGGCTGTGGAAGC 0 0 0 8046 8046 7682 42233
## CTAGACTGAGGCTCCTTGAGG 24038 21321 21915 21879 1029 20172 0
## Pure_EV_r2
## TGAGGTAGTAGGTTGTATGGTT 71
## TGAGGTAGTAGGTTGTGTGGTT 29
## TGAGGTAGTAGGTTGTATAGTT 4296
## TGAGGTAGTAGGTTGTATGGT 5
## TGAGGTAGTAGGTTGTGTGGT 10
## TGAGGTAGGAGGTTGTATAGTT 0
## TGAGGTAGTAGGTTGTGTGGTTT 13
## TGAGGTAGTAGGTTGTATGGTTT 2
## CGTAGAACCGACCTTGCG 0
## TAGGTAGTTTCATGTTGTTGGG 0
## GATGACCAACCGGCTGTGGAAGC 59302
## CTAGACTGAGGCTCCTTGAGG 3
You should already be able to see an interesting sequence:
## MODEK_ctrl_r1 MODEK_ctrl_r2 MODEK_ctrl_r3 MODEK_EV_r1_alt MODEK_EV_r1 MODEK_EV_r2 Pure_EV_r1
## GATGACCAACCGGCTGTGGAAGC 0 0 0 8046 8046 7682 42233
## Pure_EV_r2
## GATGACCAACCGGCTGTGGAAGC 59302
Surely this looks like a parasite sequence that has entred the mouse cells, right?
Another thing you might have noticed even with this brief look at the table, is that the Pure_EV libraries seem very different to the rest. This could cause a problem with the analysis. Why? One of the assumptions of differential gene expression analysis is that there are always a large number of genes (sRNAs in our case) that are not really changing between the conditions we want to compare. If there are samples, such as our Pure_EV samples, where all the genes have very different expression values to the other conditions, this would break this assumption, and we would have to remove them from our analysis.
Remember that for this experiment, our most important question will be to compare MODEK_EV against MODEK_ctrl samples, to ask which sRNA sequences are significantly higher in MODEK_EV and thus could be parasite sRNAs.
This is an obvious question that you might be thinking about. Although this strategy will work in some cases, there are several reasons why it will not always work:
For these reasons, it is better to perform a statistical comparison between these samples, as we will be doing in this practical.
So, let’s get back to exploring the data. Can you find out how many “genes” are in the table? How many samples?
## [1] 104790 8
Since different samples could be sequenced to different depths, we will need to normalise this in some way. It is useful to find out just how different the total number of counts are in each sample:
## MODEK_ctrl_r1 MODEK_ctrl_r2 MODEK_ctrl_r3 MODEK_EV_r1_alt MODEK_EV_r1 MODEK_EV_r2 Pure_EV_r1
## 2.1 1.8 2.3 1.9 1.5 1.9 1.1
## Pure_EV_r2
## 1.6
Before we actually get to the differential expression analysis, it is a good idea to identify genes (sRNAs in this case) that were sequenced at extremely low levels, and remove them. The reason for this is that when the counts are too low, we will not be able to find any statistically significant differences (this has to do with statistical power). For instance, if a gene simply has count values of 0 in some conditions and 1 in others, although that might seem like a clear change, it could equally have occurred just by chance. Unless we had a ridiculously big number of replicates, no statistical method will allow us to conclude that 1 is significantly higher than 0 in these cases. Something similar will happen with other small values.
If we do not remove these cases before starting our analysis, we will end up having a bigger problem of multiple testing. If you are not familiar with this concept, the final practical has some examples. For now, just trust me…
To perform this filtering step, we need to propose a value for low expression, to find the genes that are not expressed higher than this, and remove them. But, if our samples have different total sequencing depth, it would be a bad idea to propose a value based on counts. This is because a sample with more total counts, just by chance, will have higher counts for a similarly lowly expressed gene than a sample with lower total counts.
The cpm() function allows us to correct this problem,
since it calculates counts per million. This simply means that
it divides each count value by the total sum of its column (sample) and
multiples the result by 1 million. These cpm values are
now comparable between columns, so we can use them to choose a threshold
for low expression.
## keep
## FALSE TRUE
## 37811 66979
## [1] 66979 8
This last bit of code will find all the cpm values that
are at least 1 (1 read for each million reads in that sample) and it
will then choose genes (rows) that have such a cpm value in
at least 2 samples. It is quite reasonable to discard a large fraction
of genes with this type of filter. Warning: we cannot
use these exact same values for every experiment we might come across.
In this case, I chose a cpm of 1 because the sequencing
depth of this experiment didn’t allow me to choose a smaller value (many
samples had about 1 million total reads). I then chose at least 2
samples because some of the sample types were only sequenced in
duplicate, so a gene that might be specific to such a type of sample
would only be expressed in 2 samples across the whole experiment.
I would usually recommend trying out different combinations of filters. You want to eliminate a large number of genes that are uninformative (too lowly expressed) but also retain a representative number of genes. There are no hard rules for this, and some intuition about your experiment is important. In the case of sRNAs, we might want to keep a large number of them, since we are dealing with individual sRNA sequences. On the other hand, we do not want to only keep sequences that come from samples which have high sequencing depth.
An alternative is to use an automatic function that
edgeR has to make this choice. Careful: this
function can return VERY different results if we use only the original
count matrix, or in addition a grouping factor from the experimental
design (which we shall explore further down).
## No group or design set. Assuming all samples belong to one group.
##
## FALSE TRUE
## 104648 142
In this case, I would definitely advise against using this function. Do remember that what is recommended in a manual might not be the best option for all cases.
The next step is to tell edgeR how our samples are grouped. In our case, we have three types of sample (MODEK, MODEK+EV and EV). We can define each sample type by hand, but it is much better to try to do this automatically, both to save time, but also to avoid easy-to-make errors.
The column names already contain this information, we just have to trim them a bit. We can use an R function to substitute (in this case remove) some text from another test. (‘.’ means any character, ‘+’ means repeat the previous character or special character one or more times)
## [1] "MODEK_ctrl_r1" "MODEK_ctrl_r2" "MODEK_ctrl_r3" "MODEK_EV_r1_alt" "MODEK_EV_r1" "MODEK_EV_r2"
## [7] "Pure_EV_r1" "Pure_EV_r2"
## [1] MODEK_ctrl MODEK_ctrl MODEK_ctrl MODEK_EV MODEK_EV MODEK_EV Pure_EV Pure_EV
## Levels: MODEK_ctrl MODEK_EV Pure_EV
## group
## MODEK_ctrl MODEK_EV Pure_EV
## 3 3 2
With these two components, we can now create a DGEList
object to analyse our data with edgeR.
## An object of class "DGEList"
## $counts
## MODEK_ctrl_r1 MODEK_ctrl_r2 MODEK_ctrl_r3 MODEK_EV_r1_alt MODEK_EV_r1 MODEK_EV_r2 Pure_EV_r1
## TGAGGTAGTAGGTTGTATGGTT 731227 660554 790987 594588 47863 556576 22
## TGAGGTAGTAGGTTGTGTGGTT 282718 251073 305412 221319 10725 212029 23
## TGAGGTAGTAGGTTGTATAGTT 142811 120775 145612 117475 36641 111974 2312
## TGAGGTAGTAGGTTGTATGGT 62532 56285 70073 47725 3400 43606 8
## TGAGGTAGTAGGTTGTGTGGT 54367 47793 61973 33023 1750 31405 10
## Pure_EV_r2
## TGAGGTAGTAGGTTGTATGGTT 71
## TGAGGTAGTAGGTTGTGTGGTT 29
## TGAGGTAGTAGGTTGTATAGTT 4296
## TGAGGTAGTAGGTTGTATGGT 5
## TGAGGTAGTAGGTTGTGTGGT 10
## 66974 more rows ...
##
## $samples
## group lib.size norm.factors
## MODEK_ctrl_r1 MODEK_ctrl 2063899 1
## MODEK_ctrl_r2 MODEK_ctrl 1811880 1
## MODEK_ctrl_r3 MODEK_ctrl 2259655 1
## MODEK_EV_r1_alt MODEK_EV 1893224 1
## MODEK_EV_r1 MODEK_EV 1429975 1
## MODEK_EV_r2 MODEK_EV 1854187 1
## Pure_EV_r1 Pure_EV 1021579 1
## Pure_EV_r2 Pure_EV 1527242 1
One of the best criteria to decide if our experiment was successful
is to check if our replicates are similar to each other. This will not
always be the case, and one of the easiest ways to check this is to use
edgeR to create a Multi-Dimensional Scaling (MDS) plot. It
is quite similar to a PCA, if you are more used to those.
To make it easier to visualise, we can use colours or symbols to distinguish the different types of sample.
## MODEK_ctrl MODEK_EV Pure_EV
## "#FF0000" "#00FF00" "#0000FF"
As you might have expected, the biggest separation is caused by the
Pure_EV samples. We can also see that the sample called
MODEK_EV_r1, which we had already identified as problematic
for potentially containing degraded RNA, is more separated from the
other samples of the same type. We could use these results to argue for
removal of some samples.
Even though edgeR does not normalise the actual values
(counts) in the sense of transforming these, it does perform a
calculation of normalisation factors, which are used
internally.
## group lib.size norm.factors
## MODEK_ctrl_r1 MODEK_ctrl 2063899 0.4794710
## MODEK_ctrl_r2 MODEK_ctrl 1811880 0.5170582
## MODEK_ctrl_r3 MODEK_ctrl 2259655 0.4740660
## MODEK_EV_r1_alt MODEK_EV 1893224 0.6647895
## MODEK_EV_r1 MODEK_EV 1429975 0.8801520
## MODEK_EV_r2 MODEK_EV 1854187 0.6548752
## Pure_EV_r1 Pure_EV 1021579 4.8249135
## Pure_EV_r2 Pure_EV 1527242 4.6022430
The default method for this is called the weighted Trimmed Mean of M-values (TMM) which first ignores atypical expression values (those that are too high or too low, or those that change too much between samples) before calculating the normalisation factors. You can imagine these factors as values that are used to centre the remaining expression values, so that these reference expression values are actually centred on 0 average fold-change. You can also interpret these values as a number that can be multiplied by the sequencing depth of each library to obtain the normalised library size.
To visualise the effect of this normalisation step, we can generate Mean Difference (MD) plots. In these, we compare the average expression across all samples (X axis) against the relative expression of each sample against the rest (Y axis).
We want the bulk of the genes (black dots) to be centred around the red line, indicating that the changes in expression are not biased towards upward or downward changes.
par(mfrow=c(1,2))
for (i in c(3,4,5,7)) {
plotMD(cpm(dge, log=TRUE), column=i)
grid(col = "blue")
abline(h=0, col='red', lty=2, lwd=2)
}We can also verify if the normalisation had an effect on the MDS plot.
plotMDS(dge, col=colors[dge$samples$group], labels=colnames(dge$counts), main="MDS after normalization factors")Even though the normalisation process improved the sample grouping,
the previous plotMD figures suggest that the data are not really being
normalised properly, and if we continue we are likely to obtain biased
results. A sensible strategy is to remove the Pure_EV
samples and also remove the sample that seems to be degraded. To
compensate this possibility, I already included an alternative replicate
MODEK_EV_r1_alt, so that our experiment will still have
biological replicates for this sample type.
To make sure we do this correctly, we will start the analysis from the top, removing the problematic samples from the count table.
We will define a new counts object with the samples we
want to work with.
## [1] "MODEK_ctrl_r1" "MODEK_ctrl_r2" "MODEK_ctrl_r3" "MODEK_EV_r1_alt" "MODEK_EV_r1" "MODEK_EV_r2"
## [7] "Pure_EV_r1" "Pure_EV_r2"
## [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
counts <- counts0[,!colnames(counts0) %in% badSamples]
keep <- rowSums(cpm(counts) >= 1) >= 2
table(keep)## keep
## FALSE TRUE
## 79780 25010
We proceed to redefine the grouping factor, create the object for edgeR and calculate normalisation factors. It is important to not forget to do all these steps for the new data.
## [1] MODEK_ctrl MODEK_ctrl MODEK_ctrl MODEK_EV MODEK_EV
## Levels: MODEK_ctrl MODEK_EV
## group lib.size norm.factors
## MODEK_ctrl_r1 MODEK_ctrl 2060469 0.9904081
## MODEK_ctrl_r2 MODEK_ctrl 1809130 1.0791995
## MODEK_ctrl_r3 MODEK_ctrl 2256422 1.0512655
## MODEK_EV_r1_alt MODEK_EV 1852057 0.9146423
## MODEK_EV_r2 MODEK_EV 1811443 0.9730169
Let us see if this new attempt improves the plotMD results:
par(mfrow=c(1,2))
for (i in c(2,3,4,5)) {
plotMD(cpm(dge, log=TRUE), column=i)
grid(col = "blue")
abline(h=0, col='red', lty=2, lwd=2)
}Definitely much better, don’t you agree? See that in the MD plots the
main “cone” of the data is centred on zero. There is an additional
diagonal of points, moving upwards in the MODEK_EV samples.
These are likely the sequences that were present inside the EVs that
were internalised by the MODEK cells, while the dots that are centered
on zero are the mouse sequences that do not change too much upon EV
treatment. Compare these results to the previous MD plots, where the was
no clear set of dots around the horizontal zero.
We can also see if there is an effect on the MDS plot:
plotMDS(dge, col=colors[dge$samples$group], labels=colnames(dge$counts), main="MDS after removal of samples")The main separation (treated vs untreated samples) is more cleanly define on the X axis. While you might worry that the Y axis is now separating some of the replicates, do realise that the MDS process is designed to produce the 2nd (and 3rd, 4th, etc, although we do not use them here) most important variance component, and since we no longer have another biological difference, the Y axis was bound to separate some of the replicates. You can also compare the percent of explained variance in each of the axes.
In order to perform a differential expression analysis we need to
tell edgeR which samples belong to each condition or sample
type. It is possible to have complicated experimental designs, where
each sample belongs to more than one type (imagine different
combinations of times, treatments, cell-types, genotypes). For this
reason, we define the experimental design as a matrix. Again, although
we could define this matrix by hand, this could introduce unintented
errors, so it is recommended to use R functions and extract the data
from the object we already have.
## MODEK_ctrl MODEK_EV
## 1 1 0
## 2 1 0
## 3 1 0
## 4 0 1
## 5 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`dge$samples$group`
## [1] "contr.treatment"
The ~0 syntax is for designs where there is no clear
control group, so each type of sample is maintained in the design
matrix. (Technically this means we will not use an intercept term in the
linear models.)
The Negative Binomial distribution can be used for cases where the
Poisson does not adequately describe the observed variation in the data.
The Poisson distribution has a single parameter, lambda, that represents
the mean and the variance. The Negative Binomial has an additional
parameter for the over-dispersion of the data. A simple interpretation
is that it allows higher variance than the Poisson, which is usually the
case with real biological data. Since edgeR uses the
Negative Binomial distribution for calculating probabilities, we need to
calculate the over-dispersion parameter from our data.
This function calculates the dispersion at three levels:
We can visualise this with the following figure:
It is to be expected that genes with lower expression are more noisy and have higher variability. Genes with higher expression are usually more stable across replicates.
In the following steps we can choose which dispersion values we want to use. The common dispersion is usually the least appropriate, since it will be too high for genes with high expression, and too low for genes with low expression. The tagwise expression, especially in experiments with many replicates would be ideal, but if the number of replicates is low it might not be very robust.
In this case, our problem consists of trying to detect sequences that are likely to have very low expression. For this reason, we might want to start using the common dispersion, which will be more permisive for these types of sequences. Basically this means that we might be able to detect more of them, but we will also have more false positives.
We now need to define the questions we would like to ask of our data.
That is, what comparisons or contrasts we would like to perform between
sample types. There is a function called makeContrasts that
helps us to define these contrasts.
Since we have reduced our design to two types of sample, and we want to find the sRNAs that are entering cells as a consequence of EV treatment, we only really have one possible contrast. We define it like this:
## Contrasts
## Levels EV_effect
## MODEK_ctrl -1
## MODEK_EV 1
If we had a more complicated design, we would be able to add extra
contrast lines, where on the left side of the equation we define any
name for each contrast, and on the right side we define an algebraic
comparison. If we had sample types called A,
B, C, for example, we could define contrasts
as: "A - B", "B - C",
"A - C".
The next step is to perform the analysis of differences in expression for the contrasts (questions) that we have defined. This means that we are asking which genes (sRNAs in our case) have expression differences that are statistically different from zero.
As I mentioned above, we could have defined several different contrasts, but since our experimental design was quite simple, we only have one contrast. So it is easier to extract that single contrast from the matrix to continue.
## Contrasts
## Levels EV_effect
## MODEK_ctrl -1
## MODEK_EV 1
## MODEK_ctrl MODEK_EV
## -1 1
We will fit Generalized Linear Models to each gene, using the
experimental design (already included in dge) and the
dispersion we selected. We will then perform a Likelihood Ratio Test for
the contrast of interest.
fit <- glmFit(dge, dispersion=dge$common.dispersion)
lrt <- glmLRT(fit, contrast=contrast)
topTags(lrt)## Coefficient: -1*MODEK_ctrl 1*MODEK_EV
## logFC logCPM LR PValue FDR
## GATGACCAACCGGCTGTGGAAGC 16.12702 10.831735 89.21792 3.536319e-21 8.844333e-17
## GACGCCGAGATACGTGGGGACG 13.57184 8.279930 69.55421 7.434306e-17 9.296600e-13
## GACCAACCGGCTGTGGAAGCGT 13.46217 8.170591 68.71109 1.139981e-16 9.503641e-13
## GATAAGAGAATAGGCCGGACACGC 13.36516 8.073875 67.96532 1.663964e-16 1.040393e-12
## GAAGCGAAGATCACCGGATGTAGG 13.11095 7.820576 66.01170 4.482548e-16 2.242171e-12
## GAAGCAGGCGTCGAATGGACC 12.77641 7.487503 63.44198 1.651608e-15 6.290595e-12
## GTTTTCTGAACCCGGCAGCGATG 12.76001 7.471179 63.31602 1.760662e-15 6.290595e-12
## GAGATATGTGACTCTGTCGGGT 12.34882 7.062336 60.16008 8.744771e-15 2.733834e-11
## GACGCCGAGATACGTGGGGACGA 12.23615 6.950440 59.29595 1.356555e-14 3.769716e-11
## GAGCACCTTCTGATCGTTCGAGG 12.06651 6.782105 57.99547 2.627219e-14 6.570675e-11
By default, the last function returns the 10 most significantly different genes.
A next step might be to count how many genes pass certain criteria, such as a False-Discovery Rate of 5%. There is a function that can help us with this:
## dt
## -1 0 1
## 1142 9693 14175
And we can also visualise this in a figure:
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "panel.first" is not a graphical parameter
In many cases we would like to obtain the results for all the genes we analysed, for further exploration.
## logFC logCPM LR PValue FDR
## GATGACCAACCGGCTGTGGAAGC 16.12702 10.831735 89.21792 3.536319e-21 8.844333e-17
## GACGCCGAGATACGTGGGGACG 13.57184 8.279930 69.55421 7.434306e-17 9.296600e-13
## GACCAACCGGCTGTGGAAGCGT 13.46217 8.170591 68.71109 1.139981e-16 9.503641e-13
## GATAAGAGAATAGGCCGGACACGC 13.36516 8.073875 67.96532 1.663964e-16 1.040393e-12
## GAAGCGAAGATCACCGGATGTAGG 13.11095 7.820576 66.01170 4.482548e-16 2.242171e-12
## GAAGCAGGCGTCGAATGGACC 12.77641 7.487503 63.44198 1.651608e-15 6.290595e-12
## [1] 25010 5
We can now manually ask how many genes pass certain thresholds, or how many are up-regulated (positive logFC) or down-regulated (negative logFC).
## [1] "down-regulated: 1142"
## [1] "up-regulated: 14175"
counts0 table. According to the
Pure_EV columns, do you think these sequences are likely to
be from the parasite?