Differential expression analysis for smallRNA-Seq

Introduction

In this practical we will be analising our sRNA-Seq data, using the edgeR Bioconductor package. This package contains one of the most popular methods for analysing differential gene expression from sequencing data. Our data needs to be in the form of “counts” (integers), similar to what we produced at the end of the previous practical, and these counts will be compared to a Negative Binomial distribution. This type of distribution is often used then the data being modelled has higher variability than what a Poisson would predict. This is the case with sequencing data due to what we refer to as biological variability (any differences between samples beyond what is expected by sampling noise).

During this practical you will need two Bioconductor packages, so make sure you have them correctly installed:

library("edgeR")
library("statmod")

If you do not have either of them, you can install the correct version using the Bioconductor install function:

BiocManager::install(c("edgeR","statmod"))

To simplify this practical a bit, I already prepared a count table for our sRNA-Seq data. I basically used the tally Linux command to collapse all the short sequences detected across any of the libraries, thus defining a “universe” of all sRNAs in our experiment. I then kept any sequence that had at least 5 counts in total across all conditions. I then used tally on each individual library to obtain counts for each one, and combined these into a count matrix.

Please download the file that is the result of this process: all_tally_counts.tab.gz

Exploring and pre-processing the count data

The first thing you need to do is load the edgeR package in your R session:

library(edgeR)

I should point out that this is not a very complete tutorial, but a brief overview of some of the possibilities. If you need to work more with these types of analyses, I highly recommend you have a look at the edgeR manual, which you can open like this:

edgeRUsersGuide()

Now, where do you have your file with the expression counts?

countFile <- "all_tally_counts.tab.gz"

It is a simple table, so we can use read.table() to import the data into R:

counts0 <- read.table(countFile)

head(counts0, 12)

##                         MODEK_ctrl_r1 MODEK_ctrl_r2 MODEK_ctrl_r3 MODEK_EV_r1_alt MODEK_EV_r1 MODEK_EV_r2 Pure_EV_r1
## TGAGGTAGTAGGTTGTATGGTT         731227        660554        790987          594588       47863      556576         22
## TGAGGTAGTAGGTTGTGTGGTT         282718        251073        305412          221319       10725      212029         23
## TGAGGTAGTAGGTTGTATAGTT         142811        120775        145612          117475       36641      111974       2312
## TGAGGTAGTAGGTTGTATGGT           62532         56285         70073           47725        3400       43606          8
## TGAGGTAGTAGGTTGTGTGGT           54367         47793         61973           33023        1750       31405         10
## TGAGGTAGGAGGTTGTATAGTT          45504         38557         47125           39197        1095       36642          0
## TGAGGTAGTAGGTTGTGTGGTTT         41253         36301         42755           24802        1887       23941          0
## TGAGGTAGTAGGTTGTATGGTTT         36561         32563         37021           27443        2708       26046          0
## CGTAGAACCGACCTTGCG              19939         30328         38217           29544        3827       34961          0
## TAGGTAGTTTCATGTTGTTGGG          30373         24789         27680           29014         447       28141          0
## GATGACCAACCGGCTGTGGAAGC             0             0             0            8046        8046        7682      42233
## CTAGACTGAGGCTCCTTGAGG           24038         21321         21915           21879        1029       20172          0
##                         Pure_EV_r2
## TGAGGTAGTAGGTTGTATGGTT          71
## TGAGGTAGTAGGTTGTGTGGTT          29
## TGAGGTAGTAGGTTGTATAGTT        4296
## TGAGGTAGTAGGTTGTATGGT            5
## TGAGGTAGTAGGTTGTGTGGT           10
## TGAGGTAGGAGGTTGTATAGTT           0
## TGAGGTAGTAGGTTGTGTGGTTT         13
## TGAGGTAGTAGGTTGTATGGTTT          2
## CGTAGAACCGACCTTGCG               0
## TAGGTAGTTTCATGTTGTTGGG           0
## GATGACCAACCGGCTGTGGAAGC      59302
## CTAGACTGAGGCTCCTTGAGG            3

You should already be able to see an interesting sequence:

counts0["GATGACCAACCGGCTGTGGAAGC",]

##                         MODEK_ctrl_r1 MODEK_ctrl_r2 MODEK_ctrl_r3 MODEK_EV_r1_alt MODEK_EV_r1 MODEK_EV_r2 Pure_EV_r1
## GATGACCAACCGGCTGTGGAAGC             0             0             0            8046        8046        7682      42233
##                         Pure_EV_r2
## GATGACCAACCGGCTGTGGAAGC      59302

Surely this looks like a parasite sequence that has entred the mouse cells, right?

Another thing you might have noticed even with this brief look at the table, is that the Pure_EV libraries seem very different to the rest. This could cause a problem with the analysis. Why? One of the assumptions of differential gene expression analysis is that there are always a large number of genes (sRNAs in our case) that are not really changing between the conditions we want to compare. If there are samples, such as our Pure_EV samples, where all the genes have very different expression values to the other conditions, this would break this assumption, and we would have to remove them from our analysis.

Remember that for this experiment, our most important question will be to compare MODEK_EV against MODEK_ctrl samples, to ask which sRNA sequences are significantly higher in MODEK_EV and thus could be parasite sRNAs.

Why can we not simply look for sequences that are present (counts > 0) in the MODEK_EV samples and absent (counts == 0) in MODEK_ctrl samples?

This is an obvious question that you might be thinking about. Although this strategy will work in some cases, there are several reasons why it will not always work:

Low abundance RNAs (and parasitic sequences are expected to be low abundance) will quite easily be missed in some replicates, even if they are truly there. This would mean that in one replicate we might have high counts, but in another we could have 0 counts. How to distinguish this from noise?
Some parasite sRNAs can have very similar or identical sequence to those in mouse (e.g. highly conserved miRNAs). In these cases we would expect an increase of counts in MODEK_EV compared to MODEK_ctrl, but not >0 vs 0.
Unfortunately sequencing is not perfect, and a small amount of molecules will end up being sequenced with the wrong sample barcode. This means that some real parasite sRNAs could end up having small (but > 0) counts in the MODEK_ctrl samples even if they are truly not there.

For these reasons, it is better to perform a statistical comparison between these samples, as we will be doing in this practical.

So, let’s get back to exploring the data. Can you find out how many “genes” are in the table? How many samples?

dim(counts0)

## [1] 104790      8

Since different samples could be sequenced to different depths, we will need to normalise this in some way. It is useful to find out just how different the total number of counts are in each sample:

round(colSums(counts0) / 1e6, 1)

##   MODEK_ctrl_r1   MODEK_ctrl_r2   MODEK_ctrl_r3 MODEK_EV_r1_alt     MODEK_EV_r1     MODEK_EV_r2      Pure_EV_r1 
##             2.1             1.8             2.3             1.9             1.5             1.9             1.1 
##      Pure_EV_r2 
##             1.6

Before we actually get to the differential expression analysis, it is a good idea to identify genes (sRNAs in this case) that were sequenced at extremely low levels, and remove them. The reason for this is that when the counts are too low, we will not be able to find any statistically significant differences (this has to do with statistical power). For instance, if a gene simply has count values of 0 in some conditions and 1 in others, although that might seem like a clear change, it could equally have occurred just by chance. Unless we had a ridiculously big number of replicates, no statistical method will allow us to conclude that 1 is significantly higher than 0 in these cases. Something similar will happen with other small values.

If we do not remove these cases before starting our analysis, we will end up having a bigger problem of multiple testing. If you are not familiar with this concept, the final practical has some examples. For now, just trust me…

To perform this filtering step, we need to propose a value for low expression, to find the genes that are not expressed higher than this, and remove them. But, if our samples have different total sequencing depth, it would be a bad idea to propose a value based on counts. This is because a sample with more total counts, just by chance, will have higher counts for a similarly lowly expressed gene than a sample with lower total counts.

The cpm() function allows us to correct this problem, since it calculates counts per million. This simply means that it divides each count value by the total sum of its column (sample) and multiples the result by 1 million. These cpm values are now comparable between columns, so we can use them to choose a threshold for low expression.

keep <- rowSums(cpm(counts0) >= 1) >= 2

table(keep)

## keep
## FALSE  TRUE 
## 37811 66979

counts <- counts0[keep,]

dim(counts)

## [1] 66979     8

This last bit of code will find all the cpm values that are at least 1 (1 read for each million reads in that sample) and it will then choose genes (rows) that have such a cpm value in at least 2 samples. It is quite reasonable to discard a large fraction of genes with this type of filter. Warning: we cannot use these exact same values for every experiment we might come across. In this case, I chose a cpm of 1 because the sequencing depth of this experiment didn’t allow me to choose a smaller value (many samples had about 1 million total reads). I then chose at least 2 samples because some of the sample types were only sequenced in duplicate, so a gene that might be specific to such a type of sample would only be expressed in 2 samples across the whole experiment.

I would usually recommend trying out different combinations of filters. You want to eliminate a large number of genes that are uninformative (too lowly expressed) but also retain a representative number of genes. There are no hard rules for this, and some intuition about your experiment is important. In the case of sRNAs, we might want to keep a large number of them, since we are dealing with individual sRNA sequences. On the other hand, we do not want to only keep sequences that come from samples which have high sequencing depth.

An alternative is to use an automatic function that edgeR has to make this choice. Careful: this function can return VERY different results if we use only the original count matrix, or in addition a grouping factor from the experimental design (which we shall explore further down).

table(filterByExpr(counts0))

## No group or design set. Assuming all samples belong to one group.

## 
##  FALSE   TRUE 
## 104648    142

In this case, I would definitely advise against using this function. Do remember that what is recommended in a manual might not be the best option for all cases.

The next step is to tell edgeR how our samples are grouped. In our case, we have three types of sample (MODEK, MODEK+EV and EV). We can define each sample type by hand, but it is much better to try to do this automatically, both to save time, but also to avoid easy-to-make errors.

The column names already contain this information, we just have to trim them a bit. We can use an R function to substitute (in this case remove) some text from another test. (‘.’ means any character, ‘+’ means repeat the previous character or special character one or more times)

colnames(counts)

## [1] "MODEK_ctrl_r1"   "MODEK_ctrl_r2"   "MODEK_ctrl_r3"   "MODEK_EV_r1_alt" "MODEK_EV_r1"     "MODEK_EV_r2"    
## [7] "Pure_EV_r1"      "Pure_EV_r2"

group <- factor(sub("_r.+", "", colnames(counts)))

group

## [1] MODEK_ctrl MODEK_ctrl MODEK_ctrl MODEK_EV   MODEK_EV   MODEK_EV   Pure_EV    Pure_EV   
## Levels: MODEK_ctrl MODEK_EV Pure_EV

table(group)

## group
## MODEK_ctrl   MODEK_EV    Pure_EV 
##          3          3          2

With these two components, we can now create a DGEList object to analyse our data with edgeR.

dge <- DGEList(counts=counts, group=group)

dge

## An object of class "DGEList"
## $counts
##                        MODEK_ctrl_r1 MODEK_ctrl_r2 MODEK_ctrl_r3 MODEK_EV_r1_alt MODEK_EV_r1 MODEK_EV_r2 Pure_EV_r1
## TGAGGTAGTAGGTTGTATGGTT        731227        660554        790987          594588       47863      556576         22
## TGAGGTAGTAGGTTGTGTGGTT        282718        251073        305412          221319       10725      212029         23
## TGAGGTAGTAGGTTGTATAGTT        142811        120775        145612          117475       36641      111974       2312
## TGAGGTAGTAGGTTGTATGGT          62532         56285         70073           47725        3400       43606          8
## TGAGGTAGTAGGTTGTGTGGT          54367         47793         61973           33023        1750       31405         10
##                        Pure_EV_r2
## TGAGGTAGTAGGTTGTATGGTT         71
## TGAGGTAGTAGGTTGTGTGGTT         29
## TGAGGTAGTAGGTTGTATAGTT       4296
## TGAGGTAGTAGGTTGTATGGT           5
## TGAGGTAGTAGGTTGTGTGGT          10
## 66974 more rows ...
## 
## $samples
##                      group lib.size norm.factors
## MODEK_ctrl_r1   MODEK_ctrl  2063899            1
## MODEK_ctrl_r2   MODEK_ctrl  1811880            1
## MODEK_ctrl_r3   MODEK_ctrl  2259655            1
## MODEK_EV_r1_alt   MODEK_EV  1893224            1
## MODEK_EV_r1       MODEK_EV  1429975            1
## MODEK_EV_r2       MODEK_EV  1854187            1
## Pure_EV_r1         Pure_EV  1021579            1
## Pure_EV_r2         Pure_EV  1527242            1

Data analysis, consistency of replicates

One of the best criteria to decide if our experiment was successful is to check if our replicates are similar to each other. This will not always be the case, and one of the easiest ways to check this is to use edgeR to create a Multi-Dimensional Scaling (MDS) plot. It is quite similar to a PCA, if you are more used to those.

To make it easier to visualise, we can use colours or symbols to distinguish the different types of sample.

colors <- rainbow(length(levels(group)))
names(colors) <- levels(group)
colors

## MODEK_ctrl   MODEK_EV    Pure_EV 
##  "#FF0000"  "#00FF00"  "#0000FF"

plotMDS(dge, col=colors[dge$samples$group], labels=colnames(dge$counts))

As you might have expected, the biggest separation is caused by the Pure_EV samples. We can also see that the sample called MODEK_EV_r1, which we had already identified as problematic for potentially containing degraded RNA, is more separated from the other samples of the same type. We could use these results to argue for removal of some samples.

Data normalisation

Even though edgeR does not normalise the actual values (counts) in the sense of transforming these, it does perform a calculation of normalisation factors, which are used internally.

dge <- calcNormFactors(dge)

dge$samples

##                      group lib.size norm.factors
## MODEK_ctrl_r1   MODEK_ctrl  2063899    0.4794710
## MODEK_ctrl_r2   MODEK_ctrl  1811880    0.5170582
## MODEK_ctrl_r3   MODEK_ctrl  2259655    0.4740660
## MODEK_EV_r1_alt   MODEK_EV  1893224    0.6647895
## MODEK_EV_r1       MODEK_EV  1429975    0.8801520
## MODEK_EV_r2       MODEK_EV  1854187    0.6548752
## Pure_EV_r1         Pure_EV  1021579    4.8249135
## Pure_EV_r2         Pure_EV  1527242    4.6022430

The default method for this is called the weighted Trimmed Mean of M-values (TMM) which first ignores atypical expression values (those that are too high or too low, or those that change too much between samples) before calculating the normalisation factors. You can imagine these factors as values that are used to centre the remaining expression values, so that these reference expression values are actually centred on 0 average fold-change. You can also interpret these values as a number that can be multiplied by the sequencing depth of each library to obtain the normalised library size.

To visualise the effect of this normalisation step, we can generate Mean Difference (MD) plots. In these, we compare the average expression across all samples (X axis) against the relative expression of each sample against the rest (Y axis).

We want the bulk of the genes (black dots) to be centred around the red line, indicating that the changes in expression are not biased towards upward or downward changes.

par(mfrow=c(1,2))

for (i in c(3,4,5,7)) {
  plotMD(cpm(dge, log=TRUE), column=i)
  grid(col = "blue")
  abline(h=0, col='red', lty=2, lwd=2)
}

We can also verify if the normalisation had an effect on the MDS plot.

plotMDS(dge, col=colors[dge$samples$group], labels=colnames(dge$counts), main="MDS after normalization factors")

Even though the normalisation process improved the sample grouping, the previous plotMD figures suggest that the data are not really being normalised properly, and if we continue we are likely to obtain biased results. A sensible strategy is to remove the Pure_EV samples and also remove the sample that seems to be degraded. To compensate this possibility, I already included an alternative replicate MODEK_EV_r1_alt, so that our experiment will still have biological replicates for this sample type.

Re-generating a new object for differential expression

To make sure we do this correctly, we will start the analysis from the top, removing the problematic samples from the count table.

We will define a new counts object with the samples we want to work with.

badSamples <- c("MODEK_EV_r1","Pure_EV_r1","Pure_EV_r2")
colnames(counts0)

## [1] "MODEK_ctrl_r1"   "MODEK_ctrl_r2"   "MODEK_ctrl_r3"   "MODEK_EV_r1_alt" "MODEK_EV_r1"     "MODEK_EV_r2"    
## [7] "Pure_EV_r1"      "Pure_EV_r2"

colnames(counts0) %in% badSamples

## [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE

counts <- counts0[,!colnames(counts0) %in% badSamples]

keep <- rowSums(cpm(counts) >= 1) >= 2
table(keep)

## keep
## FALSE  TRUE 
## 79780 25010

counts <- counts[keep,]

We proceed to redefine the grouping factor, create the object for edgeR and calculate normalisation factors. It is important to not forget to do all these steps for the new data.

group <- factor(sub("_r.+", "", colnames(counts)))
group

## [1] MODEK_ctrl MODEK_ctrl MODEK_ctrl MODEK_EV   MODEK_EV  
## Levels: MODEK_ctrl MODEK_EV

dge <- DGEList(counts=counts, group=group)

dge <- calcNormFactors(dge)
dge$samples

##                      group lib.size norm.factors
## MODEK_ctrl_r1   MODEK_ctrl  2060469    0.9904081
## MODEK_ctrl_r2   MODEK_ctrl  1809130    1.0791995
## MODEK_ctrl_r3   MODEK_ctrl  2256422    1.0512655
## MODEK_EV_r1_alt   MODEK_EV  1852057    0.9146423
## MODEK_EV_r2       MODEK_EV  1811443    0.9730169

Let us see if this new attempt improves the plotMD results:

par(mfrow=c(1,2))

for (i in c(2,3,4,5)) {
  plotMD(cpm(dge, log=TRUE), column=i)
  grid(col = "blue")
  abline(h=0, col='red', lty=2, lwd=2)
}

Definitely much better, don’t you agree? See that in the MD plots the main “cone” of the data is centred on zero. There is an additional diagonal of points, moving upwards in the MODEK_EV samples. These are likely the sequences that were present inside the EVs that were internalised by the MODEK cells, while the dots that are centered on zero are the mouse sequences that do not change too much upon EV treatment. Compare these results to the previous MD plots, where the was no clear set of dots around the horizontal zero.

We can also see if there is an effect on the MDS plot:

plotMDS(dge, col=colors[dge$samples$group], labels=colnames(dge$counts), main="MDS after removal of samples")

The main separation (treated vs untreated samples) is more cleanly define on the X axis. While you might worry that the Y axis is now separating some of the replicates, do realise that the MDS process is designed to produce the 2nd (and 3rd, 4th, etc, although we do not use them here) most important variance component, and since we no longer have another biological difference, the Y axis was bound to separate some of the replicates. You can also compare the percent of explained variance in each of the axes.

Experimental design

In order to perform a differential expression analysis we need to tell edgeR which samples belong to each condition or sample type. It is possible to have complicated experimental designs, where each sample belongs to more than one type (imagine different combinations of times, treatments, cell-types, genotypes). For this reason, we define the experimental design as a matrix. Again, although we could define this matrix by hand, this could introduce unintented errors, so it is recommended to use R functions and extract the data from the object we already have.

design <- model.matrix(~0+dge$samples$group)

colnames(design) <- levels(dge$samples$group)

design

##   MODEK_ctrl MODEK_EV
## 1          1        0
## 2          1        0
## 3          1        0
## 4          0        1
## 5          0        1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`dge$samples$group`
## [1] "contr.treatment"

The ~0 syntax is for designs where there is no clear control group, so each type of sample is maintained in the design matrix. (Technically this means we will not use an intercept term in the linear models.)

Estimating dispersion

The Negative Binomial distribution can be used for cases where the Poisson does not adequately describe the observed variation in the data. The Poisson distribution has a single parameter, lambda, that represents the mean and the variance. The Negative Binomial has an additional parameter for the over-dispersion of the data. A simple interpretation is that it allows higher variance than the Poisson, which is usually the case with real biological data. Since edgeR uses the Negative Binomial distribution for calculating probabilities, we need to calculate the over-dispersion parameter from our data.

dge <- estimateDisp(dge, design=design, robust=TRUE)

This function calculates the dispersion at three levels:

Common: a global or average value for all genes
Trended: a moving average calculated across different expression levels
Tagwise: an individual value for each gene

We can visualise this with the following figure:

plotBCV(dge)

It is to be expected that genes with lower expression are more noisy and have higher variability. Genes with higher expression are usually more stable across replicates.

In the following steps we can choose which dispersion values we want to use. The common dispersion is usually the least appropriate, since it will be too high for genes with high expression, and too low for genes with low expression. The tagwise expression, especially in experiments with many replicates would be ideal, but if the number of replicates is low it might not be very robust.

In this case, our problem consists of trying to detect sequences that are likely to have very low expression. For this reason, we might want to start using the common dispersion, which will be more permisive for these types of sequences. Basically this means that we might be able to detect more of them, but we will also have more false positives.

Differential expression analysis

We now need to define the questions we would like to ask of our data. That is, what comparisons or contrasts we would like to perform between sample types. There is a function called makeContrasts that helps us to define these contrasts.

Since we have reduced our design to two types of sample, and we want to find the sRNAs that are entering cells as a consequence of EV treatment, we only really have one possible contrast. We define it like this:

contrasts <- makeContrasts(
  "EV_effect" = "MODEK_EV - MODEK_ctrl",
  levels=dge$design
)

contrasts

##             Contrasts
## Levels       EV_effect
##   MODEK_ctrl        -1
##   MODEK_EV           1

If we had a more complicated design, we would be able to add extra contrast lines, where on the left side of the equation we define any name for each contrast, and on the right side we define an algebraic comparison. If we had sample types called A, B, C, for example, we could define contrasts as: "A - B", "B - C", "A - C".

The next step is to perform the analysis of differences in expression for the contrasts (questions) that we have defined. This means that we are asking which genes (sRNAs in our case) have expression differences that are statistically different from zero.

As I mentioned above, we could have defined several different contrasts, but since our experimental design was quite simple, we only have one contrast. So it is easier to extract that single contrast from the matrix to continue.

contrasts

##             Contrasts
## Levels       EV_effect
##   MODEK_ctrl        -1
##   MODEK_EV           1

contrast <- contrasts[,1]

contrast

## MODEK_ctrl   MODEK_EV 
##         -1          1

We will fit Generalized Linear Models to each gene, using the experimental design (already included in dge) and the dispersion we selected. We will then perform a Likelihood Ratio Test for the contrast of interest.

fit <- glmFit(dge, dispersion=dge$common.dispersion)

lrt <- glmLRT(fit, contrast=contrast)

topTags(lrt)

## Coefficient:  -1*MODEK_ctrl 1*MODEK_EV 
##                             logFC    logCPM       LR       PValue          FDR
## GATGACCAACCGGCTGTGGAAGC  16.12702 10.831735 89.21792 3.536319e-21 8.844333e-17
## GACGCCGAGATACGTGGGGACG   13.57184  8.279930 69.55421 7.434306e-17 9.296600e-13
## GACCAACCGGCTGTGGAAGCGT   13.46217  8.170591 68.71109 1.139981e-16 9.503641e-13
## GATAAGAGAATAGGCCGGACACGC 13.36516  8.073875 67.96532 1.663964e-16 1.040393e-12
## GAAGCGAAGATCACCGGATGTAGG 13.11095  7.820576 66.01170 4.482548e-16 2.242171e-12
## GAAGCAGGCGTCGAATGGACC    12.77641  7.487503 63.44198 1.651608e-15 6.290595e-12
## GTTTTCTGAACCCGGCAGCGATG  12.76001  7.471179 63.31602 1.760662e-15 6.290595e-12
## GAGATATGTGACTCTGTCGGGT   12.34882  7.062336 60.16008 8.744771e-15 2.733834e-11
## GACGCCGAGATACGTGGGGACGA  12.23615  6.950440 59.29595 1.356555e-14 3.769716e-11
## GAGCACCTTCTGATCGTTCGAGG  12.06651  6.782105 57.99547 2.627219e-14 6.570675e-11

By default, the last function returns the 10 most significantly different genes.

A next step might be to count how many genes pass certain criteria, such as a False-Discovery Rate of 5%. There is a function that can help us with this:

dt <- decideTests(lrt, adjust.method="BH", p.value=0.05, lfc=0)

table(dt)

## dt
##    -1     0     1 
##  1142  9693 14175

And we can also visualise this in a figure:

deGenes <- rownames(lrt)[dt != 0]

plotSmear(lrt, de.tags=deGenes)

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "panel.first" is not a graphical parameter

In many cases we would like to obtain the results for all the genes we analysed, for further exploration.

topTable <- topTags(lrt, n=Inf)$table

head(topTable)

##                             logFC    logCPM       LR       PValue          FDR
## GATGACCAACCGGCTGTGGAAGC  16.12702 10.831735 89.21792 3.536319e-21 8.844333e-17
## GACGCCGAGATACGTGGGGACG   13.57184  8.279930 69.55421 7.434306e-17 9.296600e-13
## GACCAACCGGCTGTGGAAGCGT   13.46217  8.170591 68.71109 1.139981e-16 9.503641e-13
## GATAAGAGAATAGGCCGGACACGC 13.36516  8.073875 67.96532 1.663964e-16 1.040393e-12
## GAAGCGAAGATCACCGGATGTAGG 13.11095  7.820576 66.01170 4.482548e-16 2.242171e-12
## GAAGCAGGCGTCGAATGGACC    12.77641  7.487503 63.44198 1.651608e-15 6.290595e-12

dim(topTable)

## [1] 25010     5

We can now manually ask how many genes pass certain thresholds, or how many are up-regulated (positive logFC) or down-regulated (negative logFC).

print(paste("down-regulated:", sum(topTable$logFC < 0 & topTable$FDR < 0.05)))

## [1] "down-regulated: 1142"

print(paste("up-regulated:", sum(topTable$logFC > 0 & topTable$FDR < 0.05)))

## [1] "up-regulated: 14175"

Exercises

Obtain a list of all the sequences that are “up-regulated” upon EV treatment. These should contain all the candidates of being parasite sequences.
Explore the level of expression (counts) for these sequences in the original counts0 table. According to the Pure_EV columns, do you think these sequences are likely to be from the parasite?
Check the first nucleotide of these sequences. Remember that parasite sequences frequently start with a G. Can you use this to argue in favour of your results?