Please note: This practical is adapted from the data carpentry Intro to R and RStudio for Genomics https://datacarpentry.github.io/genomics-r-intro/index.html
ggplot2 is a plotting package, part of
the tidyverse, that makes it simple to create complex plots from data in
a data frame. It provides a more programmatic interface for specifying
what variables to plot, how they are displayed, and general visual
properties. Therefore, we only need minimal changes if the underlying
data change or if we decide to change from a bar plot to a scatter plot.
This helps in creating publication-quality plots with minimal amounts of
adjustments and tweaking.
The idea of mapping is crucial in
ggplot. One familiar example is to map the
value of one variable in a dataset to x and the other to y. However, we
often encounter datasets that include multiple (more than two)
variables. In this case, ggplot allows you to
map those other variables to visual marks such as
color and shape
(aesthetics or aes). One thing you may
want to remember is the difference between discrete and
continuous variables. Some aesthetics, such as the
shape of dots, do not accept continuous variables. If forced to do so, R
will give an error. This is easy to understand; we cannot create a
continuum of shapes for a variable, unlike, say, color.
Tip: when having doubts about whether a variable is
continuous
or discrete, a quick way to check is to use the summary()
function. Continuous variables have descriptive statistics but not the
discrete variables.
First, we need to load the ggplot2 package which is part
of the tidyverse package. In addition we will also use
ggrepel.
## ── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We are using here the table of differentially expressed genes generated in the last session. Please download this file: Diff_expr_seq.csv
Remember to set your working directory to the right folder, so you can read the file in correctly.
## Rows: 25010 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sequence
## dbl (5): logFC, logCPM, LR, PValue, FDR
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 6
## sequence logFC logCPM LR PValue FDR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 GATGACCAACCGGCTGTGGAAGC 16.1 10.8 89.2 3.54e-21 8.84e-17
## 2 GACGCCGAGATACGTGGGGACG 13.6 8.28 69.6 7.43e-17 9.30e-13
## 3 GACCAACCGGCTGTGGAAGCGT 13.5 8.17 68.7 1.14e-16 9.50e-13
## 4 GATAAGAGAATAGGCCGGACACGC 13.4 8.07 68.0 1.66e-16 1.04e-12
## 5 GAAGCGAAGATCACCGGATGTAGG 13.1 7.82 66.0 4.48e-16 2.24e-12
## 6 GAAGCAGGCGTCGAATGGACC 12.8 7.49 63.4 1.65e-15 6.29e-12
ggplot2 functions like data in the
long format, i.e., a column for every dimension
(variable), and a row for every observation. Well-structured data will
save you time when making figures with
ggplot2
ggplot2 graphics are built step-by-step
by adding new elements. Adding layers in this fashion allows for
extensive flexibility and customization of plots, and more equally
important the readability of the code.
The general format of ggplot2 is:
{ggplot(data = DATA, aes(x, y, other aesthetics)) +} geom_function()
ggplot() function and bind the plot to a
specific data frame using the data argumentaes) function),
by selecting the variables to be plotted and specifying how to present
them in the graph, e.g. as x and y positions or characteristics such as
size, shape, color, fill, etc.ggplot2 offers many
different geoms; we will use only geom_point() today, but
there are many others including:
geom_point() for scatter plots, dot plots,
etc.
geom_boxplot() for, well, boxplots!
geom_density()) for showing the distribution of a
variable in a dataset
geom_histogram() for histograms
…
To add a geom to the plot use + operator.
geom_point()We will first replicate the graph we generated at the end of last session. Plotting logCPM against logFC using a scatter plot:
Then, we start modifying this plot to extract more information from
it. For instance, we can add transparency (alpha) to avoid
over-plotting:
We can also add colors for all the points:
Or we can define a subset of points to colour - for example we want
to highlight all significant points. To do this we first need to define
which sequences are significantly changing in expression between MODEK
EVs and CTRL. This can be done by adding another column to our table
using the mutate() command together with
if_else(). A handy tool for streamlining data manipulation
and improving code readability in tidyverse is the piping function
%>%. It allows you to chain multiple operations together
by passing the output of one function as the input to the next:
de <- de %>%
mutate(significant = if_else(logFC > 2 & logCPM > 1 | logFC < -2 & logCPM > 1, "sig", "ns"))
head(de)## # A tibble: 6 × 7
## sequence logFC logCPM LR PValue FDR significant
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 GATGACCAACCGGCTGTGGAAGC 16.1 10.8 89.2 3.54e-21 8.84e-17 sig
## 2 GACGCCGAGATACGTGGGGACG 13.6 8.28 69.6 7.43e-17 9.30e-13 sig
## 3 GACCAACCGGCTGTGGAAGCGT 13.5 8.17 68.7 1.14e-16 9.50e-13 sig
## 4 GATAAGAGAATAGGCCGGACACGC 13.4 8.07 68.0 1.66e-16 1.04e-12 sig
## 5 GAAGCGAAGATCACCGGATGTAGG 13.1 7.82 66.0 4.48e-16 2.24e-12 sig
## 6 GAAGCAGGCGTCGAATGGACC 12.8 7.49 63.4 1.65e-15 6.29e-12 sig
This added an extra column with the values “sig” or “ns” (for non-significant) according to the defined cut-offs: logFC > 2 or < -2 and logCPM > 1.
Now we can add the color as another aesthetic to the plot.
If we would rather like to keep our favorite colour, we can easily set the colours manually. You can either use pre-defined colours like “blue” which we used above or you can enter any colour as hex code. We just need to remember to define two colours now, otherwise R will give an error:
ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
geom_point(alpha = 0.2)+
scale_colour_manual(values = c("darkgrey","blue"))# using hex colours
ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
geom_point(alpha = 0.2)+
scale_colour_manual(values = c("#a6a6a6","#0000ff"))We can easily change other plot parameters such as the axis titles, plot title and theme (general look of the plot) - there are lots of themes to choose from or you can define a completely custom design by defining all the parameters (for more info check this link ggplot2 themes):
ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
geom_point(alpha = 0.2)+
scale_colour_manual(values = c("darkgrey","blue"))+
labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
theme_minimal()We can also label points:
ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
geom_point(alpha = 0.2)+
scale_colour_manual(values = c("darkgrey","blue"))+
geom_text(aes(label = sequence), size = 3)+
labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
theme_minimal()As that looks pretty messy we could only label the most significant
sequences instead - this can be defined directly in the plot, we don’t
need to create another column in our table. In addition, we will swap
geom_text for geom_text_repel to increase
readability:
ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
geom_point(alpha = 0.2)+
scale_colour_manual(values = c("darkgrey","blue"))+
geom_text_repel(data = subset(de, logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10),
aes(x = logCPM, y = logFC, label = sequence))+
labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
theme_minimal()If we now wanted to highlight these points in a different colour it gets a bit more tricky, one of the easiest ways is to add an additional layer which plots only the points that you want to highlight:
ggplot(de, aes(x = logCPM, y = logFC)) +
geom_point(aes(colour = significant), alpha = 0.2)+
scale_colour_manual(values = c("darkgrey","blue"))+
geom_point(data = de %>% filter(logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10), aes(x = logCPM, y = logFC), colour = "#ff3399")+
geom_text_repel(data = subset(de, logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10),
aes(x = logCPM, y = logFC, label = sequence))+
labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
theme_minimal()We can also add dotted line to indicate cut-off values
ggplot(de, aes(x = logCPM, y = logFC)) +
geom_point(aes(colour = significant), alpha = 0.2)+
scale_colour_manual(values = c("darkgrey","blue"))+
geom_point(data = de %>% filter(logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10), aes(x = logCPM, y = logFC), colour = "#ff3399")+
geom_text_repel(data = subset(de, logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10),
aes(x = logCPM, y = logFC, label = sequence))+
labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
geom_vline(xintercept = 1, linetype = "dashed", colour = "black")+
geom_hline(yintercept = c(2, -2), linetype = "dashed", colour = "black")+
theme_minimal()Try to generate the other type of plot that is often used together with differential expression analysis - a volcano plot (also a type of scatter plot).
Hint: For a volcano plot you want to plot logFC against -log10 FDR, our table only contains FDR. We need to add a new column to our table with -log10 FDR.
Once you created the plot play around with the plot parameters, colouring and labelling.
Cheat cheats:
tidyverse: https://rstudio.github.io/cheatsheets/data-transformation.pdf
ggplot2: https://rstudio.github.io/cheatsheets/data-visualization.pdf
Useful links: