Introduction to ggplot2

Please note: This practical is adapted from the data carpentry Intro to R and RStudio for Genomics https://datacarpentry.github.io/genomics-r-intro/index.html

ggplot2 is a plotting package, part of the tidyverse, that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatter plot. This helps in creating publication-quality plots with minimal amounts of adjustments and tweaking.

The idea of mapping is crucial in ggplot. One familiar example is to map the value of one variable in a dataset to x and the other to y. However, we often encounter datasets that include multiple (more than two) variables. In this case, ggplot allows you to map those other variables to visual marks such as color and shape (aesthetics or aes). One thing you may want to remember is the difference between discrete and continuous variables. Some aesthetics, such as the shape of dots, do not accept continuous variables. If forced to do so, R will give an error. This is easy to understand; we cannot create a continuum of shapes for a variable, unlike, say, color.

Tip: when having doubts about whether a variable is continuous or discrete, a quick way to check is to use the summary() function. Continuous variables have descriptive statistics but not the discrete variables.

First, we need to load the ggplot2 package which is part of the tidyverse package. In addition we will also use ggrepel.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)

Loading the dataset

We are using here the table of differentially expressed genes generated in the last session. Please download this file: Diff_expr_seq.csv

Remember to set your working directory to the right folder, so you can read the file in correctly.

de <- read_csv("Diff_expr_seq.csv")
## Rows: 25010 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sequence
## dbl (5): logFC, logCPM, LR, PValue, FDR
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(de)
## # A tibble: 6 × 6
##   sequence                 logFC logCPM    LR   PValue      FDR
##   <chr>                    <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 GATGACCAACCGGCTGTGGAAGC   16.1  10.8   89.2 3.54e-21 8.84e-17
## 2 GACGCCGAGATACGTGGGGACG    13.6   8.28  69.6 7.43e-17 9.30e-13
## 3 GACCAACCGGCTGTGGAAGCGT    13.5   8.17  68.7 1.14e-16 9.50e-13
## 4 GATAAGAGAATAGGCCGGACACGC  13.4   8.07  68.0 1.66e-16 1.04e-12
## 5 GAAGCGAAGATCACCGGATGTAGG  13.1   7.82  66.0 4.48e-16 2.24e-12
## 6 GAAGCAGGCGTCGAATGGACC     12.8   7.49  63.4 1.65e-15 6.29e-12

ggplot2 functions like data in the long format, i.e., a column for every dimension (variable), and a row for every observation. Well-structured data will save you time when making figures with ggplot2

ggplot2 graphics are built step-by-step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots, and more equally important the readability of the code.

The general format of ggplot2 is:

{ggplot(data = DATA, aes(x, y, other aesthetics)) +} geom_function()

  • use the ggplot() function and bind the plot to a specific data frame using the data argument
  • define a mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x and y positions or characteristics such as size, shape, color, fill, etc.
  • add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use only geom_point() today, but there are many others including:
    • geom_point() for scatter plots, dot plots, etc.

    • geom_boxplot() for, well, boxplots!

    • geom_density()) for showing the distribution of a variable in a dataset

    • geom_histogram() for histograms

To add a geom to the plot use + operator.

Scatter plots with geom_point()

We will first replicate the graph we generated at the end of last session. Plotting logCPM against logFC using a scatter plot:

ggplot(de, aes(x = logCPM, y = logFC)) +
  geom_point()

Then, we start modifying this plot to extract more information from it. For instance, we can add transparency (alpha) to avoid over-plotting:

ggplot(de, aes(x = logCPM, y = logFC)) +
  geom_point(alpha = 0.2)

We can also add colors for all the points:

ggplot(de, aes(x = logCPM, y = logFC)) +
  geom_point(alpha = 0.2, colour = "blue")

Or we can define a subset of points to colour - for example we want to highlight all significant points. To do this we first need to define which sequences are significantly changing in expression between MODEK EVs and CTRL. This can be done by adding another column to our table using the mutate() command together with if_else(). A handy tool for streamlining data manipulation and improving code readability in tidyverse is the piping function %>%. It allows you to chain multiple operations together by passing the output of one function as the input to the next:

de <- de %>% 
  mutate(significant = if_else(logFC > 2 & logCPM > 1 | logFC < -2 & logCPM > 1, "sig", "ns"))

head(de)
## # A tibble: 6 × 7
##   sequence                 logFC logCPM    LR   PValue      FDR significant
##   <chr>                    <dbl>  <dbl> <dbl>    <dbl>    <dbl> <chr>      
## 1 GATGACCAACCGGCTGTGGAAGC   16.1  10.8   89.2 3.54e-21 8.84e-17 sig        
## 2 GACGCCGAGATACGTGGGGACG    13.6   8.28  69.6 7.43e-17 9.30e-13 sig        
## 3 GACCAACCGGCTGTGGAAGCGT    13.5   8.17  68.7 1.14e-16 9.50e-13 sig        
## 4 GATAAGAGAATAGGCCGGACACGC  13.4   8.07  68.0 1.66e-16 1.04e-12 sig        
## 5 GAAGCGAAGATCACCGGATGTAGG  13.1   7.82  66.0 4.48e-16 2.24e-12 sig        
## 6 GAAGCAGGCGTCGAATGGACC     12.8   7.49  63.4 1.65e-15 6.29e-12 sig

This added an extra column with the values “sig” or “ns” (for non-significant) according to the defined cut-offs: logFC > 2 or < -2 and logCPM > 1.

Now we can add the color as another aesthetic to the plot.

ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
  geom_point(alpha = 0.2)

If we would rather like to keep our favorite colour, we can easily set the colours manually. You can either use pre-defined colours like “blue” which we used above or you can enter any colour as hex code. We just need to remember to define two colours now, otherwise R will give an error:

ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
  geom_point(alpha = 0.2)+
  scale_colour_manual(values = c("darkgrey","blue"))

# using hex colours
ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
  geom_point(alpha = 0.2)+
  scale_colour_manual(values = c("#a6a6a6","#0000ff"))

We can easily change other plot parameters such as the axis titles, plot title and theme (general look of the plot) - there are lots of themes to choose from or you can define a completely custom design by defining all the parameters (for more info check this link ggplot2 themes):

ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
  geom_point(alpha = 0.2)+
  scale_colour_manual(values = c("darkgrey","blue"))+
  labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
  theme_minimal()

We can also label points:

ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
  geom_point(alpha = 0.2)+
  scale_colour_manual(values = c("darkgrey","blue"))+
  geom_text(aes(label = sequence), size = 3)+
  labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
  theme_minimal()

As that looks pretty messy we could only label the most significant sequences instead - this can be defined directly in the plot, we don’t need to create another column in our table. In addition, we will swap geom_text for geom_text_repel to increase readability:

ggplot(de, aes(x = logCPM, y = logFC, colour = significant)) +
  geom_point(alpha = 0.2)+
  scale_colour_manual(values = c("darkgrey","blue"))+
  geom_text_repel(data = subset(de, logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10), 
            aes(x = logCPM, y = logFC, label = sequence))+
  labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
  theme_minimal()

If we now wanted to highlight these points in a different colour it gets a bit more tricky, one of the easiest ways is to add an additional layer which plots only the points that you want to highlight:

ggplot(de, aes(x = logCPM, y = logFC)) +
  geom_point(aes(colour = significant), alpha = 0.2)+
  scale_colour_manual(values = c("darkgrey","blue"))+
  geom_point(data = de %>% filter(logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10), aes(x = logCPM, y = logFC), colour = "#ff3399")+
  geom_text_repel(data = subset(de, logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10), 
            aes(x = logCPM, y = logFC, label = sequence))+
  labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
  theme_minimal()

We can also add dotted line to indicate cut-off values

ggplot(de, aes(x = logCPM, y = logFC)) +
  geom_point(aes(colour = significant), alpha = 0.2)+
  scale_colour_manual(values = c("darkgrey","blue"))+
  geom_point(data = de %>% filter(logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10), aes(x = logCPM, y = logFC), colour = "#ff3399")+
  geom_text_repel(data = subset(de, logFC > 2 & logCPM > 10 | logFC < -2 & logCPM > 10), 
            aes(x = logCPM, y = logFC, label = sequence))+
  labs(x = "Average logCPM", y = "log Fold Change", title = "Differential expression")+
  geom_vline(xintercept = 1, linetype = "dashed", colour = "black")+
  geom_hline(yintercept = c(2, -2), linetype = "dashed", colour = "black")+
  theme_minimal()

Exercise

Try to generate the other type of plot that is often used together with differential expression analysis - a volcano plot (also a type of scatter plot).

Hint: For a volcano plot you want to plot logFC against -log10 FDR, our table only contains FDR. We need to add a new column to our table with -log10 FDR.

Once you created the plot play around with the plot parameters, colouring and labelling.