R TUTORIAL:

Create your own volcano plot in R

in 5 simple steps

Table of contents

5
Before you start
5
Volcano plots in R: complete script
5
Step-by-step guide to create your volcano plot
9
Step 1: Set up your script
9
Step 2: Get the data ready
9
Step 3: Create a basic volcano plot
9
Step 4: Customise it!
9
Step 5: Export and save it
Before you start…

In this tutorial you will learn how to make a volcano plot in 5 simple steps.

A volcano plot is a of scatterplot that shows statistical significance (p-value) versus magnitude of change (fold change). It is a great way of visualising the results from differential gene expression analysis.

If you want to know more about volcano plots and how to interpret them, you can check out this other post.

Making a volcano plot in R is very easy. In this step-by-step guide you will be able to create your own volcano plot. You will also learn how to customise it by adding gene annotations, changing the colours, point sizes…

If you are a beginner in R, don’t be overwhelmed! It is worth making this first effort to learn how to generate a volcano plot in R. Creating your first volcano plot might take 15 minutes, but then the next ones after that will barely take 2 min. Super fast and really easy!

You might also want to check out my Youtube tutorial on how to create a volcano plot in R. Or if you prefer written instructions, then please continue reading.

In this Youtube video, I explain how to create your own volcano plot in R step by step!

Here is a sneak-peak of what we’ll be working on in this tutorial! You can use this code to create a volcano plot in R, or keep reading to follow my step-by-step guide.

myvolcanoplot <- ggplot(data = df, aes(x = log2fc, y = -log10(pval), col = diffexpressed, label = delabel)) +<br />
  geom_vline(xintercept = c(-0.6, 0.6), col = "gray", linetype = 'dashed') +<br />
  geom_hline(yintercept = -log10(0.05), col = "gray", linetype = 'dashed') +<br />
  geom_point(size = 2) +<br />
  scale_color_manual(values = c("#00AFBB", "grey", "#bb0c00"), # to set the colours of our variable<br />
                     labels = c("Downregulated", "Not significant", "Upregulated")) + # to set the labels in case we want to overwrite the categories from the dataframe (UP, DOWN, NO)<br />
  coord_cartesian(ylim = c(0, 250), xlim = c(-10, 10)) + # since some genes can have minuslog10padj of inf, we set these limits<br />
  labs(color = 'Severe', #legend_title,<br />
       x = expression("log"[2]*"FC"), y = expression("-log"[10]*"p-value")) +<br />
  scale_x_continuous(breaks = seq(-10, 10, 2)) + # to customise the breaks in the x axis<br />
  ggtitle('Thf-like cells in severe COVID vs healthy patients') + # Plot title<br />
  geom_text_repel(max.overlaps = Inf) # To show all labels 

Create a volcano plot in 5 steps

The dataset

For this tutorial, I will use the human COVID T cell single-cell RNA-seq dataset from Bacher et al. (2020). You can read more about the dataset and obtain it for free here.

I already pre-processed the data and carried out the differential gene expression analysis between severe and healthy samples for Tfh-like cells. To follow this tutorial, you can just download the DGE results here. Right click, then ‘Save link as’.

Step 1: Set up your script

First, we need to load the necessary libraries:

# Loading relevant libraries 
library(tidyverse) # includes ggplot2, for data visualisation. dplyr, for data manipulation.
library(RColorBrewer) # for a colourful plot
library(ggrepel) # for nice annotations

Step 2: Get the data ready

Let’s get started! The first step when making a volcano plot in R is of course to load the results from differential gene expression analysis.

# Set input path
path <- "C:/Users/Documents/Biostatsquid/Volcano_plots/Tutorial/"
setwd(path)

# Import DGE results
df <- read.csv(paste0(path, 'severevshealthy_degresults.csv'), row.names = 1)

Squidtip

If you want to check which files are in your folder, use

https://www.high-endrolex.com/48

list.files(path)

R will print out all the files you have in your current working directory – basically the folder where R will go looking for your files. And then it is easier to copy paste the filename in the read.csv() function to avoid typos!

As you can see, the dataframe has 4 columns:

      • a column with the gene symbols
      • a columns with the p-values,
      • a column with the p-adjusted values
      • a column with the log2 fold changes.

For our volcano plot we will use the p-values (you can also use p-adjusted values, it depends how stringent you want to be) and log2fold change.

Step 3: Create a basic volcano plot

A volcano plot is a scatterplot. In the x axis, we will represent the log2 fold change. In the y axis we will represent the -log10pvalues. Then we call ggplot2’s geom_point() to get a scatterplot.

# Create a basic volcano plot
ggplot(data = df, aes(x = log2fc, y = -log10(pval))) +
         geom_point()

A common addition to volcano plots is to show the thresholds. Threshold lines indicate where the limits are:

  • which genes are considered significant, and which are not.
  • which genes are considered upregulated, which genes are considered downregulated, and which genes do not have a significant fold change.

For this we will need to add two vertical lines for log2FoldChange thresholds, and one horizontal line for the p-value threshold.

But, how do we decide which threshold to set?

Actually, you are the one who decides and sets the limits. If have a lot of genes, you might want to be more stringent with your p-value threshold. But, most of all, it depends on your data. More of that on another post!

For this example, we will consider genes to be upregulated if the log2fold change if it is > 0.6, downregulated if it is <0.6. Statistically significant genes have a p value < 0.05 (but you can also set it to 0.01 if you want to be more stringent).

We can also decide the colour and type of line.

# Add threshold lines
ggplot(data = df, aes(x = log2fc, y = -log10(pval))) +
  geom_vline(xintercept = c(-0.6, 0.6), col = "gray", linetype = 'dashed') +
  geom_hline(yintercept = -log10(0.05), col = "gray", linetype = 'dashed') + 
  geom_point() + 

Wohoo! You now added your threshold. Volcano plot looking a bit better now right? Now comes the fun part: customising your volcano plot to make it look pretty and publication-ready!

Squidtip

By the way, did you notice how I added the geom_point() after the x- and y- limits? This is because you want the points (genes) to be places on top of the lines, not the other way around. You can test it by changing the order, you will see how ggplot adds the elements in the order you write them in your script.

Step 4: Customise it!

Now that you have the main ‘skeleton’ of your volcano plot, it is time to customise it. You will be able to edit your plot to make it pretty & publication-ready!

Set your theme

The theme sets general appearance of your plot. You can customise the thickness of the axis, the size of he axis labels, position of the legend…

If you don’t know where to start, you can find ggplot2 themes here. You can always use them as a base and then edit specific parts you want to change.

Add the following lines of code before your ggplot() object and run the script again. Alternatively, you can just add ‘+ theme()’ directly to the ggplot object to edit, but I find it cleaner to set general theme and then customise certain parts for my individual plots if I need it. This has the advantage you can just copy paste the theme to any of your R scripts to keep a consistent look across all of your plots. 

# Biostatsquid theme
theme_set(theme_classic(base_size = 20) +
            theme(
              axis.title.y = element_text(face = "bold", margin = margin(0,20,0,0), size = rel(1.1), color = 'black'),
              axis.title.x = element_text(hjust = 0.5, face = "bold", margin = margin(20,0,0,0), size = rel(1.1), color = 'black'),
              plot.title = element_text(hjust = 0.5)
            ))

That’s better! Our volcano plot looks much more readable now. Feel free to experiment with the theme options, (almost) anything is possible! You can customise the font, the thickness of the lines, add/remove axis lines, add/remove grid lines or axis ticks…
If you are looking for something in particular, you can check how to edit different elements here (or just Google it, you will probably find a solution faster).

Change point colour and size

You might want to customise the point colour or size of your volcano plot.

For example, a popular option is to colour the genes depending if they are up or down regulated. To highlight significantly differentially expressed genes (DEGs), we will add a column to the dataframe to specify if they are UP- or DOWN-regulated (log2(Fold Change) positive or negative, respectively). 

# Add a column to the data frame to specify if they are UP- or DOWN- regulated (log2fc respectively positive or negative)<br /><br /><br />
df$diffexpressed <- "NO"<br /><br /><br />
# if log2Foldchange > 0.6 and pvalue < 0.05, set as "UP"<br /><br /><br />
df$diffexpressed[df$log2fc > 0.6 & df$pval < 0.05] <- "UP"<br /><br /><br />
# if log2Foldchange < -0.6 and pvalue < 0.05, set as "DOWN"<br /><br /><br />
df$diffexpressed[df$log2fc < -0.6 & df$pval < 0.05] <- "DOWN"</p><br /><br />
<p># Explore a bit<br /><br /><br />
head(df[order(df$padj) & df$diffexpressed == 'DOWN', ])<br /><br /><br />

If you check your dataframe now, you will see the newly created column, ‘diffexpressed’, has ‘UP’, ‘DOWN’ or ‘NO’ depending if the genes are up or downregulated, or have not significantly changed. This is the column we will colour by.

Now we only need to set the colour to the column ‘diffexpressed’, and ggplot will do the rest!

With scale_color_manual we can manually edit the colours of our points. We can also override the labels (which would otherwise be ‘UP’, ‘DOWN’ and ‘NO’) with the label option.

ggplot(data = df, aes(x = log2fc, y = -log10(pval), col = diffexpressed)) +<br /><br /><br />
  geom_vline(xintercept = c(-0.6, 0.6), col = "gray", linetype = 'dashed') +<br /><br /><br />
  geom_hline(yintercept = -log10(0.05), col = "gray", linetype = 'dashed') +<br /><br /><br />
  geom_point(size = 2) +<br /><br /><br />
  scale_color_manual(values = c("#00AFBB", "grey", "#FFDB6D"), # to set the colours of our variable<br /><br /><br />
                     labels = c("Downregulated", "Not significant", "Upregulated")) # to set the labels in case we want to overwrite the categories from the dataframe (UP, DOWN, NO)</p><br /><br />
<p>

Edit axis labels and limits

You can edit the axis labels. If you want to make the subscripts for the log2(FC) and log10(p-value) you can also do it like this:

# Edit axis labels and limits
ggplot(data = df, aes(x = log2fc, y = -log10(pval), col = diffexpressed)) +
  geom_vline(xintercept = c(-0.6, 0.6), col = "gray", linetype = 'dashed') +
  geom_hline(yintercept = -log10(0.05), col = "gray", linetype = 'dashed') + 
  geom_point(size = 2) + 
  scale_color_manual(values = c("#00AFBB", "grey", "#bb0c00"), # to set the colours of our variable  
                     labels = c("Downregulated", "Not significant", "Upregulated")) + # to set the labels in case we want to overwrite the categories from the dataframe (UP, DOWN, NO)
  coord_cartesian(ylim = c(0, 50), xlim = c(-10, 10)) + # since some genes can have minuslog10padj of inf, we set these limits
  labs(color = 'Severe', #legend_title, 
       x = expression("log"[2]*"FC"), y = expression("-log"[10]*"p-value")) + 
  scale_x_continuous(breaks = seq(-10, 10, 2)) # to customise the breaks in the x axis
  
# Note. with coord_cartesian() even if we have genes with p-values or log2FC ourside our limits, they will still be plotted.

Add a title

Don’t forget to add a title to our plot!

Add gene annotations

Often we want to see the names of certain genes on the plots. Remember at the start we loaded the library ggrepel? We can add annotations based on a column with geom_text_repel(). This will add the gene symbol annotations next to the points without covering each other.

You can decide which genes names to show.  If you use the column ‘gene_symbol’ of our dataframe, ALL gene symbols will be displayed, the result is a bit too overwhelming (my laptop kinda collapsed so I don’t recommend you to try it).

Clearly, we do not need our 30.000 + genes to be displayed. You might want to show the top 10 upregulated genes, or ribosomal genes, or a specific list of interesting genes you want to highlight.

The way to do it is the same in all cases. You just adapt it a bit to your specific case.

Basically, you have to create a new column, which will contain the gene names of those genes you want to show, and NA if you don’t want the gene to be displayed.

For example, let’s show the top 30 statistically significant genes that are either upregulated or not. For this we will create a new column in our dataframe which contains the names of the top 30 differentially expressed genes, and NA for all the other genes.

We can do this with one line of code:

# Create a new column "delabel" to de, that will contain the name of the top 30 differentially expressed genes (NA in case they are not)
df$delabel <- ifelse(df$gene_symbol %in% head(df[order(df$padj), "gene_symbol"], 30), df$gene_symbol, NA)

Now add the label to your ggplot object and geom_text_repel() to edit what the text looks like. You can also add a border to each of the labels, edit the colour… More on ggrepel here.

ggplot(data = df, aes(x = log2fc, y = -log10(pval), col = diffexpressed, label = delabel)) +
  geom_vline(xintercept = c(-0.6, 0.6), col = "gray", linetype = 'dashed') +
  geom_hline(yintercept = -log10(0.05), col = "gray", linetype = 'dashed') + 
  geom_point(size = 2) + 
  scale_color_manual(values = c("#00AFBB", "grey", "#bb0c00"), # to set the colours of our variable  
                     labels = c("Downregulated", "Not significant", "Upregulated")) + # to set the labels in case we want to overwrite the categories from the dataframe (UP, DOWN, NO)
  coord_cartesian(ylim = c(0, 250), xlim = c(-10, 10)) + # since some genes can have minuslog10padj of inf, we set these limits
  labs(color = 'Severe', #legend_title, 
       x = expression("log"[2]*"FC"), y = expression("-log"[10]*"p-value")) + 
  scale_x_continuous(breaks = seq(-10, 10, 2)) + # to customise the breaks in the x axis
  ggtitle('Thf-like cells in severe COVID vs healthy patients') + # Plot title 
  geom_text_repel(max.overlaps = Inf) # To show all labels 
 

Step 5: Export and save your volcano plot

Once we are happy with how our volcano plot looks like, we can export it as a .pdf, .png, .tiff… You choose!

You can use the ‘Export > Save as’ option within the plots window, or use a few lines of code to customise your export. If you need help with exporting, check out this page.

myvolcanoplot <- ggplot(data = df, aes(x = log2fc, y = -log10(pval), col = diffexpressed, label = delabel)) +
  geom_vline(xintercept = c(-0.6, 0.6), col = "gray", linetype = 'dashed') +
  geom_hline(yintercept = -log10(0.05), col = "gray", linetype = 'dashed') + 
  geom_point(size = 2) + 
  scale_color_manual(values = c("#00AFBB", "grey", "#bb0c00"), # to set the colours of our variable  
                     labels = c("Downregulated", "Not significant", "Upregulated")) + # to set the labels in case we want to overwrite the categories from the dataframe (UP, DOWN, NO)
  coord_cartesian(ylim = c(0, 250), xlim = c(-10, 10)) + # since some genes can have minuslog10padj of inf, we set these limits
  labs(color = 'Severe', #legend_title, 
       x = expression("log"[2]*"FC"), y = expression("-log"[10]*"p-value")) + 
  scale_x_continuous(breaks = seq(-10, 10, 2)) + # to customise the breaks in the x axis
  ggtitle('Thf-like cells in severe COVID vs healthy patients') + # Plot title 
  geom_text_repel(max.overlaps = Inf) # To show all labels 


# Open the file that will contain your plot (the name is up to you)
pdf(file = "myvolcanoplot.pdf", width = 8, height = 12) # you can change the size of the output file
# Execute the plot
myvolcanoplot
# Close the file that will contain the plot
dev.off()


# You can also save it as png, jpeg, tiff, bmp, svg, ps...
# For more on saving plots in R check https://biocorecrg.github.io/CRG_RIntroduction/with-the-console.html

And that is the end of this tutorial!

In this post, I explained how to create a volcano plot in R. Hope you found it useful!

Before you go, you might want to check:

Squidtastic!

You made it till the end! Hope you found this post useful.

If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.

Otherwise, have a very nice day and… see you in the next one!

Squids don't care much for coffee,

but Laura loves a hot cup in the morning!

If you like my content, you might consider buying me a coffee.

You can also leave a comment or a 'like' in my posts or Youtube channel, knowing that they're helpful really motivates me to keep going:)

Cheers and have a 'squidtastic' day!

And that is the end of this tutorial!

In this post, I explained the differences between log2FC and p-value, and why in differential gene expression analysis we don't always get both high log2FC and low p-value. Hope you found it useful!

Before you go, you might want to check:

Squidtastic!

You made it till the end! Hope you found this post useful.

If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.

Otherwise, have a very nice day and... see you in the next one!