R TUTORIAL:
Visualization of pathway enrichment analysis results with clusterProfiler
Table of contents
Before you start
Step-by-step tutorial
Creating an enrichResult object
Barplots
Dotplots
Cnetplot
Heatplot
Treeplot
Enrichment map
Upsetplot
sessionInfo()
Before you start…
Welcome to Biostatsquid’s easy and step-by-step tutorial where you will learn how to visualize your pathway enrichment results. In this guide, we will explore different ways of plotting the gene sets and their genes after performing functional enrichment analysis with clusterProfiler.
If you haven’t yet, check out my blogpost on performing pathway enrichment analysis with clusterProfiler. This tutorial picks up where we left off, so it will ensure you have your results in the correct format. You can also check out the step-by-step tutorial in R for pathway enrichment analysis on Youtube.
If you performed pathway enrichment analysis with a different tool or package, don’t worry! This tutorial will still be useful. Just check that your results are an R data.frame with the following columns:
- ID
- Description
- GeneRatio
- BgRatio
- pvalue
- p.adjust
- qvalue
- geneID
- Count
- minuslog10padj
- diffexpressed
Throughout this tutorial, I will show you some nice ways to visualise your pathway enrichment analysis results. This will make them easier to interpret. You will learn how to create a heatmap, dotplot, cnetplot, treeplot and many more cool plots!
For this tutorial you will need R, or Rstudio, and you will need to install the following packages:
# For data management
install.packages('tidyverse')
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")
# For visualisation
install.packages('pheatmap')
install.packages("DOSE")
install.packages("enrichplot")
install.packages("ggupset")
If you are a beginner in R, don’t be overwhelmed! This tutorial will go step-by-step and I will explain (almost!) every line of code so you know what is happening at each point of the workflow.
You might also want to check out my Youtube tutorial on how to visualise your pathway enrichment analysis results. Or if you prefer a written step-by-step guide, then continue reading and don’t forget you can just copy the code by clicking on the top right corner of the code snippets in this post.
But enough introduction! Ready to enter the world of functional enrichment analysis with ClusterProfiler? Let’s dive in!
Squidtip
Want to know more about functional enrichment analysis? Check out my blogposts on Pathway Enrichment Analysis and Gene Set Enrichment Analysis – simply explained!
In this Youtube video, I explain how to visualise your Pathway Enrichment Analysis with the R package enrichplot()!
Step-by-step tutorial
Create an enrichResult object
For our following plots, we will use the package enrichplot(). For this, we first need to transform our results to an enrichResult() object. In this case, we will analyse pathways that were enriched for upregulated genes.
In other words, we will select the genes that were differentially upregulated in severe patients compared to healthy patients. Then, we will check which pathways are overrepresented in that list of genes, more than what would be expected by chance.
# Read in the data
res_df <- read.csv(paste0(out_path, 'clusterProfiler/', 'severevshealthy_reactome_resclusterp.csv'))
bg_genes <- readRDS(paste0(bg_path, 'reactome.RDS'))
# Convert clusterProfiler object to a new "enrichResult" object
# Select only upregulated genes in Severe
res_df <- res_df %>% filter(diffexpressed == 'UP') %>%
dplyr::select(!c('minuslog10padj', 'diffexpressed'))
rownames(res_df) <- res_df$ID
Squidtip
Important! The pathways that come up after selecting for ‘UP’ are not necessarily upregulated pathways. We are just selecting pathways that were enriched in our list of UPregulated genes. It could be that in our list there are a lot of genes that code for repressors or inhibitors of that pathway. In which case, the pathway would probably be downregulated compared to healthy patients.
If you would like to clear things up a bit more, try reading my other blogpost on pathway enrichment analysis – simply explained!
For visualisation purposes, let’s shorten the pathway names. Time for a gsub() ‘waterfall’!
(There are many more elegant ways of doing this, buf for the sake of simplicity, let’s just go ahead with this option).
# For visualisation purposes, let's shorten the pathway names
res_df$Description <- gsub('(H|h)iv', 'HIV',
gsub('pd 1', 'PD-1',
gsub('ecm', 'ECM',
gsub('(I|i)nterleukin', 'IL',
gsub('(R|r)na', 'RNA',
gsub('(D|d)na', 'DNA',
gsub(' i ', ' I ',
gsub('(A|a)tp ', 'ATP ',
gsub('(N|n)adh ', 'NADH ',
gsub('(N|n)ad ', 'NAD ',
gsub('t cell', 'T cell',
gsub('b cell', 'B cell',
gsub('built from .*', ' (...)',
gsub('mhc', 'MHC',
gsub('mhc class i', 'MHC I',
gsub('mhc class ii', 'MHC II',
stringr::str_to_sentence(
gsub('_', ' ',
gsub('GOBP_|KEGG_|REACTOME_', '', res_df$Description)))))))))))))))))))
Squidtip
Sometimes the column with the pathway names has very long names (too long!). This can make interpretation more difficult (or just make it look too clutered). There are many ways to fix this: you can edit the rownames to a shorter version (e.g., using gsub() or str_extract() functions like I do here), change the fontsize and/or the angle of the labels, only show the most significant pathways (e.g., top 5), show numbering and then provide a separate table with the list of pathways associated to each number…
And finally, let’s use the new() function from the enrichplot package to create our enrichResult object!
Just a few things to consider:
result
: our results data frame containing the results of our pathway enrichment analysis results outputted by clusterProfiler().pvalueCutoff
: The p-value cutoff used for selecting significant enrichment results.pAdjustMethod
: The method used for adjusting p-values, usually"BH"
for Benjamini-Hochberg correction.qvalueCutoff
: The q-value cutoff used for selecting significant enrichment results (similar to p-value but adjusted for multiple testing).gene
: A vector containing gene symbols or IDs associated with the analysis (so the names of our DEGs)universe
: A vector containing all possible genes that could be considered in the analysis. Important! This includes also genes that were not significantly differentially expressed between groups.gene2Symbol
: A character vector, used to map gene IDs to gene symbols. Since we are providing gene symbols already, we don’t need this!geneSets
: a named list, where the names are enriched GO terms (or KEGG, or whichever functional gene set terms you used) and the elements are DE genes annotated with that GO term.
enrichres <- new("enrichResult",
readable = FALSE,
result = res_df,
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
qvalueCutoff = 0.2,
organism = "human",
ontology = "UNKNOWN",
gene = df$gene_symbol,
keytype = "UNKNOWN",
universe = unique(bg_genes$gene),
gene2Symbol = character(0),
geneSets = bg_genes)
class(enrichres)
Barplot
A barplot is a great way to visualize your enriched pathways. It basically shows the enrichment scores (e.g. p-values) and the gene count or ratio as bar height and color.
# Barplot
barplot(enrichres, showCategory = 20)
mutate(enrichres, qscore = -log(p.adjust, base = 10)) %>%
barplot(x = "qscore")
Below, we see 20 top significantly enriched pathways (based on p-values) from our upregulated gene list. Red pathways are more significantly enriched than blue pathways. The number of genes that were present in our list of upregulated genes for that pathway are shown on the x-axis. In other words, more genes associated to that pathway = longer bars.
We can use mutate to show a different enrichment score, for example, displaying the -log10(p-adj). Since we did not specify, here only 8 pathways are shown.
Dotplot
A dotplot is very similar to a barplot, except that the number of genes is not shown by the bar length, but by the dot size.
# Dotplot
dotplot(enrichres, showCategory = 15) + ggtitle("Severe vs Healthy")
Cnetplot
We’re getting into some serious plotting now! A cnetplot shows the linkages between genes and pathways as a network.
The light yellow big nodes are pathways (the names are too long for this to be pretty!). Ant from each of them we have lines ‘coming out’ of them which are the genes associated to that pathway.
This way, we can easily identify clusters of genes that may share common biological functions or participate in the same molecular pathways. For example, in the lower cluster, we see STAT1, STAT3 and JAK2 (upregulated genes in severe patients) which participate in both ‘IL-12 family signaling’ and ‘Signaling by CSF3 G-CSF’ pathways.
# Cnetplot
cnetplot(enrichres)
Heatplot
A heatplot is similar to cnetplot. In this case, the x axis shows the list of genes we have (in this case, our upregulated genes). In the y axis, we have our pathways. The bars show which genes participated in which pathways. It is useful when there is a large number of significant pathways, as it makes it easier to visualize.
For example, our first gene, ATP5F1A, participates in ‘Formation of ATP by chemiosmotic coupling’, ‘Respiratory electron transport and ATP synthesis…’ and ‘The citric acid TCA cycle…’.
# Heatplot
heatplot(enrichres, showCategory = 5)
Treeplot
The treeplot() function performs hierarchical clustering of enriched pathways.
Many enrichment analyses use hierarchical ontologies to organize functional terms or gene sets into a hierarchical structure. For example, the Gene Ontology consists of thre main branches: Biological Process, Cellular Component, and Molecular Function. Each branch contains more specific terms organized in a hierarchical manner.
The treeplot() function from the enrichplot
package takes the enrichment results and the hierarchical ontology as inputs. It creates a tree-like plot where each pathway is represented by a node, and the parent-child relationships between terms are shown by connecting lines.
The nodes in the treeplot can be customized to convey additional information. For example, in this case, the node colour represents the significance of enrichment. The node size number of genes associated with each term. This allows us to quickly visualise important or highly enriched terms.
# Treeplot
enrichres2 <- pairwise_termsim(enrichres) # calculate pairwise similarities of the enriched terms using Jaccard’s similarity index
treeplot(enrichres2)
Enrichment map
An enrichment map organizes enriched pathways into a network with edges connecting overlapping gene sets or pathways. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional module.
Just like in the treeplot, the node colour represents the significance of enrichment. The node size number of genes associated with each term. This allows us to quickly visualise important or highly enriched terms.
# Enrichment map
emapplot(enrichres2)
Upsetplot
An upsetplot is an alternative to cnetplot. It emphasizes the gene overlapping among different gene sets.
In an upsetplot, each gene set or pathway is represented by a column, and the height of the column indicates the number of genes associated with that set or term. The intersections between the gene sets or terms are shown as connected lines or bars.
# Upsetplot
upsetplot(enrichres)
sessionInfo()
Check my sessionInfo() here in case you have trouble reproducting my steps:
> sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_Ireland.utf8 LC_CTYPE=English_Ireland.utf8
[3] LC_MONETARY=English_Ireland.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Ireland.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggupset_0.3.0 enrichplot_1.18.4 DOSE_3.24.2 org.Hs.eg.db_3.16.0
[5] AnnotationDbi_1.60.2 IRanges_2.32.0 S4Vectors_0.36.2 Biobase_2.58.0
[9] BiocGenerics_0.44.0 clusterProfiler_4.6.2 fgsea_1.24.0 mgsa_1.46.0
[13] gplots_3.1.3 pheatmap_1.0.12 RColorBrewer_1.1-3 lubridate_1.9.2
[17] forcats_1.0.0 stringr_1.5.0.9000 dplyr_1.0.10 purrr_1.0.1
[21] readr_2.1.4 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.2
[25] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] ggnewscale_0.4.9 colorspace_2.1-0 ggtree_3.6.2 gson_0.1.0
[5] ellipsis_0.3.2 qvalue_2.30.0 XVector_0.38.0 aplot_0.1.10
[9] rstudioapi_0.14 farver_2.1.1 graphlayouts_1.0.0 ggrepel_0.9.3
[13] bit64_4.0.5 scatterpie_0.2.1 fansi_1.0.3 codetools_0.2-19
[17] splines_4.2.3 cachem_1.0.8 GOSemSim_2.24.0 polyclip_1.10-4
[21] jsonlite_1.8.7 GO.db_3.16.0 png_0.1-8 ggforce_0.4.1
[25] compiler_4.2.3 httr_1.4.6 Matrix_1.5-3 fastmap_1.1.1
[29] lazyeval_0.2.2 cli_3.6.0 tweenr_2.0.2 tools_4.2.3
[33] igraph_1.5.0 gtable_0.3.3 glue_1.6.2 GenomeInfoDbData_1.2.9
[37] reshape2_1.4.4 fastmatch_1.1-3 Rcpp_1.0.11 vctrs_0.5.1
[41] Biostrings_2.66.0 ape_5.7-1 nlme_3.1-162 ggraph_2.1.0
[45] timechange_0.2.0 lifecycle_1.0.3 gtools_3.9.4 zlibbioc_1.44.0
[49] MASS_7.3-58.2 scales_1.2.1 tidygraph_1.2.3 hms_1.1.3
[53] parallel_4.2.3 memoise_2.0.1 gridExtra_2.3 downloader_0.4
[57] ggfun_0.1.1 HDO.db_0.99.1 yulab.utils_0.0.6 stringi_1.7.12
[61] RSQLite_2.3.1 tidytree_0.4.2 caTools_1.18.2 BiocParallel_1.32.6
[65] GenomeInfoDb_1.34.9 rlang_1.1.1 pkgconfig_2.0.3 bitops_1.0-7
[69] lattice_0.20-45 labeling_0.4.2 treeio_1.22.0 patchwork_1.1.2
[73] shadowtext_0.1.2 cowplot_1.1.1 bit_4.0.5 tidyselect_1.2.0
[77] plyr_1.8.8 magrittr_2.0.3 R6_2.5.1 generics_0.1.3
[81] DBI_1.1.3 pillar_1.9.0 withr_2.5.0 KEGGREST_1.38.0
[85] RCurl_1.98-1.12 crayon_1.5.2 KernSmooth_2.23-20 utf8_1.2.2
[89] tzdb_0.4.0 viridis_0.6.3 grid_4.2.3 data.table_1.14.6
[93] blob_1.2.4 digest_0.6.33 gridGraphics_0.5-1 munsell_0.5.0
[97] viridisLite_0.4.2 ggplotify_0.1.1
And that is the end of this tutorial!
In this post, I explained how to visualise your functional enrichment analysis results using enrichplot(). Hope you found it useful!
Before you go, you might want to check:
Squidtastic!
You made it till the end! Hope you found this post useful.
If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.
Otherwise, have a very nice day and… see you in the next one!
Hi, thank you very much for this helpful workflow! I am having issues being able to plot, I am unsure if it is a package/installation issue.
Ex:
barplot(enrichres.down, showCategory = 20)
Error in labels(…) :
lazy-load database ‘/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/yulab.utils/R/yulab.utils.rdb’ is corrupt
In addition: Warning messages:
1: In labels(…) : restarting interrupted promise evaluation
2: In labels(…) : internal error -3 in R_decompress1
Do you happen to know potentially what this issue may be a result from?
Also, when making the enrich object and it asks what organism to put, is that a mandatory parameter to fill out? I am looking at soils so don’t think that it will be on the list.
Thank you!
Hi again, I actually just found a simple fix was to restart my R studios and reload all the packages 🙂
Hi Lily, glad you found a solution! Sometimes restarting does the trick, yes;) Good luck and thanks for the feedback!:)
Such a great tutorial. I’ve been struggling with the GO analysis and your tutorials have been helping a lot. I was trying to use this workflow for my compareCluster result, however the heatplot doesn’t work 🙁
Another question, do you have try improve the network using the clusterProfile output as an input in clueGO? That would be a great tutorial 🙂
Hi Valeria, thanks for your feedback! I’m glad to hear my tutorials have been helping you:) What kind of error are you getting?
I’ve never used clueGO but I checked it out and looks like an interesting thing to look into! Thanks so much for your suggestion!
hello i want ask can we analysis use EnrichGO n EnrichKEGG only Gene Symbol ? cause i don’t know where i get the statistical value (ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, qvalue, geneID, Count, minuslog10padj, diffexpressed
) can you give me inform where i get the statistical value this?
Hi Arsy, sorry for the late reply – could you rephrase? Not sure I understand your question. You get all those columns after you run the tool. For that you need the inputs I describe in the tutorial – just make sure you have the dataframes in the right format. Let me know if you still have a question about this.
Fantastic tutorial! thanks so much for making biostats so accessible.
I wonder if you have any code or intention of doing a multiple group comparison for visualisation to showcase cluster profiler and others, as often what we want to show as biologist is the difference between up and down regulated or across conditions or both?
Hi! Thanks so much for your suggestion, it’s a great idea! Are you thinking of any visualisation in particular? A heatmap would probably cover multisample/multi-condition comparisons, and then you could annotate the genes/pathways based on if they are downregulated or upregulated – is this what you were thinking of?
Amazing tutorial thanks so much!
I was just wondering how to combine the visualisations esp dotplot to compare the down and up regulated genes. I see clusterprfiler has that ability but not sure how to incorporate the code.
Hi! Thanks for your comment! That’s a great question. I’m not sure you can do it with the clusterProfiler function -> if you want to merge your PEA results from upregulated and downregulated genes you have two different dataframes. You have several options though:
– build two different dotplots and merge them (e.g., with cowplot, patchwork or ggarrange – I have a tutorial and video on how to do this if you need help)
– merge the two clusterProfiler results while adding a new column ‘original_dataframe’ that marks the results from upregulated and downregulated genes. Then use the dotplot function and use the argument split=’original_dataframe’. This splits it into two windows, see the last plot of this discussion@ https://support.bioconductor.org/p/9146375/ Might be a bit tricky though since some pathways might be repeated, not sure how well this will work. So if you want this I would go with option 1.
– Otherwise subset the pathways you want to show from each analysis into a new dataframe and just build a dotplot yourself using ggplot. You can find some tips here http://www.sthda.com/english/wiki/ggplot2-dot-plot-quick-start-guide-r-software-and-data-visualization
Hope this helps!
Hi,
Thank you so much for this workflow – I was just wondering if there was a way to customise the colours on the emapplot, ideally I would like to colour scale the nodes by p.adjust but I have tried a number of codes and am not able to do this.
Hi! Thanks for your comment, great question. I haven’t tried it myself but it seems like it’s a standard ggplot object so you can just use + scale_colour_manual() or similar.
Maybe this will help: https://github.com/YuLab-SMU/enrichplot/issues/66
Cheers!
Hi ,
thank you so much for this great tutorial 🙂
I have a problem with plotting , when I used this code , I got this error:
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] :
replacement has length zero
In addition: Warning message:
In rep(yes, length.out = len) : 'x' is NULL so the result will be NULL
Do you have any idea what is the problem?
Hi, not sure but I think if you are using rep() to repeat the character value ‘yes’ , you need to add the quotation marks – rep(‘yes’, 5)
this was a super helpful tutorial, thank you!
I want to use the CompareCluster function from the clusterprofiler package to create a double dot plot with two groups and each group has the plot split in downregulated and upregulated pathways. can i do this using the enrichresult objects created from your tutorial? I’m having difficulty doing so.
thanks
Hi thanks! I don’t have any experience with comparecluster but try following this tutorial from the developer: https://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-comparecluster.html
hello.thank you very much for your great tutorial
when i want to get cnet plot i got this error?
Error in graph_from_data_frame(d = d, directed = directed, vertices = vertices) :
the data frame should contain at least two columns
could you please help me?
Thanks for your comment! I don’t think I can help you without seeing your inputs, but looks like your dataframe (d?) is missing columns!
Hi!
I’m getting the same error. Were you able to resolve the issue?
Many Thanks!
This tutorial was really helpful, Thank you.
But when running it i have a problem at start , I have installed the libraries using install.packages(“enrichplot”, dependencies=TRUE) but when i try to load them using library command it is showing that no such package exist in the system and yes i have tried uninstalling and installing it many times.
Can you suggest some way to solve this problem
Hi, enrichplot is a Bioconductor package, try this:
BiocManager::install(“enrichplot”)
https://www.bioconductor.org/packages/release/bioc/html/enrichplot.html
Hi, thank you………Its so helpful for me
You’re very welcome! Glad it was helpful:)
Hi, Thank you very much for your awsome tutorial!
I got a problem when following this process. I got 6 down and 3 up reactome ID, and trying to visualize with emapplot(enrichres2) seperately. However, the output is completely blank, didn’t show any gene netweork.
I’m very new for GSEA analysis, do you know what’s wrong the problem?
Thank you!