Easy Gene Set Enrichment Analysis in R with fgsea()

A step-by-step R tutorial on performing Gene Set Enrichment Analysis with R package fgsea

In this tutorial you will learn how to conduct Gene Set Enrichment Analysis (GSEA) using R-package fgsea. This package implements an algorithm for fast gene set enrichment analysis.

Following this easy, step-by-step tutorial, you will find out how to:

- Install and start fgsea()
- Prepare your dataset to perform GSEA
- Set the analysis parameters and run the analysis.
- View the GSEA results and get some nice plots!

If you are more of a video-based learner, you can check out my Youtube video on how to perform GSEA with fgsea() in R. Otherwise, just keep reading! And remember you can just copy the code by clicking on the top right corner of the code snippets in this post.

Let’s dive in!

Check out my Youtube video to follow this fgsea R tutorial with me!

Before you start…

In this tutorial we will learn how to perform Gene Set Enrichment Analysis on differential gene expression results. We will use the R package fgsea() for fast preranked gene set enrichment analysis (GSEA).

If you are not that familiar with GSEA, you might want to check my other post on GSEA – simply explained! I go over the basic concepts behind GSEA algorithms and how to interpret the results.

If you are a beginner in R, don’t be overwhelmed! This tutorial will go step-by-step and I will explain (almost!) every line of code so you know what is happening at each point of the workflow.

For this tutorial you will need R, or Rstudio, and you will need to install fgsea.

Install fgsea()

You can read more about fgsea() package here. To install fgsea, use the following command:

BiocManager::install("fgsea")

Once installed, we can load the package:

library("fgsea")

Set up your environment

It’s good practice to clear up your environment before you load in any data. This is the general structure I like to keep in all my scripts:

# ----------------------
# GSEA tutorial
# ----------------------

# Setting up environment ===================================================

# Clean environment
rm(list = ls(all.names = TRUE)) # will clear all objects including hidden objects
gc() # free up memory and report the memory usage
options(max.print = .Machine$integer.max, scipen = 999, stringsAsFactors = F, dplyr.summarise.inform = F) # avoid truncated output in R console and scientific notation

# Set seed
set.seed(123456)

# Set project library
.libPaths('C:/Users/laura/Documents/Biostatsquid/Scripts/R4.2.3')
# Loading relevant libraries 
library(tidyverse) # includes ggplot2, for data visualisation. dplyr, for data manipulation.
library(RColorBrewer) # for a colourful plot
library(fgsea)

# Set relevant paths
list.files()
in_path <- "Datasets/"
out_path <- "PEA/Results/"

What do I need to run fgsea?

Essentially, there are two things you need to perform preranked gene set enrichment analysis with fgsea.

The fgsea package takes in two main arguments.

The first one is pathways: a list of gene sets or pathways to check

The second one is stats: a named vector of our genes of interest we want to perform GSEA on. The gene names must be the same as the ones in pathways! So make sure you are using the same nomenclature (i.e., gene IDs or Ensemble IDs).

That’s it!

There are additional parameters you may be interested in such as:

minSize: minimal size of a gene set to test (all pathways below the threshold are excluded)
maxSize: the same, but a maximum threshold.
scoreType: the default is ‘std’, where the enrichment score is computed normally, but you can also use one-tailed tests if you are only interested in positive (‘pos’) enrichment – so only pathways that are overrepresented – or negative (‘neg’) enrichment – to only get pathways that are underrepresented in your list of genes.

1. Prepare your gene sets / pathways

As we already saw in the Pathway Enrichment Analysis with clusterProfiler() tutorial, we need to prepare a list of pathways.

1. Get your gene set .gmt file. For example, you can download it from:

https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp

Just select ‘Gene Symbols’ (or Entrez IDs if you want to stick to Entrez IDs). Download the file, and copy it into your desired folder (I keep them in a separate folder, ‘PEA/Background_genes’).

2. Filter the .gmt file and convert it to a format accepted by fgsea. As I explain in my previous post, we need to use a set of background genes that contain only the genes present in our dataframe, to avoid bias. You can use the following code:

# Function: Adjacency matrix to list -------------------------
matrix_to_list <- function(pws){
  pws.l <- list()
  for (pw in colnames(pws)) {
    pws.l[[pw]] <- rownames(pws)[as.logical(pws[, pw])]
  }
  return(pws.l)
}

# Get all the genes in your dataset and assign them to my_genes 
my_genes <- df$gene_symbol

# Download gene sets .gmt files
#https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp
# Copy the .gmt file to your folder, in my case, its 'PEA/Background_genes/'
# Then read in the .gmt file
gmt_file <- "PEA/Background_genes/c2.cp.kegg.v7.5.1.symbols.gmt"
hidden <- unique(unlist(gmt))
  
# Convert gmt file to a matrix with the genes as rows and for each go annotation (columns) the values are 0 or 1
mat <- matrix(NA, dimnames = list(hidden, names(gmt)),
       nrow = length(hidden), ncol = length(gmt))
for (i in 1:dim(mat)[2]){
       mat[,i] <- as.numeric(hidden %in% gmt[[i]])
}
#Subset to the genes that are present in our data to avoid bias
hidden1 <- intersect(genes_in_data, hidden)
mat <- mat[hidden1, colnames(mat)[which(colSums(mat[hidden1,])>5)]] # filter for gene sets with more than 5 genes annotated
# And get the list again using the function we previously defined
final_list <- matrix_to_list(mat)

Alternatively, if you want to run GSEA on many gene sets (e.g., KEGG, Reactome, Gene Ontology…), you can use a function:

# Set relevant paths
list.files()
in_path <- "Datasets/"
out_path <- "PEA/Results/"
bg_path <- "PEA/Background_genes/"

# Functions ===================================================
## Function: Adjacency matrix to list -------------------------
matrix_to_list <- function(pws){
  pws.l <- list()
  for (pw in colnames(pws)) {
    pws.l[[pw]] <- rownames(pws)[as.logical(pws[, pw])]
  }
  return(pws.l)
}

## Function: prepare_gmt --------------------------------------
prepare_gmt <- function(gmt_file, genes_in_data, savefile = FALSE){
  # for debug
  #file <- gmt_files[1]
  #genes_in_data <- df$gene_symbol
  
  # Read in gmt file
  gmt <- gmtPathways(gmt_file)
  hidden <- unique(unlist(gmt))
  
  # Convert gmt file to a matrix with the genes as rows and for each go annotation (columns) the values are 0 or 1
  mat <- matrix(NA, dimnames = list(hidden, names(gmt)),
                nrow = length(hidden), ncol = length(gmt))
  for (i in 1:dim(mat)[2]){
    mat[,i] <- as.numeric(hidden %in% gmt[[i]])
  }
  
  #Subset to the genes that are present in our data to avoid bias
  hidden1 <- intersect(genes_in_data, hidden)
  mat <- mat[hidden1, colnames(mat)[which(colSums(mat[hidden1,])>5)]] # filter for gene sets with more than 5 genes annotated
  # And get the list again
  final_list <- matrix_to_list(mat) # for this we use the function we previously defined
  
  if(savefile){
    saveRDS(final_list, file = paste0(gsub('.gmt', '', gmt_file), '_subset_', format(Sys.time(), '%d%m'), '.RData'))
  }
  
  print('Wohoo! .gmt conversion successfull!:)')
  return(final_list)
}

# Analysis ====================================================

## 1. Read in data -----------------------------------------------------------
list.files(in_path)
df <- read.csv(paste0(in_path, 'severevshealthy_degresults.csv'), row.names = 1)

## 2. Prepare background genes -----------------------------------------------

# Download gene sets .gmt files
#https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp

# For GSEA
# Filter out the gmt files for KEGG, Reactome and GOBP
my_genes <- df$gene_symbol
list.files(bg_path)
gmt_files <- list.files(path = bg_path, pattern = '.gmt', full.names = TRUE)
gmt_files
bg_genes <- prepare_gmt(gmt_files[2], genes_in_data, savefile = FALSE)

This way, if I want to use a different background gene set, I can just change the parameters of my custom function, prepare_gmt():

# gmt_files is just a list of filenames of all my .gmt files (I have 3):
> gmt_files
[1] "PEA/Background_genes/c2.cp.kegg.v7.5.1.symbols.gmt"     "PEA/Background_genes/c2.cp.reactome.v7.5.1.symbols.gmt"
[3] "PEA/Background_genes/c5.go.bp.v7.5.1.symbols.gmt

# You can use a list of filenames, and just change [1], [2], [3]...
bg_genes <- prepare_gmt(gmt_files[1], my_genes, savefile = FALSE)
            
# Or write it out
bg_genes <- prepare_gmt("PEA/Background_genes/c5.go.bp.v7.5.1.symbols.gmt", my_genes, savefile = FALSE)



Squidtip

Did you know? The fgsea package already provides a function that returns a list of Reactome pathways for a given vector of Entrez gene IDs.

Just call reactomePathways(genes).

Your bg_genes object should be a list with your pathway or gene set names, and the genes that make up each pathway.

2. Prepare your ranked list of genes

We would like to perform GSEA on our differentially expressed genes. Unlike pathway enrichment analysis, GSEA provides the advantage of not needing to subset our gene list. We can provide all the genes that resulted from our differential expression analysis. Even genes that were not significantly differentially expressed. Offering popular women’s necklaces such as pendants, chokers and. Shop for jewelry in a variety of metals and gemstones to suit any occasion.

We just need to rank our genes first.

The way it works is, you need to give fgsea a named vector with the rankings and the genes symbols as names.

If you are not very clear on this, you might want to check out my post on GSEA – simply explained!



Squidtip

How do you rank your genes?

There are many ways of ranking your genes, so you need to decide based on your analysis. Here, I use the sign(df$log2fc)*(-log10(df$pval))) as a ranking metric.

This way, the magnitude of the ranking will be given by the p value (since I use -log10(), the more significant a gene is differentially expressed, the larger that value will be). The sign of the ranking (positive or negative) will be given by the fold-change: upregulated genes will have positive rankings, downregulated genes will have negative rankings.

rankings <- sign(df$log2fc)*(-log10(df$pval)) # we will use the signed p values from spatial DGE as ranking
names(rankings) <- df$gene_symbol # genes as names#

Easy! We’ve just created our ranked list of genes.

You can have a look at the values:

> head(rankings)
    MIR1302-2HG         FAM138A           OR4F5      AL627309.1      AL627309.3      AL627309.2 
 0.000061381930 -0.000005164293 -0.000005164293 -0.154354567687 -0.000005164293 -0.000005164293 
> rankings <- sort(rankings, decreasing = TRUE) # sort genes by ranking
> plot(rankings)

Yes, it’s an ugly plot, I know.

In the x axis, we see our genes (each dot represents a gene) – they are indexed so instead of gene symbols we just see numbers from 1 to 33538.

In the y axis, we see the ranking (sign(df$log2fc)*(-log10(df$pval)))) for each gene.



Squidtip

Another note on rankings… You can just use the log2(fold-change) as a ranking – but it will not take into account genes with a large fold change but non-significant. However, if you have already selected your significant genes, this may be a good option for you.

Most genes have a ranking value between -100 and 100.

A few ones seem to have higher rankings. Let’s have a closer look…

> max(rankings)
[1] Inf
> min(rankings)
[1] -Inf

It seems like some genes have a ranking of infinity. This is due to the fact that some p-values were really small, so computing -log10(p-value) returned infinity.

fgsea() does not accept non-finite numbers as rankings, so we need to fix that.

In this case, we will just convert them to really large (or small) values:

# Some genes have such low p values that the signed pval is +- inf, we need to change it to the maximum * constant to avoid problems with fgsea
max_ranking <- max(rankings[is.finite(rankings)])
min_ranking <- min(rankings[is.finite(rankings)])
rankings <- replace(rankings, rankings > max_ranking, max_ranking * 10)
rankings <- replace(rankings, rankings < min_ranking, min_ranking * 10)
rankings <- sort(rankings, decreasing = TRUE) # sort genes by ranking

By replacing the infinite values by ten times the maximum or minimum ranking, we avoid errors when running fgsea().

Now, if we plot the rankings:

> max(rankings)
[1] 2152.518
> min(rankings)
[1] -2206.971

You can also check the rankings of a specific set of genes and a nicer plot with ggplot.

Let’s have a look at the first 50 genes (if you don’t have many genes you can try with all but since I have 33538 I don’t want it to crash – again).

ggplot(data.frame(gene_symbol = names(rankings)[1:50], ranks = rankings[1:50]), aes(gene_symbol, ranks)) + 
	geom_point() +
	theme_classic() + 
	theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

3. Run GSEA

Wohoo! After all this preparation, we are ready to run GSEA with the fgsea() package! Excited?

To perform gene set enrichment analyis, you just need to call the function fgsea():

## 4. Run GSEA ---------------------------------------------------------------
# Easy peasy! Run fgsea with the pathways 
GSEAres <- fgsea(pathways = bg_genes, # List of gene sets to check
                 stats = rankings,
                 scoreType = 'std', # in this case we have both pos and neg rankings. if only pos or neg, set to 'pos', 'neg'
                 minSize = 10,
                 maxSize = 500,
                 nproc = 1) # for parallelisation

You might get a warning saying:

Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (48.8% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

As it states there, there are many genes with ‘tied’ scores or rankings. Genes with the same score will just be ordered at random by the algorithm, and might slightly affect your results when computing the Enrichment Score. This just means that if genes #15 # 16 and #17 have a ranking of ’10’, they will be ordered at random (#17 #15 #16).

It shouldn’t have a very big effect on your results. In my case, it mostly affects genes with a ranking near 0 (they were not significant genes).



Squidtip

About the scoreType… If you have only positive rankings, you can change this to ‘pos’. For example, if you are using the absolute value of logFC. It is not very common, but you can read more about use cases here.

If everything went well, you now have a GSEAres object in your R environment.

fgsea returns a dataframe with the names of your pathways or gene sets, p-values and p-adjusted values, the ES, NES, the size of the gene sets and the leadingEdge.

> head(GSEAres)
                                                                                pathway      pval      padj    log2err
1:                                           REACTOME_ABACAVIR_TRANSPORT_AND_METABOLISM 0.3633218 0.9085975 0.08862611
2:                                      REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT 0.4397394 0.9085975 0.11573445
3:                                       REACTOME_ABC_TRANSPORTERS_IN_LIPID_HOMEOSTASIS 0.9746835 0.9856000 0.05773085
4:                                                   REACTOME_ABC_TRANSPORTER_DISORDERS 0.3471338 0.9085975 0.13077714
5:            REACTOME_ABERRANT_REGULATION_OF_MITOTIC_EXIT_IN_CANCER_DUE_TO_RB1_DEFECTS 0.9820051 0.9882600 0.05809836
6: REACTOME_ABERRANT_REGULATION_OF_MITOTIC_G1_S_TRANSITION_IN_CANCER_DUE_TO_RB1_DEFECTS 0.5102564 0.9085975 0.09167952
           ES        NES size                             leadingEdge
1: -0.7980657 -1.1339616   10                                   ABCB1
2:  0.4951437  0.9782602  103  PSMB8,SEM1,PSMB9,PSMB3,PSMA6,PSME1,...
3:  0.3593687  0.5781693   18     PEX3,ABCG1,APOA1,ABCG8,ABCA3,ABCA10
4:  0.5572234  1.0724269   78  PSMB8,SEM1,PSMB9,PSMB3,PSMA6,PSME1,...
5:  0.3580019  0.5778174   20 ANAPC5,ANAPC2,CDC26,CDC27,UBE2C,RB1,...
6:  0.6152041  0.9756658   17                       CCND1,CCND3,CCNE1

4. Visualise your GSEA results

Let’s have a look at our top 6 enriched pathways, ordered by p-value or p-adjusted value.

## 6. Check results ------------------------------------------------------
# Top 6 enriched pathways (ordered by p-val)
head(GSEAres[order(pval), ])

How many significant pathways did we get?

> sum(GSEAres[, padj < 0.01])
[1] 0
> sum(GSEAres[, pval < 0.01])
[1] 8

Clearly, our results are not great. There are only 8 significant pathways (p-value < 0.05), but when correcting for multiple testing, we have no significant pathways (p-adjusted values < 0.05).

Oh, well. For the sake of this tutorial, let’s keep going! Hopefully your results will be prettier than mine…

Let’s plot our top pathways. I ordered them by p-value because my pathway enrichment analysis did not yield very interesting results… but you can change it to p-adjusted values if you would like to be more restrictive and correct for multiple testing…

topPathwaysUp <- GSEAres[ES > 0][head(order(pval), n = number_of_top_pathways_up), pathway]
topPathwaysDown <- GSEAres[ES < 0][head(order(pval), n = number_of_top_pathways_down), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
#pdf(file = paste0(filename, '_gsea_top30pathways.pdf'), width = 20, height = 15)
plotGseaTable(bg_genes[topPathways], stats = rankings, fgseaRes = GSEAres, gseaParam = 0.5)
#dev.off()

You can also select only independent pathways, removing redundancies or similar pathways with the function collapsePathways():

# Select only independent pathways, removing redundancies/similar pathways
collapsedPathways <- collapsePathways(GSEAres[order(pval)][pval < 0.05], bg_genes, rankings)
mainPathways <- GSEAres[pathway %in% collapsedPathways$mainPathways][order(-NES), pathway]
#pdf(file = paste0('GSEA/Selected_pathways/', paste0(filename, background_genes, '_gsea_mainpathways.pdf')), width = 20, height = 15)
plotGseaTable(bg_genes[mainPathways], rankings, GSEAres, gseaParam = 0.5)
#dev.off()



Squidtip

If you’d like to export the tables, just uncomment the 2 lines above. You can also export to .png, or other formats:

png(file = paste0(filename, ‘_gsea_mainpathways.png’), width = 1500, height = 800)
plotGseaTable(bg_genes[mainPathways], rankings, GSEAres, gseaParam = 0.5)
dev.off()

Now let’s plot the ES of the most significantly enriched pathway.

# plot the most significantly enriched pathway
plotEnrichment(bg_genes[[head(GSEAres[order(padj), ], 1)$pathway]],
               rankings) + 
  labs(title = head(GSEAres[order(padj), ], 1)$pathway)

You can also edit the plot and make it prettier;)

plotEnrichment(bg_genes[['REACTOME_FORMATION_OF_FIBRIN_CLOT_CLOTTING_CASCADE']],
               rankings) + 
  labs(title = 'Reactome pathway: Formation of fibrin clot / clotting cascade') + 
  theme_classic() +
  scale_x_continuous('Rank', breaks = seq(0, 32000, 5000)) +
  scale_y_continuous('Enrichment score (ES)') +
  geom_line(col = 'purple', linewidth = 2)

5. Save your results

Remember to save your results!

## 5. Save the results -----------------------------------------------
name_of_comparison <- 'severevshealthy'
background_genes <- 'gobp'
filename <- paste0(out_path, 'GSEA/', name_of_comparison, '_', background_genes) 
#saveRDS(GSEAres, file = paste0(filename, '_gsea_results.RDS'))
#data.table::fwrite(GSEAres, file = paste0(filename, '_gsea_results.tsv'), sep = "\t", sep2 = c("", " ", ""))

sessionInfo()

Check my sessionInfo() here in case you have trouble reproducting my steps:

> sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_Ireland.utf8  LC_CTYPE=English_Ireland.utf8   
[3] LC_MONETARY=English_Ireland.utf8 LC_NUMERIC=C                    
[5] LC_TIME=English_Ireland.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fgsea_1.24.0       RColorBrewer_1.1-3 lubridate_1.9.2    forcats_1.0.0     
 [5] stringr_1.5.0.9000 dplyr_1.1.2        purrr_1.0.1        readr_2.1.4       
 [9] tidyr_1.3.0        tibble_3.2.1       ggplot2_3.4.2      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.11         pillar_1.9.0        compiler_4.2.3      tools_4.2.3        
 [5] lattice_0.20-45     lifecycle_1.0.3     gtable_0.3.3        timechange_0.2.0   
 [9] pkgconfig_2.0.3     rlang_1.1.1         Matrix_1.5-3        fastmatch_1.1-3    
[13] cli_3.6.0           rstudioapi_0.14     parallel_4.2.3      withr_2.5.0        
[17] generics_0.1.3      vctrs_0.6.3         hms_1.1.3           cowplot_1.1.1      
[21] grid_4.2.3          tidyselect_1.2.0    data.table_1.14.8   glue_1.6.2         
[25] R6_2.5.1            fansi_1.0.4         BiocParallel_1.32.6 farver_2.1.1       
[29] tzdb_0.4.0          magrittr_2.0.3      codetools_0.2-19    scales_1.2.1       
[33] colorspace_2.1-0    labeling_0.4.2      utf8_1.2.3          stringi_1.7.12     
[37] munsell_0.5.0

And that is the end of this tutorial!

In this post, I explained how to perform functional enrichment analysis using clusterProfiler. Hope you found it useful!

Before you go, you might want to check:

Squidtastic!

You made it till the end! Hope you found this post useful.

If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.

Otherwise, have a very nice day and… see you in the next one!

Squids don't care much for coffee,

but Laura loves a hot cup in the morning!

If you like my content, you might consider buying me a coffee.

Get the squid a coffee

You can also leave a comment or a 'like' in my posts or Youtube channel, knowing that they're helpful really motivates me to keep going:)

Cheers and have a 'squidtastic' day!

28 Comments

Claude Pelletier on September 11, 2023 at 9:57 am

Great tutorial! Much appreciated!

in Python and/or R, how can we visualize timeseries (Day 1 to 5) of the difference of gene expression by outcome (Survivor / Non-Survivor)?
Reply
- Biostatsquid on September 12, 2023 at 12:15 pm
  
  Hi Claude, thanks for your feedback! That’s a really good question. I don’t have experience with timeseries data, but here are a few resources that may inspire you:
  https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1647-3 -> has some visualisation plots
  https://academic.oup.com/bib/article/20/1/288/4364840 -> has some bioinformatic tools
  https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Can_I_use_GSEA_to_analyze_time_series_data.3F –> just found this, might answer your question?
  Hope it helps!:)
  PS. If you are looking at a particular gene (or just a few of them), it might be as simple as plotting a lineplot grouped by outcome, with log2FC of the specific DEG in the y axis, time in the x axis, and you could also have the colour as a function of the p-values (or maybe if it is significantly differentially expressed at all timepoints, you don’t need this) – you could do this with ggplot quite easily.
  Reply
María on November 27, 2023 at 1:17 pm

great tutorial!! but I have a problem with the last line of the final graphic:

geom_line(col = ‘purple’, linewidth = 2), error in `compute_geom_1()`:
! `geom_line()` requires the following missing aesthetics: x and y
Reply
- Biostatsquid on December 8, 2023 at 12:12 pm
  
  Hi! Thanks so much for your comment. Hmm, that’s strange – did you specify x and y in the main function’s aesthetics? I mean when you call ggplot(… aes(x = , y = )).
  Otherwise, you can also specify them again in the geom_line() function. This is what I mean: https://community.rstudio.com/t/error-geom-point-requires-the-following-missing-aesthetics-x-y/26224
  Hope it helps!
  Reply
Kay on December 7, 2023 at 6:05 pm

Great tutorial ! Do you know of a citation for using a combination of pvalue and fold change as a ranking criteria ?
Reply
- Biostatsquid on December 8, 2023 at 12:19 pm
  
  Thank you for your comment! Yes, I’m pretty sure there’s more publications but here’s one where they propose the method I use in the tutorial: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957066/
  Cheers!
  Reply
Jess on December 21, 2023 at 7:15 pm

Hi, thank you for your tutorial! I got the following error while running the prepare_gmt function for bg_genes:

Error in colSums(mat[hidden1, ]) :
‘x’ must be an array of at least two dimensions
In addition: Warning message:
In readLines(gmt.file) :
incomplete final line found on ‘PEA/Background_genes//HALLMARK_NOTCH_SIGNALING.v2023.2.Hs.gmt’

Am I doing something wrong here?
Reply
- Biostatsquid on January 31, 2024 at 8:12 am
  
  Hi! Thanks for your comment – are you use that’s the path were the file is located?
  Try list.files(‘PEA/Background_genes/’) – does R return that file? If no, then you need to provide the folder/path were the file is. If yes, then maybe try checking if the file looks ok size-wise and if not try downloading the file again? Code seems to work for me!
  Reply
Julia on January 25, 2024 at 7:06 pm

Hi thanks for the tutorial! I’m not very experienced with R but I’m trying to perform this fgsea. I’m getting stuck on step 1.2 when it comes to reading the gmt file and then intersecting. Are the objects “gmt” and “genes_in_data” defined elsewhere?
Reply
- Biostatsquid on January 31, 2024 at 8:05 am
  
  Hey Julia! Thanks for your question. So the first step is to download a gmt file, which I explain in 1.1. Then you have to save the file(s) you download in whichever folder you want. In my case, I saved it in PEA/Background_Genes.
  You need to make sure R is able to read in this file, so make sure the folder is defined properly (you might need to give it the full path, e.g., ‘C:/Users/…). You can check if R has access to the folder with
  foldername <- 'C:/Path/to/my/folder' list.files(foldername) This should return all the files in your folder. About genes_in_data - so sorry! That was on my end. I copied the lines from the function I post a few lines below (prepare_gmt) and genes_in_data is the argument for that function. If you see at the very end of that snippet when I call the function, I pass my_genes as the argument genes_in_data. It's a bit confusing if you are not used to functions, but in summary, use my_genes. I will correct it in the snippet so it's not confusing - thanks for your comment!!
  Reply
Amit on February 17, 2024 at 6:52 pm

Hi,
I have facing issue in collapsedPathways function as follow :

> collapsedPathways <- collapsePathways(GSEAres[order(padj)][padj < 0.01], bg_genes, rankings)
Error in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
Not all stats values are finite numbers

could you please help me to sort this out ?
Reply
- Biostatsquid on March 9, 2024 at 6:38 am
  
  Hi – would need more info to reproduce the error but maybe this will help: https://www.biostars.org/p/9557480/
  Reply
mehrsam on February 18, 2024 at 5:02 pm

Hi.thank you for your great tutorial.
id did not underestand one thing.
how can i annotate most significant genes of every pathways in R(leading edge subset of genes)
Reply
- mehrsam on February 18, 2024 at 6:37 pm
  
  im so sorry.it was stored in the results.is did not see it at first 🙁
  Reply
Rafi on March 11, 2024 at 2:55 pm

Hello, i am having trouble to understand why don’t we have a .cls phenotype file in order to choose what we want to compare ? In my case i have 4 phenotypes and i want to compare two. How does this work?

Here are my file:
Gene expression data .gct/.txt
geneset .gmx/.gmt
Phenotype data .cls
pre-ranked genelist .rnk
Reply
- Biostatsquid on March 23, 2024 at 5:47 am
  
  Hi, thanks for your comment! Not sure I understand your question, but if there is only two groups you want to compare, you might want to filter them first.
  Reply
Maya on April 9, 2024 at 6:16 pm

Hello. Thanks for the great tutoring. I am running the code, but I am getting the bg_genes back as a list() – so it does not have any components to it. When I then run the fgsea, I am getting 0 observations. Do you have any comments on this? Thanks so much.
Reply
- Biostatsquid on April 13, 2024 at 11:39 pm
  
  Hi – thanks for your comment! You need to download the background gene set and read it into R – I explain it in the first steps of the tutorial. It seems like you are not reading this into R properly – maybe just a quick fix with your paths? To make sure that the data is there and R can “see” it – you can use list.files(/your/path/to/wherever/you/saved/the/background/files).
  Reply
Kay on April 10, 2024 at 4:42 pm

Hi, Great tutorial, thank you so much! I was wondering whether you have any comments on how to plot the results in a bar chart format? Is this possible?

Thank you!
Reply
- Biostatsquid on April 19, 2024 at 10:08 am
  
  Hi thanks for your comment, you’re very welcome:) For sure it’s possible – but what would you like to plot? Barplots are usually useful to show a numerical distribution of categorical data – for example, in the x axis you could have your top significant pathways, and in the y axis the gene count or ratio as bar heigh/colour? You have an example here, with clusterProfiler: https://biostatsquid.com/pathway-enrichment-analysis-plots/
  But you would have to build it with ggplot (really easily!) using geom_bar()
  Let me know if you think it would be useful to have a tutorial on this:)
  Reply
sam on April 23, 2024 at 11:30 am

Hii, im really stuck while doing the analysis of the plots

The plot that was obtained consist of all the erichment scores plotted with the gene rank for a single significantly enriched pathway ? But as the fgsea result consist of only a single enrichment score for a single pathway, the how come the plot has different peaks? Sorry im a bit confused.
Reply
- Biostatsquid on May 1, 2024 at 12:58 pm
  
  Hi, I’m not quite sure what you mean. If you are referring to the last plot, the x axis shows the genes, in order (ranked by sign(df$log2fc)*(-log10(df$pval)) ), the y axis shows the ES for each of the gene. This is, as you correctly pointed out, for a single pathway. You would be able to get a plot like this one for each of the pathways. Hopefully this helped! You might want to watch my youtube video on GSEA or read the blogpost https://biostatsquid.com/gene-set-enrichment-analysis/, it might be clearer then:)
  Reply
Young Hwa Kim on May 6, 2024 at 2:33 pm

Hi, I thoroughly enjoyed the video; it proved to be incredibly insightful. I am grateful for the valuable information it provided. Furthermore, I have a question to pose.

Below is the content I received this error message. How can I resolve it?

> plotEnrichment(bg_genes[[‘REACTOME_TRANSLATION’]],
+ rankings) +
+ labs(title = ‘Reactome pathway: REACTOME_TRANSLATION’) +
+ theme_classic() +
+ scale_x_continuous(‘Rank’, breaks = seq(0, 32000, 5000)) +
+ scale_y_continuous(‘Enrichment score (ES)’) +
+ geom_line(col = ‘purple’, linewidth = 2)

Error in `geom_line()`:
! Problem while setting up geom.
ℹ Error occurred in the 6th layer.
Caused by error in `compute_geom_1()`:
! `geom_line()` requires the following missing aesthetics: x and y.
Run `rlang::last_trace()` to see where the error occurred.
Reply
- Biostatsquid on May 26, 2024 at 12:15 am
  
  Hi! Thanks for your comment! Hmm, not sure why it’s not working for you, looks like you need to specify the x and y variables inside aes() – could you check that you are using the same versions of the packages I use? Otherwise just delete the geom_line() line and it should work without colouring it purple:)
  Reply
Leima on May 17, 2024 at 11:47 am

Hi,

great tutorial. I want to do fgsea, but from the resulting gene list is it possible to separate the upregulated and downregulated genes/pathways? I’m confused how to do this, would it be by separating using the normalised enrichment score, so if its less than 0 this would indicate genes in that pathway have been down regulated and if larger, these are pathways with genes that are upregulated? I intend to make a 3 way venn diagram comparing the number of up and downregulated pathways across 3 different groups, but wasn’t sure the most efficient way to do so. Unless pathway enrichment is easier?

Thanks,
Reply
- Biostatsquid on May 26, 2024 at 12:06 am
  
  Hi, thanks for your comment! Not sure what you mean – could you explain further? One thing is upregulated/downregulated pathways, another is genes.
  But, if I get your question right, I would do dge analysis on the 3 groups separately (vs a control group?) – then do fgsea on the 3 gene lists separately.
  Reply
Thu Le on June 13, 2024 at 12:52 pm

Hello, thank you for your tutorial. I am having trouble to find out gene sets/pathway for rat. The page you suggested have only human and mouse. Do you know else where I can find that file (ie. get) for other species (especially Rat in my case)?
Many thanks
Reply
- Biostatsquid on June 16, 2024 at 12:24 am
  
  Hi! Thanks so much for your comment. I’ve never worked with rat samples but here are some interesting papers/tools that may be useful:
  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8982048/
  https://www.genome.jp/kegg-bin/show_organism?menu_type=pathway_maps&org=rno
  https://rgd.mcw.edu/GO/
  https://reactome.org/content/schema/instance/browser/48895
  Good luck!
  Reply

Easy Gene Set Enrichment Analysis in R with fgsea()

Before you start…

Install fgsea()

Set up your environment

What do I need to run fgsea?

1. Prepare your gene sets / pathways

Squidtip

2. Prepare your ranked list of genes

Squidtip

Squidtip

3. Run GSEA

Squidtip

4. Visualise your GSEA results

Squidtip

5. Save your results

sessionInfo()

Squidtastic!

28 Comments

Submit a Comment Cancel reply

Recent posts

Easy Gene Set Enrichment Analysis in R with fgsea()

Before you start…

Install fgsea()

Set up your environment

What do I need to run fgsea?

1. Prepare your gene sets / pathways

Squidtip

2. Prepare your ranked list of genes

Squidtip

Squidtip

3. Run GSEA

Squidtip

4. Visualise your GSEA results

Squidtip

5. Save your results

sessionInfo()

Squidtastic!

28 Comments

Submit a Comment Cancel reply

Popular Posts

Recent posts