/**
* jQuery Plugin: Sticky Tabs
*
* @author Aidan Lister
// Set the correct tab when the page loads showStuffFromHash(context);
// Set the correct tab when a user uses their back/forward button $(window).on('hashchange', function() { showStuffFromHash(context); });
// Change the URL when tabs are clicked $('a', context).on('click', function(e) { history.pushState(null, null, this.href); showStuffFromHash(context); });
return this; }; }(jQuery));
window.buildTabsets = function(tocID) {
// build a tabset from a section div with the .tabset class function buildTabset(tabset) {
// check for fade and pills options var fade = tabset.hasClass("tabset-fade"); var pills = tabset.hasClass("tabset-pills"); var navClass = pills ? "nav-pills" : "nav-tabs";
// determine the heading level of the tabset and tabs var match = tabset.attr('class').match(/level(\d) /); if (match === null) return; var tabsetLevel = Number(match[1]); var tabLevel = tabsetLevel + 1;
// find all subheadings immediately below var tabs = tabset.find("div.section.level" + tabLevel); if (!tabs.length) return;
// create tablist and tab-content elements var tabList = $('
'); $(tabs[0]).before(tabList); var tabContent = $('
'); $(tabs[0]).before(tabContent);
// build the tabset var activeTab = 0; tabs.each(function(i) {
// get the tab div var tab = $(tabs[i]);
// get the id then sanitize it for use with bootstrap tabs var id = tab.attr('id');
// see if this is marked as the active tab if (tab.hasClass('active')) activeTab = i;
// remove any table of contents entries associated with // this ID (since we'll be removing the heading element) $("div#" + tocID + " li a[href='#" + id + "']").parent().remove();
// sanitize the id for use with bootstrap tabs id = id.replace(/[.\/?&!#<>]/g, '').replace(/\s/g, '_'); tab.attr('id', id);
// get the heading element within it, grab it's text, then remove it var heading = tab.find('h' + tabLevel + ':first'); var headingText = heading.html(); heading.remove();
// build and append the tab list item var a = $('' + headingText + ''); a.attr('href', '#' + id); a.attr('aria-controls', id); var li = $('
'); li.append(a); tabList.append(li);
// set it's attributes tab.attr('role', 'tabpanel'); tab.addClass('tab-pane'); tab.addClass('tabbed-pane'); if (fade) tab.addClass('fade');
// move it into the tab content div tab.detach().appendTo(tabContent); });
// set active tab $(tabList.children('li')[activeTab]).addClass('active'); var active = $(tabContent.children('div.section')[activeTab]); active.addClass('active'); if (fade) active.addClass('in');
if (tabset.hasClass("tabset-sticky")) tabset.rmarkdownStickyTabs(); }
// convert section divs with the .tabset class to tabsets var tabsets = $("div.section.tabset"); tabsets.each(function(i) { buildTabset($(tabsets[i])); }); };
MSigDB gene sets: easy msigdbr in R
Welcome to this comprehensive guide on MSigDB (Molecular Signatures Database) and the msigdbr R package! If you’ve ever wondered which gene sets to use for your pathway enrichment analysis, or felt overwhelmed by the sheer number of options available, this tutorial is for you.
What is MSigDB? The Molecular Signatures Database is a collection of annotated gene sets curated by the Broad Institute. It contains thousands of gene sets organized into different categories (called “collections”), each designed for specific types of biological analyses.
What does msigdbr do? The msigdbr R package provides easy access to MSigDB gene sets directly from R, eliminating the need to manually download .gmt files. Even better - it supports multiple species!
In this blogpost we will learn:
- The different MSigDB collections and when to use each
- How to access gene sets using the
msigdbrpackage - Practical examples of filtering and preparing gene sets
- How to integrate msigdbr with pathway enrichment tools like clusterProfiler or fgsea
If you are a beginner in R, don’t be overwhelmed! This tutorial will go step-by-step and I will explain (almost!) every line of code so you know what is happening at each point of the workflow. Just have fun with it!
So if you are ready… let’s dive in!
Understanding MSigDB Collections
As I mentioned, the Molecular Signatures Database (MSigDB) is a collection of annotated gene sets curated by the Broad Institute. It contains thousands of gene sets organized into different categories (called “collections”), each designed for specific types of biological analyses. You can read more about it here.
MSigDB organizes gene sets into 9 major collections, each with a specific focus. Let’s explore each one to help you choose the right collection for your analysis.
You can access MSigDB’s main page here. On your left, you will see “Human Collections” and “Mouse Collections”. You can read more about them here:
Nice! If you explore the database, you’ll see how you can access the lists of gene sets. There’s information on what kind of gene sets they are, the genes they include, and if they have been curated or not. You can download them manually (as .gmt files) or get the gene IDs. If you’re new to MSigDB you may need to register before downloading (it just needs a registration email).
Another way of accessing MSigDB gene sets (much easier than downloading manually if you ask me!), is to use the R package msigdbr. This is particularly useful if you are going to use different gene sets for pathway enrichment analysis or gene set enrichment analysis (GSEA).
So I’ll combine both the R tutorial and an explanation of the different gene sets in MSigDB bellow. If you are not interested in the coding parts, just skip the code snippets!
Installation
Before we begin, you’ll need to install the necessary packages:
# Install msigdbr from CRAN
install.packages('msigdbr')
# For data manipulation and visualization
install.packages('tidyverse')
install.packages('DT')
# For pathway enrichment analysis examples
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("clusterProfiler")
Ready to dive into the world of gene sets? Let’s go!
Setting up our environment
Let’s clean our environment first, load the relevant libraries, and start by checking the available species in msigdbr.
# Clean environment
rm(list = ls(all.names = TRUE))
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 621289 33.2 1418377 75.8 705659 37.7
## Vcells 1159363 8.9 8388608 64.0 1876567 14.4
# Load libraries
suppressPackageStartupMessages(library(tidyverse, quiet = T))
suppressPackageStartupMessages(library(msigdbr, quiet = T))
## Warning: package 'msigdbr' was built under R version 4.4.3
suppressPackageStartupMessages(library(DT, quiet = T))
## Warning: package 'DT' was built under R version 4.4.3
# Check available species
available_species <- msigdbr_species()
print(head(available_species))
## # A tibble: 6 × 2
## species_name species_common_name
## <chr> <chr>
## 1 Anolis carolinensis Carolina anole, green anole
## 2 Bos taurus bovine, cattle, cow, dairy cow, domestic cattle, dome…
## 3 Caenorhabditis elegans <NA>
## 4 Canis lupus familiaris dog, dogs
## 5 Danio rerio leopard danio, zebra danio, zebra fish, zebrafish
## 6 Drosophila melanogaster fruit fly
print(available_species$species_name)
## [1] "Anolis carolinensis" "Bos taurus"
## [3] "Caenorhabditis elegans" "Canis lupus familiaris"
## [5] "Danio rerio" "Drosophila melanogaster"
## [7] "Equus caballus" "Felis catus"
## [9] "Gallus gallus" "Homo sapiens"
## [11] "Macaca mulatta" "Monodelphis domestica"
## [13] "Mus musculus" "Ornithorhynchus anatinus"
## [15] "Pan troglodytes" "Rattus norvegicus"
## [17] "Saccharomyces cerevisiae" "Schizosaccharomyces pombe 972h-"
## [19] "Sus scrofa" "Xenopus tropicalis"
As you can see, there’s more than just human and mouse!
How is MSigDB structured? Collections and subcollections.
Before we dive in, it’s crucial to understand the structure of MSigDB:
- Collections are the top-level categories (like main folders). There are 9 major collections labeled C1–C8 and H.
- Subcollections are subdivisions within certain collections (like subfolders). Not all collections have subcollections — some stand alone!
Let’s have a look at msigdbr’s collections:
# Get information about all collections
collections <- msigdbr_collections()
# Display as a nice table
head(collections)
## # A tibble: 6 × 4
## gs_collection gs_subcollection gs_collection_name num_genesets
## <chr> <chr> <chr> <int>
## 1 C1 "" Positional 302
## 2 C2 "CGP" Chemical and Genetic Perturbatio… 3538
## 3 C2 "CP" Canonical Pathways 19
## 4 C2 "CP:BIOCARTA" BioCarta Pathways 292
## 5 C2 "CP:KEGG_LEGACY" KEGG Legacy Pathways 186
## 6 C2 "CP:KEGG_MEDICUS" KEGG Medicus Pathways 658
# datatable(collections,
# options = list(pageLength = 10, scrollX = TRUE),
# caption = "MSigDB Collections Overview")
Nice! If we want to fetch data from a particular collection, we just need to run the following line:
gene_sets_df <- msigdbr(species = 'Homo sapiens', collection = 'H')
If we want a specific subcollection, then we can get it with:
gene_sets_df <- msigdbr(species = 'Homo sapiens', collection = 'C5', subcollection = 'GO:BP')
SquidTip! When you specify only a collection without subcollection, you get ALL gene sets in that collection (including all subcollections). When you specify both, you get only that specific subset!
Nice! Let’s talk about some of the most widely used collections.
H: Hallmark Gene Sets
What are they? Hallmark gene sets represent well-defined biological states or processes. They were created by aggregating multiple MSigDB gene sets to reduce redundancy and improve interpretability.
When to use: These are perfect for getting a high-level overview of biological processes in your data. Great for initial exploratory analysis!
Number of gene sets: 50
Let’s see an example in R:
# Get Hallmark gene sets for humans
hallmark <- msigdbr(species = "Homo sapiens", collection = "H")
cat("Number of Hallmark gene sets:", length(unique(hallmark$gs_name)), "\n")
## Number of Hallmark gene sets: 50
cat("Total genes across all Hallmark sets:", nrow(hallmark), "\n\n")
## Total genes across all Hallmark sets: 7333
# View first few rows
hallmark %>%
select(gs_name, gene_symbol, ensembl_gene) %>%
head(10)
## # A tibble: 10 × 3
## gs_name gene_symbol ensembl_gene
## <chr> <chr> <chr>
## 1 HALLMARK_ADIPOGENESIS ABCA1 ENSG00000165029
## 2 HALLMARK_ADIPOGENESIS ABCB8 ENSG00000197150
## 3 HALLMARK_ADIPOGENESIS ACAA2 ENSG00000167315
## 4 HALLMARK_ADIPOGENESIS ACADL ENSG00000115361
## 5 HALLMARK_ADIPOGENESIS ACADM ENSG00000117054
## 6 HALLMARK_ADIPOGENESIS ACADS ENSG00000122971
## 7 HALLMARK_ADIPOGENESIS ACLY ENSG00000131473
## 8 HALLMARK_ADIPOGENESIS ACO2 ENSG00000100412
## 9 HALLMARK_ADIPOGENESIS ACOX1 ENSG00000161533
## 10 HALLMARK_ADIPOGENESIS ADCY6 ENSG00000174233
# Summary of genes per Hallmark pathway
hallmark_summary <- hallmark %>%
group_by(gs_name) %>%
summarise(n_genes = n()) %>%
arrange(desc(n_genes))
# Plot
ggplot(hallmark_summary, aes(x = reorder(gs_name, n_genes), y = n_genes)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Number of Genes in Each Hallmark Gene Set",
x = "Gene Set",
y = "Number of Genes") +
theme_bw(base_size = 10)
SquidTip! Start with Hallmark gene sets if you’re new to pathway enrichment analysis. They’re well-curated, non-redundant, and easy to interpret!
C1: Positional Gene Sets
What are they? Gene sets based on chromosomal location. These include genes located in specific cytogenetic bands.
When to use: Useful for identifying regional chromosomal amplifications or deletions, or when studying copy number variations.
Number of gene sets: ~300
# Get C1 positional gene sets
c1_sets <- msigdbr(species = "Homo sapiens", collection = "C1")
cat("Number of C1 gene sets:", length(unique(c1_sets$gs_name)), "\n")
# Example: genes on chromosome 10
chr1_sets <- c1_sets %>%
filter(grepl("chr10q|chr10p", gs_name)) %>%
select(gs_name, gene_symbol) %>%
distinct()
head(chr1_sets, 10)
C2: Curated Gene Sets
What are they? This is the largest collection! It includes curated gene sets from online pathway databases, publications, and knowledge of domain experts.
Subcategories:
- CGP (Chemical and Genetic Perturbations): Gene sets from chemical/genetic perturbation experiments
-
CP (Canonical Pathways): Well-defined biochemical and metabolic pathways
- CP:BIOCARTA: Gene sets from BioCarta pathways
- CP:KEGG: Gene sets from KEGG pathways
- CP:PID: Gene sets from PID pathways
- CP:REACTOME: Gene sets from Reactome pathways
- CP:WIKIPATHWAYS: Gene sets from WikiPathways
When to use: Use C2:CP (canonical pathways) for most standard pathway enrichment analyses. KEGG and Reactome are particularly popular choices.
# Get KEGG pathways
kegg_sets <- msigdbr(species = "human",
collection = "C2",
subcollection = "CP:KEGG_MEDICUS")
cat("Number of KEGG gene sets:", length(unique(kegg_sets$gs_name)), "\n\n")
## Number of KEGG gene sets: 658
# Get Reactome pathways
reactome_sets <- msigdbr(species = "human",
collection = "C2",
subcollection = "CP:REACTOME")
cat("Number of Reactome gene sets:", length(unique(reactome_sets$gs_name)), "\n")
## Number of Reactome gene sets: 1787
SquidTip! KEGG pathways are great for metabolic processes, while Reactome provides more detailed molecular mechanisms!
C5: Ontology Gene Sets (GO)
What are they? Gene sets derived from Gene Ontology (GO) terms. This is one of the most popular collections!
Subcategories:
- GO:BP – Biological Process
- GO:CC – Cellular Component
- GO:MF – Molecular Function
- HPO – Human Phenotype Ontology
When to use: GO terms are essential for functional annotation. Use GO:BP for processes, GO:CC for localization, and GO:MF for molecular activities.
# Get GO Biological Process terms
go_bp <- msigdbr(species = "Homo sapiens",
collection = "C5",
subcollection = "GO:BP")
cat("Number of GO:BP gene sets:", length(unique(go_bp$gs_name)), "\n\n")
## Number of GO:BP gene sets: 7583
# Get GO Cellular Component terms
go_cc <- msigdbr(species = "Homo sapiens",
collection = "C5",
subcollection = "GO:CC")
cat("Number of GO:CC gene sets:", length(unique(go_cc$gs_name)), "\n\n")
## Number of GO:CC gene sets: 1042
# Get GO Molecular Function terms
go_mf <- msigdbr(species = "Homo sapiens",
collection = "C5",
subcollection = "GO:MF")
cat("Number of GO:MF gene sets:", length(unique(go_mf$gs_name)), "\n")
## Number of GO:MF gene sets: 1855
We’ve only covered some gene sets here, but you can also check out other MSigDB collections like C6 (Oncogenic Signature Gene Sets), C7 (Immunologic Signature Gene Sets) and C8 (Cell Type Signature Gene Sets).
Practical Guide: Using msigdbr in R
Now that we understand the different collections, let’s see how to use msigdbr in practice!
Basic Usage
We’ve already covered some of the basics already. To fetch a specific collection, just call msigdbr():
# Basic syntax: Get gene sets for a specific species and collection
gene_sets <- msigdbr(species = "Homo sapiens", collection = "H")
# View the structure
head(gene_sets)
## # A tibble: 6 × 20
## gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ABCA1 19 ENSG0000016… ABCA1 19 ENSG00000165029
## 2 ABCB8 11194 ENSG0000019… ABCB8 11194 ENSG00000197150
## 3 ACAA2 10449 ENSG0000016… ACAA2 10449 ENSG00000167315
## 4 ACADL 33 ENSG0000011… ACADL 33 ENSG00000115361
## 5 ACADM 34 ENSG0000011… ACADM 34 ENSG00000117054
## 6 ACADS 35 ENSG0000012… ACADS 35 ENSG00000122971
## # ℹ 14 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
## # gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
## # gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
## # gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, db_version <chr>,
## # db_target_species <chr>
The output is a table which has information about the gene sets and genes. It obviously depends on what you want to do, but I find gene_symbol (the symbol for each gene, use ensembl_id if you work with ENSEMBL IDs) and gs_name (gene set name) the most useful.
Be aware that one gene can be part of many different gene sets:
gene_sets[gene_sets$gene_symbol == 'ABCB1', c('gene_symbol', 'gs_name')]
## # A tibble: 4 × 2
## gene_symbol gs_name
## <chr> <chr>
## 1 ABCB1 HALLMARK_IL2_STAT5_SIGNALING
## 2 ABCB1 HALLMARK_KRAS_SIGNALING_UP
## 3 ABCB1 HALLMARK_PEROXISOME
## 4 ABCB1 HALLMARK_UV_RESPONSE_UP
Working with Different Species
Nice! As we already saw, one of the best features of msigdbr is multi-species support! You can specify the species with the argument species.
# Get available species
species_df <- msigdbr_species()
# datatable(species_df, options = list(pageLength = 10))
head(species_df)
## # A tibble: 6 × 2
## species_name species_common_name
## <chr> <chr>
## 1 Anolis carolinensis Carolina anole, green anole
## 2 Bos taurus bovine, cattle, cow, dairy cow, domestic cattle, dome…
## 3 Caenorhabditis elegans <NA>
## 4 Canis lupus familiaris dog, dogs
## 5 Danio rerio leopard danio, zebra danio, zebra fish, zebrafish
## 6 Drosophila melanogaster fruit fly
# Example: Get mouse Hallmark gene sets
mouse_hallmark <- msigdbr(species = "Mus musculus", collection = "H")
## Using human MSigDB with ortholog mapping to mouse. Use `db_species = "MM"` for mouse-native gene sets.
## This message is displayed once per session.
cat("Number of genes in human Hallmark:", nrow(hallmark), "\n")
## Number of genes in human Hallmark: 7333
cat("Number of genes in mouse Hallmark:", nrow(mouse_hallmark), "\n")
## Number of genes in mouse Hallmark: 7379
Filtering and Preparing Gene Sets
You may need your gene sets in different formats depending on the tool you’d like to use. A popular choice for pathway enrichment analysis in clusterProfiler, for which I have an blogpost here.
I’ll briefly cover how to format your gene sets for pathway enrichment analysis with clusterProfiler!
Format for clusterProfiler
# Convert to the format needed by clusterProfiler
# We need two columns: gs_name (pathway) and gene_symbol (gene)
gene_sets_for_cp <- hallmark %>%
select(gs_name, gene_symbol)
head(gene_sets_for_cp)
## # A tibble: 6 × 2
## gs_name gene_symbol
## <chr> <chr>
## 1 HALLMARK_ADIPOGENESIS ABCA1
## 2 HALLMARK_ADIPOGENESIS ABCB8
## 3 HALLMARK_ADIPOGENESIS ACAA2
## 4 HALLMARK_ADIPOGENESIS ACADL
## 5 HALLMARK_ADIPOGENESIS ACADM
## 6 HALLMARK_ADIPOGENESIS ACADS
Filtering by Gene Set Size
Sometimes you want to exclude very small or very large gene sets:
# Calculate gene set sizes
pathway_sizes <- gene_sets_for_cp %>%
group_by(gs_name) %>%
summarise(n_genes = n())
# Filter pathways with 15-500 genes
filtered_pathways <- pathway_sizes %>%
filter(n_genes >= 15 & n_genes <= 500)
gene_sets_filtered <- gene_sets_for_cp %>%
filter(gs_name %in% filtered_pathways$gs_name)
cat("Original number of pathways:", length(unique(gene_sets_for_cp$gs_name)), "\n")
## Original number of pathways: 50
cat("Filtered number of pathways:", length(unique(gene_sets_filtered$gs_name)), "\n")
## Filtered number of pathways: 50
But you can also filter for gene size when using clusterProfiler’s functions, so I wouldn’t recommend filtering at this stage.
Nice! Let’s put it all together with a complete example!
msigdbr and clusterProfiler minitutorial
As you can see, I’m using the hallmark dataframe I downloaded using msigdbr and then just getting some “pretend” DGEs (differentially expressed genes) to run pathway enrichment analysis on them.
I’m using clusterProfiler’s enricher function to carry out pathway enrichment analysis. You can read more about it in the clusterProfiler blogpost, but I’m using the defaults. Depending on what you want to do; you may want to edit the filters. You can read more about the different arguments by running ?enricher.
TERM2GENE is the argument we’re interested in: we’ll pass on the hallmark gene sets we downloaded from msigdbr. Easy!
suppressPackageStartupMessages(library(clusterProfiler))
# Simulate some differentially expressed genes
set.seed(123)
all_genes <- unique(hallmark$gene_symbol)
deg_genes <- sample(all_genes, 200)
cat("Total genes in Hallmark collection:", length(all_genes), "\n")
## Total genes in Hallmark collection: 4384
cat("Number of DE genes for testing:", length(deg_genes), "\n\n")
## Number of DE genes for testing: 200
# Prepare gene sets
gene_sets_for_enrichment <- hallmark %>%
select(gs_name, gene_symbol)
# Run enrichment analysis
enrichment_result <- enricher(
gene = deg_genes,
TERM2GENE = gene_sets_for_enrichment,
universe = all_genes,
pvalueCutoff = 0.05,
qvalueCutoff = 0.2,
minGSSize = 10,
maxGSSize = 500
)
# View results
cat("Found", nrow(enrichment_result@result), "enriched pathways!\n\n")
## Found 49 enriched pathways!
enrichment_result@result %>%
arrange(p.adjust) %>%
select(ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, Count) %>%
head(10) %>%
print()
## ID
## HALLMARK_IL2_STAT5_SIGNALING HALLMARK_IL2_STAT5_SIGNALING
## HALLMARK_MITOTIC_SPINDLE HALLMARK_MITOTIC_SPINDLE
## HALLMARK_CHOLESTEROL_HOMEOSTASIS HALLMARK_CHOLESTEROL_HOMEOSTASIS
## HALLMARK_MYC_TARGETS_V2 HALLMARK_MYC_TARGETS_V2
## HALLMARK_G2M_CHECKPOINT HALLMARK_G2M_CHECKPOINT
## HALLMARK_KRAS_SIGNALING_DN HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_UV_RESPONSE_UP HALLMARK_UV_RESPONSE_UP
## HALLMARK_ESTROGEN_RESPONSE_EARLY HALLMARK_ESTROGEN_RESPONSE_EARLY
## HALLMARK_SPERMATOGENESIS HALLMARK_SPERMATOGENESIS
## HALLMARK_PANCREAS_BETA_CELLS HALLMARK_PANCREAS_BETA_CELLS
## Description GeneRatio
## HALLMARK_IL2_STAT5_SIGNALING HALLMARK_IL2_STAT5_SIGNALING 13/200
## HALLMARK_MITOTIC_SPINDLE HALLMARK_MITOTIC_SPINDLE 13/200
## HALLMARK_CHOLESTEROL_HOMEOSTASIS HALLMARK_CHOLESTEROL_HOMEOSTASIS 6/200
## HALLMARK_MYC_TARGETS_V2 HALLMARK_MYC_TARGETS_V2 5/200
## HALLMARK_G2M_CHECKPOINT HALLMARK_G2M_CHECKPOINT 13/200
## HALLMARK_KRAS_SIGNALING_DN HALLMARK_KRAS_SIGNALING_DN 13/200
## HALLMARK_UV_RESPONSE_UP HALLMARK_UV_RESPONSE_UP 10/200
## HALLMARK_ESTROGEN_RESPONSE_EARLY HALLMARK_ESTROGEN_RESPONSE_EARLY 12/200
## HALLMARK_SPERMATOGENESIS HALLMARK_SPERMATOGENESIS 8/200
## HALLMARK_PANCREAS_BETA_CELLS HALLMARK_PANCREAS_BETA_CELLS 3/200
## BgRatio pvalue p.adjust Count
## HALLMARK_IL2_STAT5_SIGNALING 199/4384 0.1197624 0.9314402 13
## HALLMARK_MITOTIC_SPINDLE 199/4384 0.1197624 0.9314402 13
## HALLMARK_CHOLESTEROL_HOMEOSTASIS 74/4384 0.1198720 0.9314402 6
## HALLMARK_MYC_TARGETS_V2 58/4384 0.1231679 0.9314402 5
## HALLMARK_G2M_CHECKPOINT 200/4384 0.1231687 0.9314402 13
## HALLMARK_KRAS_SIGNALING_DN 200/4384 0.1231687 0.9314402 13
## HALLMARK_UV_RESPONSE_UP 158/4384 0.1829955 0.9314402 10
## HALLMARK_ESTROGEN_RESPONSE_EARLY 200/4384 0.2002102 0.9314402 12
## HALLMARK_SPERMATOGENESIS 135/4384 0.2733060 0.9314402 8
## HALLMARK_PANCREAS_BETA_CELLS 40/4384 0.2745888 0.9314402 3
You can also use msigdbr to fetch gene sets for other tools, such as GSEA.
Best Practices and Tips
Nice! We covered the basics on the MSigDB database of gene sets and how to use it using msigdbr. Just leaving a few more tips which I hope will help you when deciding on which gene sets to use!
Choosing the Right Collection
This, of course, depends on what you’re looking for. I find Hallmark, Reactome and GO:BP the most useful for a quick overview, but if I’m doing some specific analysis (i.e., cancer), I might explore other pathways.
Here’s a quick decision guide:
- Start with Hallmark (H) - Best for exploratory analysis and high-level overview
- Use C2:CP for detailed pathways
- KEGG: Metabolic and signaling pathways
- Reactome: Detailed molecular mechanisms
- Use C5:GO:BP for functional processes - Most comprehensive functional annotation
- Use C7 for immune studies - Specialized immune cell signatures
- Use C8 for cell type identification - Single-cell analysis
Gene ID Considerations
Make sure your gene identifiers match! If you have Ensembl IDs, use ensembl_gene. If you have symbols, use gene_symbol.
Saving Your Gene Sets
Of course, you can easily save your gene sets, as .rds objects or a .csv / .xlsx table… If you download them from the website directly, they’ll be in .gmt format, which you can also read into R.
# Save as RDS for R
saveRDS(gene_sets_for_cp, "hallmark_genesets.rds")
# Save as CSV
write.csv(gene_sets_for_cp, "hallmark_genesets.csv", row.names = FALSE)
Nice! And that’s the end of this tutorial.
In this post, we covered MSigDB gene sets and how to use the msigdbr package to fetch different gene set collections. Hope you found it useful!
As you saw, there’s many different gene sets to choose from, and the msigdbr package makes it really easy to access gene sets in R. You can choose from different collections (and subcollections) and integrate this with pathway enrichment analysis tools such as clusterProfiler. Hope this makes your life easier, it definitely did mine! (I was once painstakingly downloading each gene set .gmt file individually…)
Thanks for reading, hope this was helpful and a clear explanation on how to use msigdbr in R! If you have questions or suggestions, feel free to reach out. Happy analyzing!
Additional resources
You might be interested in…
- MSigDB website
- msigdbr documentation
- clusterProfiler book
- Biostatsquid’s GSEA tutorial with fgsea
- Biostatsquid’s PEA tutorial with clusterProfiler
Squidtastic! You made it till the end! Hope you found this post useful.
If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.
Otherwise, have a very nice day and… see you in the next one!
Before you go, you might want to check:
// add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.odd').parent('tbody').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); });