MSigDB gene sets: easy msigdbr in R

Welcome to this comprehensive guide on MSigDB (Molecular Signatures Database) and the msigdbr R package! If you’ve ever wondered which gene sets to use for your pathway enrichment analysis, or felt overwhelmed by the sheer number of options available, this tutorial is for you.

What is MSigDB? The Molecular Signatures Database is a collection of annotated gene sets curated by the Broad Institute. It contains thousands of gene sets organized into different categories (called “collections”), each designed for specific types of biological analyses.

What does msigdbr do? The msigdbr R package provides easy access to MSigDB gene sets directly from R, eliminating the need to manually download .gmt files. Even better - it supports multiple species!

In this blogpost we will learn:

The different MSigDB collections and when to use each
How to access gene sets using the msigdbr package
Practical examples of filtering and preparing gene sets
How to integrate msigdbr with pathway enrichment tools like clusterProfiler or fgsea

If you are a beginner in R, don’t be overwhelmed! This tutorial will go step-by-step and I will explain (almost!) every line of code so you know what is happening at each point of the workflow. Just have fun with it!

So if you are ready… let’s dive in!

Understanding MSigDB Collections

As I mentioned, the Molecular Signatures Database (MSigDB) is a collection of annotated gene sets curated by the Broad Institute. It contains thousands of gene sets organized into different categories (called “collections”), each designed for specific types of biological analyses. You can read more about it here.

MSigDB organizes gene sets into 9 major collections, each with a specific focus. Let’s explore each one to help you choose the right collection for your analysis.

You can access MSigDB’s main page here. On your left, you will see “Human Collections” and “Mouse Collections”. You can read more about them here:

Nice! If you explore the database, you’ll see how you can access the lists of gene sets. There’s information on what kind of gene sets they are, the genes they include, and if they have been curated or not. You can download them manually (as .gmt files) or get the gene IDs. If you’re new to MSigDB you may need to register before downloading (it just needs a registration email).

Another way of accessing MSigDB gene sets (much easier than downloading manually if you ask me!), is to use the R package msigdbr. This is particularly useful if you are going to use different gene sets for pathway enrichment analysis or gene set enrichment analysis (GSEA).

So I’ll combine both the R tutorial and an explanation of the different gene sets in MSigDB bellow. If you are not interested in the coding parts, just skip the code snippets!

Installation

Before we begin, you’ll need to install the necessary packages:

# Install msigdbr from CRAN
install.packages('msigdbr')

# For data manipulation and visualization
install.packages('tidyverse')
install.packages('DT')

# For pathway enrichment analysis examples
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("clusterProfiler")

Ready to dive into the world of gene sets? Let’s go!

Setting up our environment

Let’s clean our environment first, load the relevant libraries, and start by checking the available species in msigdbr.

# Clean environment
rm(list = ls(all.names = TRUE))
gc()

##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  621289 33.2    1418377 75.8   705659 37.7
## Vcells 1159363  8.9    8388608 64.0  1876567 14.4

# Load libraries
suppressPackageStartupMessages(library(tidyverse, quiet = T))
suppressPackageStartupMessages(library(msigdbr, quiet = T))

## Warning: package 'msigdbr' was built under R version 4.4.3

suppressPackageStartupMessages(library(DT, quiet = T))

## Warning: package 'DT' was built under R version 4.4.3

# Check available species
available_species <- msigdbr_species()
print(head(available_species))

## # A tibble: 6 × 2
##   species_name            species_common_name                                   
##   <chr>                   <chr>                                                 
## 1 Anolis carolinensis     Carolina anole, green anole                           
## 2 Bos taurus              bovine, cattle, cow, dairy cow, domestic cattle, dome…
## 3 Caenorhabditis elegans  <NA>                                                  
## 4 Canis lupus familiaris  dog, dogs                                             
## 5 Danio rerio             leopard danio, zebra danio, zebra fish, zebrafish     
## 6 Drosophila melanogaster fruit fly

print(available_species$species_name)

##  [1] "Anolis carolinensis"             "Bos taurus"                     
##  [3] "Caenorhabditis elegans"          "Canis lupus familiaris"         
##  [5] "Danio rerio"                     "Drosophila melanogaster"        
##  [7] "Equus caballus"                  "Felis catus"                    
##  [9] "Gallus gallus"                   "Homo sapiens"                   
## [11] "Macaca mulatta"                  "Monodelphis domestica"          
## [13] "Mus musculus"                    "Ornithorhynchus anatinus"       
## [15] "Pan troglodytes"                 "Rattus norvegicus"              
## [17] "Saccharomyces cerevisiae"        "Schizosaccharomyces pombe 972h-"
## [19] "Sus scrofa"                      "Xenopus tropicalis"

As you can see, there’s more than just human and mouse!

How is MSigDB structured? Collections and subcollections.

Before we dive in, it’s crucial to understand the structure of MSigDB:

Collections are the top-level categories (like main folders). There are 9 major collections labeled C1–C8 and H.
Subcollections are subdivisions within certain collections (like subfolders). Not all collections have subcollections — some stand alone!

Let’s have a look at msigdbr’s collections:

# Get information about all collections
collections <- msigdbr_collections()

# Display as a nice table
head(collections)

## # A tibble: 6 × 4
##   gs_collection gs_subcollection  gs_collection_name                num_genesets
##   <chr>         <chr>             <chr>                                    <int>
## 1 C1            ""                Positional                                 302
## 2 C2            "CGP"             Chemical and Genetic Perturbatio…         3538
## 3 C2            "CP"              Canonical Pathways                          19
## 4 C2            "CP:BIOCARTA"     BioCarta Pathways                          292
## 5 C2            "CP:KEGG_LEGACY"  KEGG Legacy Pathways                       186
## 6 C2            "CP:KEGG_MEDICUS" KEGG Medicus Pathways                      658

# datatable(collections, 
#           options = list(pageLength = 10, scrollX = TRUE),
#           caption = "MSigDB Collections Overview")

Nice! If we want to fetch data from a particular collection, we just need to run the following line:

gene_sets_df <- msigdbr(species = 'Homo sapiens', collection = 'H')

If we want a specific subcollection, then we can get it with:

gene_sets_df <- msigdbr(species = 'Homo sapiens', collection = 'C5', subcollection = 'GO:BP')

SquidTip!
When you specify only a collection without subcollection, you get ALL gene sets in that collection (including all subcollections). When you specify both, you get only that specific subset!

Nice! Let’s talk about some of the most widely used collections.

H: Hallmark Gene Sets

What are they? Hallmark gene sets represent well-defined biological states or processes. They were created by aggregating multiple MSigDB gene sets to reduce redundancy and improve interpretability.

When to use: These are perfect for getting a high-level overview of biological processes in your data. Great for initial exploratory analysis!

Number of gene sets: 50

Let’s see an example in R:

# Get Hallmark gene sets for humans
hallmark <- msigdbr(species = "Homo sapiens", collection = "H")

cat("Number of Hallmark gene sets:", length(unique(hallmark$gs_name)), "\n")

## Number of Hallmark gene sets: 50

cat("Total genes across all Hallmark sets:", nrow(hallmark), "\n\n")

## Total genes across all Hallmark sets: 7333

# View first few rows
hallmark %>%
  select(gs_name, gene_symbol, ensembl_gene) %>%
  head(10)

## # A tibble: 10 × 3
##    gs_name               gene_symbol ensembl_gene   
##    <chr>                 <chr>       <chr>          
##  1 HALLMARK_ADIPOGENESIS ABCA1       ENSG00000165029
##  2 HALLMARK_ADIPOGENESIS ABCB8       ENSG00000197150
##  3 HALLMARK_ADIPOGENESIS ACAA2       ENSG00000167315
##  4 HALLMARK_ADIPOGENESIS ACADL       ENSG00000115361
##  5 HALLMARK_ADIPOGENESIS ACADM       ENSG00000117054
##  6 HALLMARK_ADIPOGENESIS ACADS       ENSG00000122971
##  7 HALLMARK_ADIPOGENESIS ACLY        ENSG00000131473
##  8 HALLMARK_ADIPOGENESIS ACO2        ENSG00000100412
##  9 HALLMARK_ADIPOGENESIS ACOX1       ENSG00000161533
## 10 HALLMARK_ADIPOGENESIS ADCY6       ENSG00000174233

# Summary of genes per Hallmark pathway
hallmark_summary <- hallmark %>%
  group_by(gs_name) %>%
  summarise(n_genes = n()) %>%
  arrange(desc(n_genes))

# Plot
ggplot(hallmark_summary, aes(x = reorder(gs_name, n_genes), y = n_genes)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Number of Genes in Each Hallmark Gene Set",
       x = "Gene Set",
       y = "Number of Genes") +
  theme_bw(base_size = 10)

SquidTip!
Start with Hallmark gene sets if you’re new to pathway enrichment analysis. They’re well-curated, non-redundant, and easy to interpret!

C1: Positional Gene Sets

What are they? Gene sets based on chromosomal location. These include genes located in specific cytogenetic bands.

When to use: Useful for identifying regional chromosomal amplifications or deletions, or when studying copy number variations.

Number of gene sets: ~300

# Get C1 positional gene sets
c1_sets <- msigdbr(species = "Homo sapiens", collection = "C1")

cat("Number of C1 gene sets:", length(unique(c1_sets$gs_name)), "\n")

# Example: genes on chromosome 10
chr1_sets <- c1_sets %>%
  filter(grepl("chr10q|chr10p", gs_name)) %>%
  select(gs_name, gene_symbol) %>%
  distinct()

head(chr1_sets, 10)

C2: Curated Gene Sets

What are they? This is the largest collection! It includes curated gene sets from online pathway databases, publications, and knowledge of domain experts.

Subcategories:

CGP (Chemical and Genetic Perturbations): Gene sets from chemical/genetic perturbation experiments
CP (Canonical Pathways): Well-defined biochemical and metabolic pathways
- CP:BIOCARTA: Gene sets from BioCarta pathways
- CP:KEGG: Gene sets from KEGG pathways
- CP:PID: Gene sets from PID pathways
- CP:REACTOME: Gene sets from Reactome pathways
- CP:WIKIPATHWAYS: Gene sets from WikiPathways

When to use: Use C2:CP (canonical pathways) for most standard pathway enrichment analyses. KEGG and Reactome are particularly popular choices.

# Get KEGG pathways
kegg_sets <- msigdbr(species = "human", 
                     collection = "C2", 
                     subcollection = "CP:KEGG_MEDICUS")

cat("Number of KEGG gene sets:", length(unique(kegg_sets$gs_name)), "\n\n")

## Number of KEGG gene sets: 658

# Get Reactome pathways
reactome_sets <- msigdbr(species = "human", 
                         collection = "C2", 
                         subcollection = "CP:REACTOME")

cat("Number of Reactome gene sets:", length(unique(reactome_sets$gs_name)), "\n")

## Number of Reactome gene sets: 1787

SquidTip!
KEGG pathways are great for metabolic processes, while Reactome provides more detailed molecular mechanisms!

C5: Ontology Gene Sets (GO)

What are they? Gene sets derived from Gene Ontology (GO) terms. This is one of the most popular collections!

Subcategories:

GO:BP – Biological Process
GO:CC – Cellular Component
GO:MF – Molecular Function
HPO – Human Phenotype Ontology

When to use: GO terms are essential for functional annotation. Use GO:BP for processes, GO:CC for localization, and GO:MF for molecular activities.

# Get GO Biological Process terms
go_bp <- msigdbr(species = "Homo sapiens", 
                 collection = "C5", 
                 subcollection = "GO:BP")

cat("Number of GO:BP gene sets:", length(unique(go_bp$gs_name)), "\n\n")

## Number of GO:BP gene sets: 7583

# Get GO Cellular Component terms
go_cc <- msigdbr(species = "Homo sapiens", 
                 collection = "C5", 
                 subcollection = "GO:CC")

cat("Number of GO:CC gene sets:", length(unique(go_cc$gs_name)), "\n\n")

## Number of GO:CC gene sets: 1042

# Get GO Molecular Function terms
go_mf <- msigdbr(species = "Homo sapiens", 
                 collection = "C5", 
                 subcollection = "GO:MF")

cat("Number of GO:MF gene sets:", length(unique(go_mf$gs_name)), "\n")

## Number of GO:MF gene sets: 1855

We’ve only covered some gene sets here, but you can also check out other MSigDB collections like C6 (Oncogenic Signature Gene Sets), C7 (Immunologic Signature Gene Sets) and C8 (Cell Type Signature Gene Sets).

Practical Guide: Using msigdbr in R

Now that we understand the different collections, let’s see how to use msigdbr in practice!

Basic Usage

We’ve already covered some of the basics already. To fetch a specific collection, just call msigdbr():

# Basic syntax: Get gene sets for a specific species and collection
gene_sets <- msigdbr(species = "Homo sapiens", collection = "H")

# View the structure
head(gene_sets)

## # A tibble: 6 × 20
##   gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
##   <chr>       <chr>     <chr>        <chr>          <chr>        <chr>          
## 1 ABCA1       19        ENSG0000016… ABCA1          19           ENSG00000165029
## 2 ABCB8       11194     ENSG0000019… ABCB8          11194        ENSG00000197150
## 3 ACAA2       10449     ENSG0000016… ACAA2          10449        ENSG00000167315
## 4 ACADL       33        ENSG0000011… ACADL          33           ENSG00000115361
## 5 ACADM       34        ENSG0000011… ACADM          34           ENSG00000117054
## 6 ACADS       35        ENSG0000012… ACADS          35           ENSG00000122971
## # ℹ 14 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
## #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
## #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
## #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, db_version <chr>,
## #   db_target_species <chr>

The output is a table which has information about the gene sets and genes. It obviously depends on what you want to do, but I find gene_symbol (the symbol for each gene, use ensembl_id if you work with ENSEMBL IDs) and gs_name (gene set name) the most useful.

Be aware that one gene can be part of many different gene sets:

gene_sets[gene_sets$gene_symbol == 'ABCB1', c('gene_symbol', 'gs_name')]

## # A tibble: 4 × 2
##   gene_symbol gs_name                     
##   <chr>       <chr>                       
## 1 ABCB1       HALLMARK_IL2_STAT5_SIGNALING
## 2 ABCB1       HALLMARK_KRAS_SIGNALING_UP  
## 3 ABCB1       HALLMARK_PEROXISOME         
## 4 ABCB1       HALLMARK_UV_RESPONSE_UP

Working with Different Species

Nice! As we already saw, one of the best features of msigdbr is multi-species support! You can specify the species with the argument species.

# Get available species
species_df <- msigdbr_species()
# datatable(species_df, options = list(pageLength = 10))
head(species_df)

## # A tibble: 6 × 2
##   species_name            species_common_name                                   
##   <chr>                   <chr>                                                 
## 1 Anolis carolinensis     Carolina anole, green anole                           
## 2 Bos taurus              bovine, cattle, cow, dairy cow, domestic cattle, dome…
## 3 Caenorhabditis elegans  <NA>                                                  
## 4 Canis lupus familiaris  dog, dogs                                             
## 5 Danio rerio             leopard danio, zebra danio, zebra fish, zebrafish     
## 6 Drosophila melanogaster fruit fly

# Example: Get mouse Hallmark gene sets
mouse_hallmark <- msigdbr(species = "Mus musculus", collection = "H")

## Using human MSigDB with ortholog mapping to mouse. Use `db_species = "MM"` for mouse-native gene sets.
## This message is displayed once per session.

cat("Number of genes in human Hallmark:", nrow(hallmark), "\n")

## Number of genes in human Hallmark: 7333

cat("Number of genes in mouse Hallmark:", nrow(mouse_hallmark), "\n")

## Number of genes in mouse Hallmark: 7379

Filtering and Preparing Gene Sets

You may need your gene sets in different formats depending on the tool you’d like to use. A popular choice for pathway enrichment analysis in clusterProfiler, for which I have an blogpost here.

I’ll briefly cover how to format your gene sets for pathway enrichment analysis with clusterProfiler!

Format for clusterProfiler

# Convert to the format needed by clusterProfiler
# We need two columns: gs_name (pathway) and gene_symbol (gene)
gene_sets_for_cp <- hallmark %>%
  select(gs_name, gene_symbol)
head(gene_sets_for_cp)

## # A tibble: 6 × 2
##   gs_name               gene_symbol
##   <chr>                 <chr>      
## 1 HALLMARK_ADIPOGENESIS ABCA1      
## 2 HALLMARK_ADIPOGENESIS ABCB8      
## 3 HALLMARK_ADIPOGENESIS ACAA2      
## 4 HALLMARK_ADIPOGENESIS ACADL      
## 5 HALLMARK_ADIPOGENESIS ACADM      
## 6 HALLMARK_ADIPOGENESIS ACADS

Filtering by Gene Set Size

Sometimes you want to exclude very small or very large gene sets:

# Calculate gene set sizes
pathway_sizes <- gene_sets_for_cp %>%
  group_by(gs_name) %>%
  summarise(n_genes = n())

# Filter pathways with 15-500 genes
filtered_pathways <- pathway_sizes %>%
  filter(n_genes >= 15 & n_genes <= 500)

gene_sets_filtered <- gene_sets_for_cp %>%
  filter(gs_name %in% filtered_pathways$gs_name)

cat("Original number of pathways:", length(unique(gene_sets_for_cp$gs_name)), "\n")

## Original number of pathways: 50

cat("Filtered number of pathways:", length(unique(gene_sets_filtered$gs_name)), "\n")

## Filtered number of pathways: 50

But you can also filter for gene size when using clusterProfiler’s functions, so I wouldn’t recommend filtering at this stage.

Nice! Let’s put it all together with a complete example!

msigdbr and clusterProfiler minitutorial

As you can see, I’m using the hallmark dataframe I downloaded using msigdbr and then just getting some “pretend” DGEs (differentially expressed genes) to run pathway enrichment analysis on them.

I’m using clusterProfiler’s enricher function to carry out pathway enrichment analysis. You can read more about it in the clusterProfiler blogpost, but I’m using the defaults. Depending on what you want to do; you may want to edit the filters. You can read more about the different arguments by running ?enricher.

TERM2GENE is the argument we’re interested in: we’ll pass on the hallmark gene sets we downloaded from msigdbr. Easy!

suppressPackageStartupMessages(library(clusterProfiler))

# Simulate some differentially expressed genes
set.seed(123)
all_genes <- unique(hallmark$gene_symbol)
deg_genes <- sample(all_genes, 200)

cat("Total genes in Hallmark collection:", length(all_genes), "\n")

## Total genes in Hallmark collection: 4384

cat("Number of DE genes for testing:", length(deg_genes), "\n\n")

## Number of DE genes for testing: 200

# Prepare gene sets
gene_sets_for_enrichment <- hallmark %>%
  select(gs_name, gene_symbol)

# Run enrichment analysis
enrichment_result <- enricher(
  gene = deg_genes,
  TERM2GENE = gene_sets_for_enrichment,
  universe = all_genes,
  pvalueCutoff = 0.05,
  qvalueCutoff = 0.2,
  minGSSize = 10,
  maxGSSize = 500
)

# View results
cat("Found", nrow(enrichment_result@result), "enriched pathways!\n\n")

## Found 49 enriched pathways!

enrichment_result@result %>%
    arrange(p.adjust) %>%
    select(ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, Count) %>%
    head(10) %>% 
  print()

##                                                                ID
## HALLMARK_IL2_STAT5_SIGNALING         HALLMARK_IL2_STAT5_SIGNALING
## HALLMARK_MITOTIC_SPINDLE                 HALLMARK_MITOTIC_SPINDLE
## HALLMARK_CHOLESTEROL_HOMEOSTASIS HALLMARK_CHOLESTEROL_HOMEOSTASIS
## HALLMARK_MYC_TARGETS_V2                   HALLMARK_MYC_TARGETS_V2
## HALLMARK_G2M_CHECKPOINT                   HALLMARK_G2M_CHECKPOINT
## HALLMARK_KRAS_SIGNALING_DN             HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_UV_RESPONSE_UP                   HALLMARK_UV_RESPONSE_UP
## HALLMARK_ESTROGEN_RESPONSE_EARLY HALLMARK_ESTROGEN_RESPONSE_EARLY
## HALLMARK_SPERMATOGENESIS                 HALLMARK_SPERMATOGENESIS
## HALLMARK_PANCREAS_BETA_CELLS         HALLMARK_PANCREAS_BETA_CELLS
##                                                       Description GeneRatio
## HALLMARK_IL2_STAT5_SIGNALING         HALLMARK_IL2_STAT5_SIGNALING    13/200
## HALLMARK_MITOTIC_SPINDLE                 HALLMARK_MITOTIC_SPINDLE    13/200
## HALLMARK_CHOLESTEROL_HOMEOSTASIS HALLMARK_CHOLESTEROL_HOMEOSTASIS     6/200
## HALLMARK_MYC_TARGETS_V2                   HALLMARK_MYC_TARGETS_V2     5/200
## HALLMARK_G2M_CHECKPOINT                   HALLMARK_G2M_CHECKPOINT    13/200
## HALLMARK_KRAS_SIGNALING_DN             HALLMARK_KRAS_SIGNALING_DN    13/200
## HALLMARK_UV_RESPONSE_UP                   HALLMARK_UV_RESPONSE_UP    10/200
## HALLMARK_ESTROGEN_RESPONSE_EARLY HALLMARK_ESTROGEN_RESPONSE_EARLY    12/200
## HALLMARK_SPERMATOGENESIS                 HALLMARK_SPERMATOGENESIS     8/200
## HALLMARK_PANCREAS_BETA_CELLS         HALLMARK_PANCREAS_BETA_CELLS     3/200
##                                   BgRatio    pvalue  p.adjust Count
## HALLMARK_IL2_STAT5_SIGNALING     199/4384 0.1197624 0.9314402    13
## HALLMARK_MITOTIC_SPINDLE         199/4384 0.1197624 0.9314402    13
## HALLMARK_CHOLESTEROL_HOMEOSTASIS  74/4384 0.1198720 0.9314402     6
## HALLMARK_MYC_TARGETS_V2           58/4384 0.1231679 0.9314402     5
## HALLMARK_G2M_CHECKPOINT          200/4384 0.1231687 0.9314402    13
## HALLMARK_KRAS_SIGNALING_DN       200/4384 0.1231687 0.9314402    13
## HALLMARK_UV_RESPONSE_UP          158/4384 0.1829955 0.9314402    10
## HALLMARK_ESTROGEN_RESPONSE_EARLY 200/4384 0.2002102 0.9314402    12
## HALLMARK_SPERMATOGENESIS         135/4384 0.2733060 0.9314402     8
## HALLMARK_PANCREAS_BETA_CELLS      40/4384 0.2745888 0.9314402     3

You can also use msigdbr to fetch gene sets for other tools, such as GSEA.

Best Practices and Tips

Nice! We covered the basics on the MSigDB database of gene sets and how to use it using msigdbr. Just leaving a few more tips which I hope will help you when deciding on which gene sets to use!

Choosing the Right Collection

This, of course, depends on what you’re looking for. I find Hallmark, Reactome and GO:BP the most useful for a quick overview, but if I’m doing some specific analysis (i.e., cancer), I might explore other pathways.

Here’s a quick decision guide:

Start with Hallmark (H) - Best for exploratory analysis and high-level overview
Use C2:CP for detailed pathways
- KEGG: Metabolic and signaling pathways
- Reactome: Detailed molecular mechanisms
Use C5:GO:BP for functional processes - Most comprehensive functional annotation
Use C7 for immune studies - Specialized immune cell signatures
Use C8 for cell type identification - Single-cell analysis

Gene ID Considerations

Make sure your gene identifiers match! If you have Ensembl IDs, use ensembl_gene. If you have symbols, use gene_symbol.

Saving Your Gene Sets

Of course, you can easily save your gene sets, as .rds objects or a .csv / .xlsx table… If you download them from the website directly, they’ll be in .gmt format, which you can also read into R.

# Save as RDS for R
saveRDS(gene_sets_for_cp, "hallmark_genesets.rds")

# Save as CSV
write.csv(gene_sets_for_cp, "hallmark_genesets.csv", row.names = FALSE)

Nice! And that’s the end of this tutorial.

In this post, we covered MSigDB gene sets and how to use the msigdbr package to fetch different gene set collections. Hope you found it useful!

As you saw, there’s many different gene sets to choose from, and the msigdbr package makes it really easy to access gene sets in R. You can choose from different collections (and subcollections) and integrate this with pathway enrichment analysis tools such as clusterProfiler. Hope this makes your life easier, it definitely did mine! (I was once painstakingly downloading each gene set .gmt file individually…)

Thanks for reading, hope this was helpful and a clear explanation on how to use msigdbr in R! If you have questions or suggestions, feel free to reach out. Happy analyzing!

MSigDB gene sets: easy msigdbr in R

MSigDB gene sets: easy msigdbr in R

Understanding MSigDB Collections

Installation

Setting up our environment

How is MSigDB structured? Collections and subcollections.

H: Hallmark Gene Sets

C1: Positional Gene Sets

C2: Curated Gene Sets

C5: Ontology Gene Sets (GO)

Practical Guide: Using msigdbr in R

Basic Usage

Working with Different Species

Filtering and Preparing Gene Sets

Format for clusterProfiler

Filtering by Gene Set Size

msigdbr and clusterProfiler minitutorial

Best Practices and Tips

Choosing the Right Collection

Gene ID Considerations

Saving Your Gene Sets

Additional resources

Submit a Comment Cancel reply

Recent posts

MSigDB gene sets: easy msigdbr in R

MSigDB gene sets: easy msigdbr in R

Understanding MSigDB Collections

Installation

Setting up our environment

How is MSigDB structured? Collections and subcollections.

H: Hallmark Gene Sets

C1: Positional Gene Sets

C2: Curated Gene Sets

C5: Ontology Gene Sets (GO)

Practical Guide: Using msigdbr in R

Basic Usage

Working with Different Species

Filtering and Preparing Gene Sets

Format for clusterProfiler

Filtering by Gene Set Size

msigdbr and clusterProfiler minitutorial

Best Practices and Tips

Choosing the Right Collection

Gene ID Considerations

Saving Your Gene Sets

Additional resources

Submit a Comment Cancel reply

Popular Posts

Recent posts