What is gene set enrichment analysis and how can you use it to summarise your differential gene expression analysis results?

This post will give you a simple and practical explanation of Gene Set Enrichment Analysis, or GSEA for short.

You will find out:

    • What Gene Set Enrichment Analysis (GSEA) is and how it differs from other Pathway Enrichment Analysis methods
    • What you need to perform GSEA
    • The main idea behind all GSEA algorithms
    • How to interpret GSEA results

If you are more of a video-based learner, you can check out my YouTube video. Otherwise, just keep reading!

Let’s dive in!


What is GSEA?

Before we explain Gene Set Enrichment Analysis, you need to be familiar with Pathway Enrichment Analysis (PEA) methods.

In this other post, you can read more about PEA.

I will give you a quick recap.

Pathway enrichment analysis methods take a list of differentially expressed (DE) genes as input, and identify the sets in which the DE genes are over-represented or under-represented.

So they basically summarise long lists of genes into a shorter list of pathways.

The significance of each pathway is measured by calculating the probability of observing at least that many DE genes in the pathway purely by chance. A low p-value means the overlap is unlikely to be due to chance alone, so the pathway is considered overrepresented.

These approaches are known as Over-Representation Analysis (ORA).
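To make this concrete, here is a minimal sketch of the kind of test ORA runs for a single pathway. It assumes a one-sided hypergeometric test, which is a common choice but not the only one. The function name `ora_p_value` and all the numbers are made up for illustration.

```python
from math import comb

def ora_p_value(n_background, n_in_pathway, n_de, n_de_in_pathway):
    """Probability of seeing at least n_de_in_pathway DE genes in the
    pathway by chance (one-sided hypergeometric test)."""
    total = comb(n_background, n_de)
    upper = min(n_in_pathway, n_de)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_de - k)
        for k in range(n_de_in_pathway, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, 100 genes in the pathway,
# 500 DE genes, 10 of which belong to the pathway.
p = ora_p_value(20000, 100, 500, 10)
print(f"ORA p-value: {p:.3g}")
```

With these numbers you would expect only 2 or 3 pathway genes among the DE genes by chance, so finding 10 gives a small p-value.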

The problem in overrepresentation analysis methods is that we first need to select differentially expressed genes.

Traditionally, we use thresholds of fold change > +2 or < -2, and p-value < 0.05.

But what if a gene has a fold change of 1.99 and a p-value of 0.051?

Genes like this may still be important, and leaving them out can change the results of overrepresentation analysis a lot.

And it is especially bad if you have few differentially expressed genes to start with.
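Here is a small sketch of that problem. The gene names, values and the helper `select_de_genes` are hypothetical, and the fold changes use the signed convention from the thresholds above (+2 means doubled, -2 means halved).

```python
# Hypothetical differential expression results: (gene, fold change, p-value).
results = [
    ("GENE_A", 3.50, 0.001),
    ("GENE_B", 1.99, 0.051),  # the borderline gene from the text
    ("GENE_C", -2.40, 0.020),
    ("GENE_D", 1.10, 0.600),
]

def select_de_genes(results, fc_cutoff=2.0, p_cutoff=0.05):
    """Keep genes that pass both the fold change and the p-value threshold."""
    return [gene for gene, fc, p in results if abs(fc) > fc_cutoff and p < p_cutoff]

strict = select_de_genes(results)
relaxed = select_de_genes(results, fc_cutoff=1.9, p_cutoff=0.06)

print("Strict thresholds: ", strict)
print("Relaxed thresholds:", relaxed)
```

GENE_B only makes it into the list with the relaxed thresholds, so any pathway it belongs to gets a different overlap count depending on a fairly arbitrary cutoff.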

So the issue with overrepresentation analysis methods is that the results depend a lot on our selection criteria.

In one sentence,

Overrepresentation Analysis methods depend a lot on the criteria used to select differentially expressed genes.

To solve this problem, a second generation of methods was designed.

They are called Functional Class Scoring methods (FCS).

How did they eliminate this dependency on gene selection criteria? Basically, by taking all genes into consideration.

But they do it smartly: instead of looking only at individual genes with large expression changes, they also pick up smaller but coordinated changes across sets of functionally related genes.

One of the most popular approaches is Gene Set Enrichment Analysis, or GSEA.

How does GSEA work?

The steps to perform Gene Set Enrichment Analysis are very similar to those of overrepresentation analysis methods. The big difference is that the input is not a plain list of genes, but a ranked list of genes.

Wait a minute. Ranking genes?

This basically means that genes are ranked by some score.

A common way of ranking genes is by level of differential expression. The p-values tell us how significant the change is. The log2 fold changes tell us the direction and strength of the change, that is, whether genes are upregulated or downregulated and by how much.

We can combine both to get a ranked list of genes. This will order the genes both by significance and the direction of change.

At the top of the list you’ll have the most significantly upregulated genes, and at the bottom the most significantly downregulated genes.

So this ranked list of genes is our input.

A ranked list of genes is a list of genes ordered by some metric, for example, by level of differential expression.
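As a rough sketch, one way to combine the two values is to multiply the sign of the log2 fold change by -log10 of the p-value. This particular formula is an assumption for illustration (other ranking metrics exist), and the gene names and values are made up.

```python
import math

# Hypothetical results: gene -> (log2 fold change, p-value).
de_results = {
    "GENE_A": (2.1, 0.0001),
    "GENE_B": (0.4, 0.30),
    "GENE_C": (-1.8, 0.001),
    "GENE_D": (-0.2, 0.70),
    "GENE_E": (1.2, 0.01),
}

def rank_score(log2fc, p_value):
    """Signed significance: positive for upregulated genes, negative for
    downregulated ones. Assumes p_value is greater than 0."""
    sign = 1.0 if log2fc >= 0 else -1.0
    return sign * -math.log10(p_value)

# Highest score first: significant upregulated genes end up at the top,
# significant downregulated genes at the bottom.
ranked_genes = sorted(de_results, key=lambda g: rank_score(*de_results[g]), reverse=True)

for gene in ranked_genes:
    print(f"{gene}: {rank_score(*de_results[gene]):+.2f}")
```

Genes with weak, non-significant changes land in the middle of the list, which is exactly where you want them.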

Then, GSEA basically checks how the genes for a specific gene set are distributed in your list.

It checks whether members of a gene set tend to occur toward the top (or bottom) of the list.

GSEA easily explained with an example

Time for an example!

Imagine we are studying the gene expression differences between liver cells from healthy liver tissue and liver cells from tissue with alcoholic liver disease.

We compared the gene expression in alcoholic liver disease vs healthy cells.

On your left (in blue) you can see your results from differential gene expression analysis.

On your right (in orange), your ranked gene list.

Now let’s take the gene set involved in interleukin 6 production. There are 181 genes involved in IL-6 production.

To check if IL-6 production is enriched in our list of genes, we basically check how its genes are distributed in our ranked list.

This is just a simple example. But if we had thousands of genes, it might look a bit more like this:

Did you notice a pattern?

Most genes involved in IL-6 production are at the top of the list, since most genes in the IL-6 production pathway were upregulated in our dataset.

Remember that our ranked list means upregulated, significant genes are at the top. So upregulated pathways will have most of their genes towards the top of our list.

You can probably figure out how the distribution would look for downregulated pathways.

That’s right! The opposite will happen for downregulated pathways.

For example, genes involved in cellular division, which we found to be downregulated in alcoholic liver vs healthy cells, will mostly be at the bottom of the list.

What about pathways that are not enriched? Genes involved in heart contraction, for example, will look like this: they are evenly distributed across our list.

Obviously, we need a proper statistical test to identify which pathways show a non-random distribution across this sorted list.

The most commonly used one is a Kolmogorov-Smirnov-like test.
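To give you a feel for it, here is a simplified sketch of a Kolmogorov-Smirnov-like running sum. It walks down the ranked list, steps up when a gene belongs to the gene set and steps down when it does not, and reports the largest deviation from zero as the enrichment score. This is an unweighted toy version with a gene-label permutation test, both chosen for simplicity; real GSEA tools typically weight the steps by the ranking metric and offer other permutation schemes. All function and gene names are hypothetical.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-like running sum; returns the signed maximum deviation."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Shuffle the gene order and count how often a score at least as
    extreme as the observed one shows up by chance."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Toy ranked list of 20 hypothetical genes, most upregulated first.
ranked_genes = [f"GENE_{i}" for i in range(1, 21)]
top_set = {"GENE_1", "GENE_2", "GENE_3", "GENE_5"}       # clustered at the top
spread_set = {"GENE_2", "GENE_8", "GENE_13", "GENE_19"}  # spread across the list

for name, gene_set in [("top-heavy", top_set), ("spread out", spread_set)]:
    es, p = permutation_p_value(ranked_genes, gene_set)
    print(f"{name}: ES = {es:.2f}, p = {p:.3f}")
```

A gene set clustered at the top gets a large positive score, one clustered at the bottom gets a large negative score, and one spread across the list stays close to zero.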

The result is basically the same as with overrepresentation analysis.

We will obtain a p-value for each gene set, which we need to correct for multiple testing since we are repeatedly testing thousands of gene sets (for example, Gene Ontology terms).
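Here is a minimal sketch of one popular correction, the Benjamini-Hochberg procedure. The choice of method is an assumption for illustration (your tool may use a different one), and the gene set p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four gene sets.
raw = {
    "IL-6 production": 0.001,
    "cell division": 0.02,
    "heart contraction": 0.60,
    "another gene set": 0.03,
}

adjusted = benjamini_hochberg(list(raw.values()))
for name, p_adj in zip(raw, adjusted):
    print(f"{name}: adjusted p = {p_adj:.3f}")
```

The adjusted values are always at least as large as the raw ones, which is the price you pay for testing many gene sets at once.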

This way, we were able to reduce our long list of genes into a more manageable list of biological pathways.

So as you can see, gene set enrichment analysis has the advantage that you don’t filter out genes prior to the analysis, and also it takes into account how significant the changes are, and in which direction.

So, in summary,

Gene set enrichment analysis takes your ranked gene list and checks where the genes of each gene set fall in it. By statistically testing this distribution, it determines which pathways are enriched at the top or bottom of your list.

Just to wrap up, there is a newer generation of pathway enrichment analysis methods, Topology-Based (TB) methods, which also take into account dependencies and interactions between genes.

But that is a story for another day.

If you are interested in learning how to interpret GSEA plots and GSEA results, check out my other post!

