Cell type annotation for scRNAseq

Once you have preprocessed your single-cell RNA sequencing (scRNAseq) data, it is time for one of the biggest challenges in a standard scRNAseq pipeline: annotating cell types.

The scientific community has not agreed on a standard workflow to annotate cell types. Cell type annotation, in fact, largely depends on the dataset itself.

In this post, we will discuss some of the best approaches and methods to carry out cell type annotation for scRNAseq. I’ll share some tips and tricks on cell type annotation, as well as resources and tools that really help me when I have to annotate my cells!

So if you are ready… let’s dive in!

Annotating cell types: overview

So you finished your preprocessing steps (quality control, filtering, normalisation, PCA, UMAP…) and you have a normalised gene expression matrix. Now what?

In most workflows, the next step is to annotate your cell types, that is, to label your cells by assigning a cell type to each of them. Cell type annotation is often the most complicated part of a scRNAseq workflow, and it is especially important to do it carefully because it strongly influences all downstream analysis.

So how do we annotate our cells?

In a nutshell:

1. Visualise your non-annotated ‘single-cell map’. After quality control and filtering, group similar cells with a clustering algorithm and project all cells into two dimensions with a method such as t-SNE or UMAP, so you can see the clusters on a single plot. The idea is to group your cells based on their gene expression profile (make sure to correct for technical variation and batch effects first!). And since a cell’s gene expression profile largely reflects its cell type, we can say that cells are clustered together based on cell type*.

2. Automatic cell type annotation. Automatic annotation uses a predefined set of ‘marker genes’ (genes that are specifically expressed in a known cell type) or reference single-cell data (an existing expertly annotated single-cell map) to identify and label individual cells or cell clusters by matching their gene expression patterns (signatures) to those of known cell types.

3. Manual cell type annotation. This involves studying genes and gene functions specific to each cell cluster or pattern to verify automatic cell annotations and identify novel cell types.

* Which is not necessarily true. Cells of the same cell type in different states may be detected in separate clusters. That is why it is best to use the term “cell identities” rather than “cell types”. Before clustering and annotating clusters, you should decide which level of annotation detail, and therefore which cluster resolution, you need for your analysis.
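To make the marker-based flavour of step 2 more concrete, here is a minimal sketch in plain Python. It is not how any particular tool works: the per-cluster expression values are made up, the marker lists are a tiny illustrative selection, and the function name `annotate_cluster` and the `min_score` cutoff are assumptions for this example only.

```python
from statistics import mean

# Illustrative marker lists (assumption: in practice, take these from the
# literature or a curated marker database that suits your tissue).
MARKERS = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["CD79A", "MS4A1"],
    "Macrophage": ["CD68", "CSF1R"],
}

# Toy data: mean normalised expression of each gene per cluster (made-up values).
cluster_means = {
    "0": {"CD3D": 2.1, "CD3E": 1.8, "CD79A": 0.1, "MS4A1": 0.0, "CD68": 0.2, "CSF1R": 0.1},
    "1": {"CD3D": 0.1, "CD3E": 0.0, "CD79A": 2.5, "MS4A1": 1.9, "CD68": 0.1, "CSF1R": 0.0},
    "2": {"CD3D": 0.0, "CD3E": 0.1, "CD79A": 0.1, "MS4A1": 0.1, "CD68": 2.2, "CSF1R": 1.7},
    "3": {"CD3D": 0.2, "CD3E": 0.1, "CD79A": 0.1, "MS4A1": 0.1, "CD68": 0.3, "CSF1R": 0.1},
}


def annotate_cluster(expr, markers, min_score=0.5):
    """Return the cell type whose markers have the highest mean expression.

    Clusters where no marker set reaches `min_score` are labelled "Unknown",
    so they can be checked by hand instead of being forced into a label.
    """
    scores = {
        cell_type: mean(expr.get(gene, 0.0) for gene in genes)
        for cell_type, genes in markers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "Unknown"


labels = {cluster: annotate_cluster(expr, MARKERS) for cluster, expr in cluster_means.items()}
print(labels)
```

In this toy example, cluster 3 expresses none of the marker sets strongly, so it stays "Unknown". That is exactly the kind of cluster you would then inspect manually in step 3.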

What is a cell type?

Although the question might seem obvious to some, when you assign cell identities there is no clear answer as to which ‘cell type’ you are looking for.

There is no consensus on what a ‘cell type’ is. That’s one of the main reasons why there isn’t a consensus on cell type annotation either.

Let’s take a sample from tumour tissue as an example. If you are looking into cancer cells, you might be fine just assigning ‘immune cells’ to a cluster, without the need to refine the cell type annotation. On the other hand, if you are investigating the immune response in tumour tissue, you might want to distinguish T cells, B cells, macrophages… Within each group, you can further distinguish CD8+ T cells, CD4+ T cells, regulatory T cells… You could also further divide them into ‘active’ or ‘resting’, for example.

In summary, the granularity or resolution of your cell identities depends on you. Specifically, it depends on your hypothesis, and the question(s) you want to answer with that dataset.

Assigning cell types is also tricky because some cells have intermediate phenotypes. There is often a gradient between different cell subtypes and it is not always easy to define the border between them. For example, differentiating cells might have a gene expression profile that is somewhere in between two ‘classic’ cell types.

To sum up, you must decide what level of ‘cell identity’ you are looking for.

This is the first step in cell type annotation.

At which stage in the pipeline should I annotate cell types?

Usually cell type annotation is done after dimensionality reduction and scRNAseq clustering.

To give you a rough overview of the steps in a common scRNAseq pipeline:

1. Get raw reads
2. Quality control
3. Filtering, doublet detection, checking for cell cycle and other sources of variation
4. Normalisation and batch correction
5. Dimensionality reduction
6. Clustering
7. Cell type annotation

Once you have grouped your cells into informative clusters, you can annotate the clusters by assigning each one a cell type.

So what is an informative cluster? In the next section, we’ll talk about estimating the number of clusters.



Squidtip

It is very important to normalise and batch correct your data before cell type annotation – make sure your cells are clustering together by similarity in gene expression profiles and not, for example, because they belong to the same patient!
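One quick sanity check for this tip is to look at which patients contribute to each cluster. The sketch below is a rough illustration only: the cell metadata is made up, and the function names and the 0.75 cutoff (`max_fraction`) are arbitrary assumptions, not an established threshold.

```python
from collections import Counter, defaultdict

# Toy per-cell metadata as (cluster, patient) pairs. Values are made up.
cells = [
    ("0", "P1"), ("0", "P2"), ("0", "P3"), ("0", "P1"), ("0", "P2"), ("0", "P3"),
    ("1", "P2"), ("1", "P2"), ("1", "P2"), ("1", "P2"), ("1", "P1"),
]


def patient_fractions(cells):
    """For each cluster, return the fraction of its cells coming from each patient."""
    counts = defaultdict(Counter)
    for cluster, patient in cells:
        counts[cluster][patient] += 1
    return {
        cluster: {patient: n / sum(per_patient.values()) for patient, n in per_patient.items()}
        for cluster, per_patient in counts.items()
    }


def flag_patient_driven(cells, max_fraction=0.75):
    """List clusters in which a single patient contributes more than `max_fraction` of cells."""
    fractions = patient_fractions(cells)
    return sorted(c for c, per_patient in fractions.items() if max(per_patient.values()) > max_fraction)


flagged = flag_patient_driven(cells)
print(flagged)  # cluster "1" is 80% patient P2 in this toy data
```

A flagged cluster is only a prompt to look closer. It may point to an uncorrected batch effect, but it can also reflect real biology that happens to be specific to one patient.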

How many clusters should I group my cells into?

Once you decide on the level of cell identity you are looking for (see the section above: What is a cell type?), you can roughly estimate the number of clusters you want to group your cells into.

However, you don’t want to split the dataset too much from the start. Too many clusters can make annotation more challenging, especially if you have many different cell types in your dataset.

Let’s visualise this with a specific example. Imagine you have scRNAseq data from damaged skin tissue with immune infiltration. If we are looking into the immune compartment, it might be wise to first separate normal skin cells (keratinocytes, dermis, endothelial cells…) from a few rough immune metaclusters: ideally a ‘myeloid metacluster’ (macrophages, dendritic cells, etc.) and a ‘lymphoid metacluster’, with a few clusters separating T cells from B cells.

If we look at all immune cells (CD45+) together, the highly variable genes will mostly capture big differences (e.g., a B cell is very different from a T cell) but not more subtle ones (e.g., plasma cells and plasmablasts will probably end up in the same cluster!).

We might want to take the ‘lymphoid metacluster’ and treat it as a whole new dataset, re-clustering it. This time we might get a few more clusters, perhaps distinguishing CD4+ and CD8+ T cells.

But what if this is not enough? We might be interested in different subtypes or states of CD4+ T cells: again, we can take the cluster(s) that we annotated as CD4+ T cells and re-cluster them.

This trick is called subclustering. Basically, you first cluster and annotate at a coarser level, then select some interesting clusters, group them, and treat this group as a new dataset to get new (sub)clusters. By focusing on just a subset of the data, you recompute the highly variable genes within that subset, so your resolution increases and you can explore more specific cell subtypes more easily.
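Here is a small sketch of why subclustering helps, using made-up expression values. As a simplifying assumption, it ranks genes by raw variance, which is only a stand-in for the highly variable gene selection that real pipelines perform (usually on a mean-adjusted dispersion). The helper name `rank_by_variance` is hypothetical.

```python
from statistics import pvariance

# Toy normalised expression (made-up values): one list per gene, one entry per cell.
cell_labels = ["T", "T", "T", "T", "B", "B", "B", "B"]  # coarse annotation
expression = {
    "CD3E":  [3.0, 3.1, 2.9, 3.0, 0.0, 0.1, 0.0, 0.1],  # separates T cells from B cells
    "CD79A": [0.0, 0.0, 0.1, 0.0, 2.5, 2.6, 2.4, 2.5],  # separates B cells from T cells
    "CD8A":  [1.5, 1.6, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0],  # varies only within T cells
    "ACTB":  [2.0, 2.1, 2.0, 2.1, 2.0, 2.1, 2.0, 2.1],  # roughly constant everywhere
}


def rank_by_variance(expression, keep):
    """Rank genes from most to least variable, using only cells where `keep` is True."""
    variances = {
        gene: pvariance([v for v, k in zip(values, keep) if k])
        for gene, values in expression.items()
    }
    return sorted(variances, key=variances.get, reverse=True)


all_cells = [True] * len(cell_labels)
t_cells_only = [label == "T" for label in cell_labels]

ranking_all = rank_by_variance(expression, all_cells)
ranking_t = rank_by_variance(expression, t_cells_only)
print(ranking_all)  # lineage genes (CD3E, CD79A) come first
print(ranking_t)    # CD8A comes first once only T cells are considered
```

Across the whole toy dataset the lineage genes dominate, but once only the T cells are kept, the gene that splits them into subsets rises to the top. That shift is what gives the re-clustering its extra resolution.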



Squidtip

Use subclustering to increase resolution in interesting subsets.

What are some bioinformatic tools to perform cell type annotation?

What is the best tool to perform cell type annotation?

You probably knew the answer already: there is no perfect tool. They all have pros and cons, and the tool that will work best largely depends on your dataset.

Here are a few resources you might find useful:

Automated methods for cell type annotation on scRNA-seq data (Pasquini et al.). An amazing review which compares different annotation tools.
Current best practices in single-cell RNA-seq analysis: a tutorial (Luecken and Theis). This is a great review on preprocessing scRNAseq data, including some nice insights into cell identity annotation. A bit dated but still great.
Guided clustering tutorial from Seurat. A great tutorial to start with to learn how to carry out scRNAseq (pre)processing with the R package Seurat: from filtering, to normalisation, to clustering and cell type annotation!
Automatic cell type identification methods for single-cell RNA sequencing (Xie et al.). This publication compares different automatic annotation tools. Highly recommended if you would like a general overview before choosing a method!

What are some tips and tricks I use for cell type annotation?

Personally, I like to first take a look at the main differentially expressed genes between clusters, and maybe do some rough gene set enrichment analysis (GSEA) or pathway enrichment analysis (PEA) to get an idea of the identity of ‘general’ or ‘high-level’ clusters. Then, I use a combination of reference-based and marker-based tools, often with a few different references (if I have them!) and marker lists, and get a consensus. Some tools I use are SingleR and scType.
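The ‘get a consensus’ step can be as simple as a majority vote per cluster. The sketch below assumes you have already collected one label per cluster from each annotation run. The labels here are typed by hand rather than produced by any tool, and the function name `consensus` and the `min_agreement` parameter are assumptions for illustration.

```python
from collections import Counter

# Hypothetical per-cluster labels from three annotation runs (made-up values).
predictions = {
    "reference_A": {"0": "T cell", "1": "B cell", "2": "Macrophage"},
    "reference_B": {"0": "T cell", "1": "B cell", "2": "Monocyte"},
    "marker_list": {"0": "T cell", "1": "Plasma cell", "2": "Dendritic cell"},
}


def consensus(predictions, min_agreement=2):
    """Majority vote per cluster; clusters without enough agreement are flagged for review."""
    clusters = sorted({cluster for labels in predictions.values() for cluster in labels})
    result = {}
    for cluster in clusters:
        votes = Counter(labels[cluster] for labels in predictions.values() if cluster in labels)
        label, n_votes = votes.most_common(1)[0]
        result[cluster] = label if n_votes >= min_agreement else "Needs manual review"
    return result


consensus_labels = consensus(predictions)
print(consensus_labels)
```

Here cluster 2 gets three different labels, so it is flagged instead of being given an arbitrary winner. Disagreement like this is often a hint that the cluster needs subclustering or a better-matched reference.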

Once my clusters are annotated, I group clusters I’m interested in (e.g., all myeloid cells) and repeat the analysis to get a finer annotation (e.g., macrophages, monocytes…). I often have to repeat this a few times! If you have more specific cells you want to distinguish (e.g., M1 macrophages vs M2 macrophages), it might be necessary to correct for certain genes (ribosomal, mitochondrial…) to get more informative highly variable genes.

Some general tips/recommendations:

Marker genes that work for a cell identity in one dataset might not work in another dataset, so you must tailor your cell identity annotation strategy to your dataset.
If you are using reference-based automatic annotation tools, choose your reference dataset wisely!
You don’t need to annotate all your clusters at once – subclustering particular cell clusters is a valid approach to focus on more detailed substructures in a dataset.
Performing an unbiased differential gene expression analysis can help you identify up-regulated genes in each cluster and annotate finer clusters. You can also get some functional insights about each cluster by running gene set enrichment analysis or pathway enrichment analysis (using gene signatures from MSigDB, for instance). Of course, when you analyse very small cell clusters, they may no longer have (enough) biological relevance.
All in all, you’ll probably need to try out different methods, references and markers until you’re satisfied with your result and have cell identities ‘you can trust’ – if it doesn’t work on the first try… don’t give up! It’s completely normal. Cell type annotation is often one of the most time-consuming steps but it’s important to do it properly – after all, most (if not all) of your downstream analysis will be based on the identities you assign to your cells!
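As a rough illustration of the differential expression tip above, the sketch below ranks genes by how much higher they are in one cluster than in all other cells. It is deliberately simplified: the values are made up and assumed to be on a linear (not log-transformed) normalised scale, there is no statistical test, and the function name `top_upregulated` and the `pseudocount` parameter are assumptions. A real analysis would use a dedicated method with proper statistics.

```python
from math import log2
from statistics import mean

# Toy data (made-up values): cluster label per cell and expression per gene.
clusters = ["0", "0", "0", "1", "1", "1"]
expression = {
    "CD3E":  [3.0, 2.8, 3.2, 0.1, 0.0, 0.2],
    "MS4A1": [0.0, 0.1, 0.0, 2.5, 2.7, 2.6],
    "ACTB":  [2.0, 2.1, 1.9, 2.0, 2.1, 2.0],
}


def top_upregulated(expression, clusters, target, n=1, pseudocount=1.0):
    """Return the `n` genes with the highest log2 fold change in `target` versus all other cells."""
    log_fold_changes = {}
    for gene, values in expression.items():
        inside = mean(v for v, c in zip(values, clusters) if c == target)
        outside = mean(v for v, c in zip(values, clusters) if c != target)
        # The pseudocount avoids division by zero and damps ratios between tiny values.
        log_fold_changes[gene] = log2((inside + pseudocount) / (outside + pseudocount))
    return sorted(log_fold_changes, key=log_fold_changes.get, reverse=True)[:n]


top_cluster_0 = top_upregulated(expression, clusters, "0")
top_cluster_1 = top_upregulated(expression, clusters, "1")
print(top_cluster_0, top_cluster_1)
```

The top genes per cluster are a starting point for manual annotation: you can look them up, compare them against known markers, or feed a longer ranked list into an enrichment analysis.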

If you are currently annotating your cells – good luck! 🙂

Ending notes

Woohoo! You made it to the end!

In this post, I shared some insights on how to carry out cell type annotation for scRNAseq. If you read it until the end (or better still, if you are struggling like me to get those cells annotated!), you’ve probably realised it can get quite complicated. Hopefully you found some of my notes and resources useful! Don’t hesitate to leave a comment if anything is unclear, if you would like something explained differently or in more detail, or if you’re looking for more resources on cell type annotation! Your feedback is really appreciated and it helps me create more useful content :)

Squidtastic!

You made it till the end! Hope you found this post useful.

If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.

Otherwise, have a very nice day and… see you in the next one!
