A step-by-step easy R tutorial to preprocess scRNAseq data with Seurat v5

In this easy, step-by-step tutorial you will learn how a Seurat object is structured and how to preprocess scRNAseq data using the standard workflow with Seurat v5.

This is a hands-on continuation of my previous workflow on my easy explanation of a Seurat object structure. So if you haven’t yet, check out my other blogpost to understand how Seurat objects are structured and what types of data we usually handle when preprocessing single-cell data.

In this tutorial, we will learn how to:

  • navigate a Seurat object’s structure, particularly in Seurat version 5
  • understand how the different steps in a standard workflow (normalisation, scaling, dimensionality reduction, nearest neighbours…) create different slots and layers in the Seurat object, and how to access the data
  • use common functions and useful plots in scRNAseq analysis

We will use the R packages:

  • tidyverse
  • Seurat

If you are more of a video-based learner, you can check out my Youtube video on Seurat v5 structure and easy single-cell tutorial in R. Otherwise, just keep reading! And remember you can just copy the code by clicking on the top right corner of the code snippets in this post. Or check out the script in Biostatsquid’s github!

Let’s dive in!

Check out my Youtube tutorial on Seurat structure and easy single-cell tutorial with Seurat v5 in R!

Before you start…

We will use a script to process single-cell RNA-seq data using Seurat in R. It starts by cleaning up the R environment and loading all the necessary libraries, like Seurat.

# ======================== # 
# Seurat_object.R
# ======================== # 

# Setting up environment ===================================================
# Clean environment
rm(list = ls(all.names = TRUE)) # will clear all objects including hidden objects
gc() # free up memory and report the memory usage
options(max.print = .Machine$integer.max, scipen = 999, stringsAsFactors = F, dplyr.summarise.inform = F) # avoid truncated output in R console and scientific notation

# Loading relevant libraries 
library(Seurat)
library(SeuratData) # we'll be using a dataset from seurat data https://github.com/satijalab/seurat-data 
library(tidyverse)
library(pheatmap)
library(patchwork)
# Install dataset. only need to run this once..
#InstallData("pbmc3k")

We’ll be working with a built-in dataset called pbmc3k which can be installed from the package SeuratData. It is a classic dataset of peripheral blood mononuclear cells. But feel free to use your own dataset for this tutorial! 

As always, you can find the code I am using in this tutorial at biotatsquid’s github page.

Squidtip

Remember that we will not be covering other important preprocessing steps like doublet detection or filtering for low quality cells and lowly expressed genes. The aim is to better understand what a Seurat object looks like and become familiar with the different slots and layers where the data is stored!

Read in the data

Let’s load our sample dataset.

# Read in data ===================================================
pbmc <- LoadData("pbmc3k")

Great! You’ll see your seurat object loaded as “pbmc” in the Environment window of your RStudio interface.

Once the data’s loaded, we take a quick look at the raw and unnormalized counts. Notice that when we first create a Seurat object, both the counts and data layers have the same values – I believe this is so that other Seurat functions work if they try to pull data from the data layer, but once we normalise the counts, these values will be substituted be real normalised counts.

# Counts contains the raw counts
pbmc@assays$RNA$counts[c("CD3D", "TCL1A", "MS4A1"), 1:5]
# data should contain the normalised counts (but we haven't normalised yet!)
pbmc@assays$RNA$data[c("CD3D", "TCL1A", "MS4A1"), 1:5]

Explore the metadata

Let’s have a look at the metadata. The metadata slot refers to cell information. You can see each row has a cell ID. For now, there’s a few default columns:

  • nCount_RNA which is the number of reads in that cell,
  • nFeature_RNA, which is the number of unique genes that were detected in that cell.

In this case, there’s an additional column called seurat_annotations which has the cell type annotations. If you are using your own data, you will have to annotate your cells yourself, so this column won’t be present.

Let’s add some useful columns to the metadata. One of the first steps in a scRNAseq pipeline is to add some quality control metrics — specifically the percentage of mitochondrial and ribosomal gene content per cell — which can tell us about cell viability or potential technical noise.

# Metadata ======================================================
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc@meta.data %>% head()
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc$percent.ribo <- PercentageFeatureSet(pbmc, pattern = "^RB-")
pbmc@meta.data %>% head()

Seurat standard workflow: normalisation, highly variable genes, and scaling

Next comes the standard Seurat workflow. First, we normalize the data to adjust for things like sequencing depth. Let’s have a look if our normalised gene counts have been updated.

Next comes the standard Seurat workflow. First, we normalize the data to adjust for things like sequencing depth. Let’s have a look if our normalised gene counts have been updated.

# Standard Seurat workflow ======================================================

## Normalisation and scaling -------------------
# Normalised counts are stored in the "data" layer
pbmc <- NormalizeData(object = pbmc, vars.to.regress = 'percent.mt')

While processing our single cell data, we often want to visualise our cells. Common dimensionality reduction techniques are PCA, UMAP and t-SNE. Before we do that, it is recommended that we identify the most variable genes in our dataset, which differ the most across cells and are likely to be biologically meaningful. Highly variable genes are the most informative genes as they will help us identify different cell states or types.

The function FindVariableFeatures() in the Seurat package aims to identify genes that are more variable across cells than would be expected by chance, and thus are likely to be biologically significant. It helps to focus the analysis on genes that contribute the most to the observed heterogeneity between cells. I won’t go into the details of it, but there are different methods to calculate the variable genes. The most common and default one is vst, which uses raw counts from our RNA assay to compute neighbours, but you can also use svp or disp to find variable genes, which take the normalised counts from the “data” layer in the RNA assay.

Squidtip

The default method of FindVariableFeatures is “vst” which uses raw counts to calculate the highly variable genes. You can change it with the argument “method”. Run ?FindVariableFeatures to find out more!

pbmc <- FindVariableFeatures(object = pbmc)
pbmc@assays$RNA@meta.data[21:31,]

When we run FindVariableGenes, the function will store the variable genes in the gene metadata. Remember that anything having to do with genes or counts is stored in the slot “assays”. Similar to cell metadata, the gene metadata is essentially a dataframe, where each row has information about a gene. Since we run FindVariableGenes() using the vst method, we have results for that method.

# This is what you may see in your console:
> pbmc@assays$RNA@meta.data[21:31,]
   vf_vst_counts_mean vf_vst_counts_variance vf_vst_counts_variance.expected vf_vst_counts_variance.standardized
21        0.024074074            0.024985248                     0.030692787                           0.8140430
22        0.023703704            0.024632443                     0.030139361                           0.8172848
23        0.094814815            0.674963841                     0.173716946                           1.4894346
24        0.028888889            0.028805731                     0.038086571                           0.7563225
25        0.001851852            0.001849107                     0.001921811                           0.9621691
26        0.001851852            0.001849107                     0.001921811                           0.9621691
27        0.368888889            0.537453378                     0.977788760                           0.5496621

Finding variable genes involves calculating the mean and variance of the expression of each gene. We can also see a column that contains true or false depending on if the gene is expressed or not, as well as the ranking.

How many variable genes do we have? By default, the FindVariableFeatures() will return the top 2000 most variable genes, but you can change this in the function itself if needed.

table(pbmc@assays$RNA@meta.data$vf_vst_counts_variable) # How many variable genes? This can be changed in FindVariableFeatures()
pbmc@assays$RNA@meta.data$var.features %>% tail(50)

You can also visualise your variable features with the function VariableFeaturePlot(). 

# You can also visualise them
VariableFeaturePlot(pbmc) 
  • Each point represents a gene.

  • X-axis: Average expression of each gene across all cells (on a log scale).

  • Y-axis: The variance-to-mean ratio or a similar metric (often standardized), showing how much a gene’s expression varies across cells.

  • Highlighted points: The top variable genes selected as “highly variable features.” These are often used for downstream steps like PCA, clustering, and UMAP.

After that, we scale the data so everything’s on the same footing for downstream analysis. Scaled, normalised counts are stored in the “scale.data” layer, as we already know.

# Scaled, normalised counts are stored in the "scale.data" layer
# By default ScaleData() will scale the normalised counts in the "data" layer.
# If you haven't normalised then it will use raw counts.
pbmc <- ScaleData(object = pbmc)

Let’s just compare the raw, normalised and scaled counts for a few genes.

> # layer counts contains the raw counts
> pbmc@assays$RNA$counts[c("CD8A", "TCL1A", "MS4A1"), 20:25]
3 x 6 sparse Matrix of class "dgCMatrix"
      AAAGCCTGTATGCG AAAGGCCTGTCTAG AAAGTTTGATCACG AAAGTTTGGGGTGA AAAGTTTGTAGAGA AAAGTTTGTAGCGT
CD8A               .              1              .              .              .              .
TCL1A              .              .              1              .              .              .
MS4A1              .             36              1              2              .              .
> # layer data contains the normalised counts 
> pbmc@assays$RNA$data[c("CD8A", "TCL1A", "MS4A1"), 20:25]
3 x 6 sparse Matrix of class "dgCMatrix"
      AAAGCCTGTATGCG AAAGGCCTGTCTAG AAAGTTTGATCACG AAAGTTTGGGGTGA AAAGTTTGTAGAGA AAAGTTTGTAGCGT
CD8A               .       1.102225       .              .                     .              .
TCL1A              .       .              2.184526       .                     .              .
MS4A1              .       4.295800       2.184526       1.959489              .              .
> # layer scale.data contains the normalised counts 
> pbmc@assays$RNA$scale.data[c("TCL1A", "MS4A1"), 20:25]
      AAAGCCTGTATGCG AAAGGCCTGTCTAG AAAGTTTGATCACG AAAGTTTGGGGTGA AAAGTTTGTAGAGA AAAGTTTGTAGCGT
TCL1A     -0.3174204     -0.3174204       2.347952     -0.3174204     -0.3174204     -0.3174204
MS4A1     -0.4099719      4.5732561       2.124129      1.8630813     -0.4099719     -0.4099719

Why is CD8A missing from our scaled data? Looks like it’s not a highly variable gene! By default, ScaleData() only scales the highly variable genes, if you’ve previously run FindVariableFeatures(). This is mostly for efficiency and noise reduction — you’re typically interested in the genes that show meaningful differences across cells. Scaling everything can slow things down and introduce unnecessary noise into downstream steps like PCA.

If you want to scale all genes, use “all” instead of “hgv” inside the ScaleData() function.

And talking about PCA… time to do some dimensionality reduction on our dataset!

Seurat standard workflow: PCA and nearest neighbours

What is dimensionality reduction and why do we need it? As explained in my previous blogpost, we can plot our cells based on the expression of two genes. For example, here we see cells that express high CD3D, a known T cell marker, and MS4A1, a known B cell marker. So we could suspect these cells being T cells, and these cells being B cells. However, cell type identities rarely depend on just 2 genes. We need a way to account for the gene expression profile as a whole. That’s why we use dimensionality reduction techniques like PCA. Easy!

PCA simplifies the dataset while keeping most of the important variation. Next, we can run PCA on our scaled data which will create the slot “pca” inside “reductions”. The coordinates of the PCA scores for each cell are stored in “cell.embeddings” and loadings will be stored in “feature.loadings”. As you know, PCA reduces the dimensionality of our dataset, so choosing the top PCs will explain a good % of variance in our dataset. Check out my other blogpost on PCA if you’re interested in knowing more! The question is, how many components should we choose to include? 10? 20? 100? One way to do it is by generating ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each one.

## PCA -------------------
pbmc <- RunPCA(object = pbmc)
pbmc@reductions$pca %>% head()
#  In this example, we can observe an ‘elbow’ around PC9-10, 
# suggesting that the majority of true signal is captured in the first 10 PCs.
ElbowPlot(pbmc)

In this example, we can observe an ‘elbow’ around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs although in larger datasets it’s usually more like 50 or 100. You have more tips on how to identify the dimensionality of the dataset in the Seurat webpage. And if you’re using SCTransform, which we will cover in a bit, the value doesn’t impact the results much.

Next, we’ll compute nearest neighbours based on those top principal components and use it to cluster the cells. The FindNeighbors() function in Seurat is used to compute the nearest neighbours of cells based on their gene expression profiles and store the data in the “graphs” slot. RNA_nn stands for nearest neighbours, RNA_snn stands for shared nearest neighbours.

We can take a look at the kNN and SNN graphs.

The kNN graph is a matrix where every connection between cells is represented as 1s. This is called a unweighted graph (default in Seurat). Let’s have a look at how this would look by plotting the first 200 cells. The black spots on the heatmap represent 1s – those cells are connected.

In the SNN graph the connections between cells are weighted by how many neighbours they share. This is called a weighted graph. If we plot the heatmap again, now we don’t have discrete 0s and 1s, but actually a scale of values from 0 to 1. In this case, 0 is white, so no connection at all, and then there’s a colour scale that goes through dark red to black, with black being the strongest connection, as you can see the darkest points are in the diagonal which is each cell with itself.

## Nearest neighbours --------------
# Find nearest neighbours for each cell
pbmc <- FindNeighbors(object = pbmc, dims = 1:10)
names(pbmc@graphs) # Where is the data stored?
 
# unweighted graph
pheatmap(pbmc@graphs$RNA_nn[1:200, 1:200],
         col = c("white", "black"), border_color = "grey90", main = "KNN graph",
         legend = F, cluster_rows = F, cluster_cols = F, fontsize = 2
)

# weighted graph
pheatmap::pheatmap(pbmc@graphs$RNA_snn[1:200, 1:200],
         col = colorRampPalette(c("white", 'darkred', "black"))(100),
         border_color = "grey90", main = "SNN graph",
         legend = F, cluster_rows = F, cluster_cols = F, fontsize = 2
)
Squidtip

Both weighted and unweighted graphs are suitable for clustering, but clustering on unweighted graphs is faster for large datasets (> 100k cells).

Seurat standard workflow: clustering and UMAP

Once the graph is built, we can now perform graph clustering. We’ve computed the distances between cells, now we can decide which cells are part of a cluster. We can group or cluster the cells at different resolutions. The higher the resolution, the higher number of clusters. Usually, you should test different resolutions and see which one makes more sense for your data and your purposes.

In Seurat, the function FindClusters() will do a graph-based clustering using “Louvain” algorithm by default (algorithm = 1). To use the Leiden algorithm, you need to set it to algorithm = 4. See ?FindClusters for additional options. By default, we will use the SNN graph we created in the previous step.

Ok, so we’ve clustered our cells based on their overall gene expression profile. Now we want to visualise them in a plot! For that we’ll use a dimensionality reduction algorithm which reduces the data to two dimensions we can plot. My algorithm of choice is UMAP, but I will run both UMAP and t-SNE in this tutorial.

## Clustering --------------
?FindClusters 
# Clustering with louvain (algorithm 1) and a few different resolutions
pbmc <- FindClusters(pbmc, graph.name = "RNA_snn", resolution = c(0.1, 0.25, .5, 1, 1.5, 2), algorithm = 1)

# Where are the clusters stored?
# Each time you run clustering, the data is stored in metadata columns:
# seurat_clusters - lastest results only
# RNA_snn_res.XX - for each different resolution you test.
colnames(pbmc@meta.data)
head(pbmc@meta.data)
unique(pbmc$RNA_snn_res.0.1)
unique(pbmc$RNA_snn_res.2)

## Dimensionality reduction ------------
pbmc <- RunUMAP(object = pbmc, dims = 1:10)
pbmc <- RunTSNE(object = pbmc, dims = 1:10)
names(pbmc@reductions)
pbmc@reductions$umap@cell.embeddings %>% head()

We’ll run UMAP and t-SNE using the first 10 principal components. The data is stored in the reductions slot. You can access the cell embeddings, which are UMAP coordinates, or just plot them with the function DimPlot()

Ok, so this is an important step. Visualising all cells in a plot like this one helps us see how well the clusters separate and whether the patterns make biological sense.

Let’s compare the different resolutions we computed. We see how the higher the resolution, the more granularity we get. There’s no right or wrong here, you would choose the resolution which best fits the cell populations you want to identify. For example, if you are looking to distinguish between myeloid cells, T cell and B cells, you might go with resolution 0.1, because it’s enough to identify these higher-level cell type annotations. If you’d like to distinguish CD14+ from FCGR3A+ monocytes, you might need a higher resolution that further divides this cluster 1 in resolution 0.1, into clusters 1 and 3 in resolution 0.25.

# Plot the UMAP
DimPlot(object = pbmc, reduction = "umap") 
DimPlot(object = pbmc, reduction = "umap") + theme_bw() + 
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank())
DimPlot(object = pbmc, reduction = "umap", group.by = 'RNA_snn_res.0.1') + 
  theme_bw() + 
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank())
DimPlot(object = pbmc, reduction = "umap", 
        group.by = c('RNA_snn_res.0.1', 'RNA_snn_res.0.25', 'RNA_snn_res.0.5', 
                     'RNA_snn_res.1', 'RNA_snn_res.2')) & 
  theme_bw() &
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank())

DimPlot(object = pbmc, reduction = "umap", 
        group.by = c('RNA_snn_res.0.1', 'RNA_snn_res.0.25', 'RNA_snn_res.0.5', 
                     'RNA_snn_res.1', 'RNA_snn_res.2', 'seurat_annotations')) & 
  theme_bw() &
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank())

We’re going to cheat a little bit, and use Seurat’s annotations for this dataset, which if you remember, are stored in the metadata column “seurat_annotations”. If we wanted to achieve similar-level annotations in this dataset, so 9 different cell types, we would go with a resolution of something between 0.5 (8 clusters) and 1 (10 clusters). That is not to say that we might need a higher resolution if we wanted to distinguish between activated and naive CD8+ T cells, for example. Or lower one if we didn’t care too much about the subtypes of T cells.

Ok, so if you’re analysing your own data, you’ll have to annotate the cells yourself.

How do we do that?

First, a useful Seurat function to know is DotPlot() and FeaturePlot() which we can use to check for the expression of certain markers. You can see, for example, that cluster 1 using resolution 0.5 has a high expression of CD14, and in fact, they correspond to CD14+ monocytes.

Another way of visualising markers is directly on the UMAP plot, which you can do with FeaturePlot(). I won’t go into the details of these functions, so check out the documentation if you’d like to know more.

# Check for the expression of different markers
Idents(pbmc) <- 'RNA_snn_res.0.5'
p1 <- DotPlot(pbmc, c("CD8A", "FCGR3A", "MS4A1", "CD14"))
p2 <- FeaturePlot(pbmc, c("CD8A", "FCGR3A", "MS4A1", "CD14"))
p1 | p2

Checking for the expression of certain markers can be useful, but when you are trying to annotate your clusters of cells, it might make more sense to find markers for each cluster that distinguish it from the others. In other words, you’re doing differential gene expression.

There’s different ways of doing this and I encourage you to read Seurat’s documentation, but I’ll just run through the very basics of it here.

Finding markers per cluster

Ok, so first, we set the identity for the Seurat object. This is the column of the metadata that contains the identity of your groups of interest. For example, if you wanted to compare B versus T cells, then we would use the column with seurat annotations. But we’re assuming that we don’t have the annotations yet, so we will just decide on a resolution and compute markers for each of our clusters.

I’m going to go with the lowest resolution, which identified 4 clusters, for simplicity’s sake. We can then compute markers for each cluster versus the rest with FindAllMarkers().

## DGE analysis -------------------------------
# Set the active identity 
Idents(pbmc) <- 'RNA_snn_res.0.1'
DimPlot(object = pbmc, reduction = "umap")
# DGE all clusters vs rest
dge_all <- FindAllMarkers(pbmc)
colnames(dge_all)
head(dge_all)

The result is stored in dge_all which is a dataframe. We’ll have a quick look at the top positive differentially expressed genes per cluster, so genes that are upregulated per cluster compared to the others. You can use these genes to then annotate your clusters.

dge_all %>% group_by(cluster) %>%
  arrange(desc(avg_log2FC), p_val) %>%
  slice_head(n = 5) %>%
  print(n = 50)

You should get something like this:

# A tibble: 20 × 7
# Groups:   cluster [4]
      p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene        
      <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>       
 1 4.68e- 6       6.53 0.014 0      6.42e- 2 0       C2orf40     
 2 6.31e- 5       5.96 0.011 0      8.66e- 1 0       RP11-161M6.2
 3 1.70e- 6       5.77 0.017 0.001  2.33e- 2 0       FAM153B     
 4 1.04e-11       5.44 0.035 0.001  1.43e- 7 0       KRT1        
 5 8.94e- 7       5.06 0.018 0.001  1.23e- 2 0       GPR146      
 6 1.26e-38       9.81 0.083 0      1.73e-34 1       C19orf59    
 7 1.41e-40       9.70 0.087 0      1.94e-36 1       FCGR1B      
 8 4.38e-34       9.22 0.073 0      6.01e-30 1       OLIG1       
 9 5.49e-11       9.06 0.021 0      7.53e- 7 1       AP001189.4  
10 1.69e-31       8.96 0.067 0      2.31e-27 1       RP11-65J3.1 
11 1.67e- 9       9.07 0.016 0      2.29e- 5 2       YES1        
12 4.00e-14       8.31 0.025 0      5.48e-10 2       SLC1A7      
13 8.10e-12       7.84 0.021 0      1.11e- 7 2       LIM2        
14 2.42e- 8       7.75 0.014 0      3.32e- 4 2       COLGALT2    
15 1.16e-10       7.64 0.018 0      1.59e- 6 2       BIRC5       
16 7.14e-27       9.38 0.049 0      9.79e-23 3       FAM177B     
17 6.91e-24       8.72 0.043 0      9.48e-20 3       RP11-16E12.2
18 4.77e-31       8.57 0.06  0      6.54e-27 3       MACROD2     
19 6.24e-12       7.99 0.02  0      8.55e- 8 3       PKHD1L1     
20 2.11e-29       7.95 0.06  0.001  2.89e-25 3       KCNG1 

I will briefly mention that you can also compare one cluster versus another by using FindMarkers(). For example, if we’re wondering what are the differences between clusters 0 and 2,  we can run FindMarkers() to get the differentially expressed genes. The results is a similar dataframe with the gene IDs, p-value, average log2 fold-change… The genes with higher positive log2 fold-change are upregulated in cluster 0 versus cluster 2. Genes with more negative log2 fold-change are downregulated in cluster 0 versus cluster 2, or seeing it the other way, their upregulated in cells from cluster 2 compared to cells from 0.

# Or compare one cluster against another
DimPlot(pbmc)
dge_0vs2 <- FindMarkers(pbmc, ident.1 = '0', ident.2 = '2')
colnames(dge_0vs2)
dge_0vs2 %>% arrange(desc(avg_log2FC), p_val) %>% head(10)

So obviously this is very time-consuming. There’s many tools out there that have been developed for automatic cell type annotation which help with this task. Some use markers, some use reference datasets that have already been annotated. And you can use those tools to get an automatic annotation of your cells. For example, you can find an overview of cell type annotation tools here as well as tutorials on scType and SingleR here.

So image we’ve annotated our cell types, and we have them in the column Seurat annotations. It’s always good to check if the annotations make sense by checking the expression of known markers for that cell type. And of course you can also compare one cell type versus another to do differential gene expression analysis.

# Once you have your annotations, you can compare one cell type against another
pbmc$seurat_annotations %>% unique()
Idents(pbmc) <- 'seurat_annotations'
DimPlot(object = pbmc, reduction = "umap")
dge_CD4naivevsmem <- FindMarkers(pbmc, ident.1 = 'Naive CD4 T', ident.2 = 'Memory CD4 T')
head(dge_CD4naivevsmem)

So we’ve covered the main steps in a single cell pipeline. Great!

If you’ve used Seurat before, you might have heard of sctransform. It’s a specific type of normalisation tailored to scRNAseq data which accounts for technical noise and variability across cells. This improves common downstream analyses like variable gene selection, dimensional reduction, and differential expression. You can read more about it in the Seurat page.

SCTransform

If you would like to know more about SCTransform, check out my blogpost with a simple explanation of SCTransform. 

When we run sctransform on our Seurat object, this will create a new assay (named SCT by default) with the transformed data. Counts are (corrected) counts, data is log1p(counts), scale.data has the Pearson residuals. Importantly, this is set as the default assay after running sctransform. So if you want to use your log normalised counts, you will have to specify assay = “RNA”.

Let’s run sctransform. Note that this single command replaces NormalizeData()ScaleData(), and FindVariableFeatures() so you do not need to run them if you use SCTransform. During normalization, we can also remove confounding sources of variation, for example, mitochondrial mapping percentage.

We then follow the same steps as before: compute PCA, FindNeighbours and get our UMAP projections.

# SCTransform ==================================
pbmc <- SCTransform(object = pbmc, vars.to.regress = 'percent.mt')
pbmc <- RunPCA(object = pbmc, dims = 1:10)
pbmc <- FindNeighbors(object = pbmc, dims = 1:30)
pbmc <- FindClusters(pbmc, resolution = c(0.1, 0.25, .5, 1, 1.5, 2))
pbmc <- RunUMAP(object = pbmc, dims = 1:30)
DimPlot(object = pbmc, reduction = "umap", group.by = "SCT_snn_res.0.1")
colnames(pbmc@meta.data)
# layer counts contains the raw counts
pbmc@assays # Now we have 2 assays!

Now, if we check the assays, we should have two of them: RNA, and SCT. Each of them has different layers for raw, normalised and scaled gene expression values. If you run SCT, it will be the default assay in your Seurat object, so make sure to explicitly set the RNA assay as default if you need to use the original counts instead of the SCT-corrected ones.

# The raw, normalised and scaled counts are different in the SCTransform assay.
pbmc@assays$RNA$counts[c("CD8A", "TCL1A", "MS4A1"), 20:25]
pbmc@assays$SCT$counts[c("CD8A", "TCL1A", "MS4A1"), 20:25]
# layer data contains the normalised counts 
pbmc@assays$RNA$data[c("CD8A", "TCL1A", "MS4A1"), 20:25]
pbmc@assays$SCT$data[c("CD8A", "TCL1A", "MS4A1"), 20:25]
# layer scale.data contains the normalised counts 
pbmc@assays$RNA$scale.data[c("TCL1A", "MS4A1"), 20:25]
pbmc@assays$SCT$scale.data[c("TCL1A", "MS4A1"), 20:25]

Before we wrap up, I should mention that if you’re working with more than one sample, you will need to integrate them into a single Seurat object. Seurat integration is a powerful framework designed to align and combine multiple single-cell datasets—such as from different batches, conditions, technologies, or even species—into a shared space. It works by identifying “anchors” (pairs of similar cells across datasets) using mutual nearest neighbours, then uses these anchors to correct for batch effects and enable joint analyses. Seurat has an introduction to integration here. If you’d like to see a video on Seurat integration do let me know in the comments below!

But otherwise, we’ll leave that story for another day! Squidtastic!

sessionInfo()

Check my sessionInfo() here in case you have trouble reproducing my steps:

> sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=English_Ireland.utf8  LC_CTYPE=English_Ireland.utf8    LC_MONETARY=English_Ireland.utf8
[4] LC_NUMERIC=C                     LC_TIME=English_Ireland.utf8    

time zone: Europe/Dublin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.3.0         pheatmap_1.0.12         lubridate_1.9.4         forcats_1.0.0          
 [5] stringr_1.5.1           dplyr_1.1.4             purrr_1.0.4             readr_2.1.5            
 [9] tidyr_1.3.1             tibble_3.2.1            ggplot2_3.5.1           tidyverse_2.0.0        
[13] pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2.9002   Seurat_5.2.1            SeuratObject_5.0.2     
[17] sp_2.2-0               

loaded via a namespace (and not attached):
  [1] deldir_2.0-4           pbapply_1.7-2          gridExtra_2.3          rlang_1.1.5           
  [5] magrittr_2.0.3         RcppAnnoy_0.0.22       matrixStats_1.5.0      ggridges_0.5.6        
  [9] compiler_4.4.2         spatstat.geom_3.3-6    png_0.1-8              vctrs_0.6.5           
 [13] reshape2_1.4.4         crayon_1.5.3           pkgconfig_2.0.3        fastmap_1.2.0         
 [17] labeling_0.4.3         promises_1.3.2         tzdb_0.5.0             jsonlite_1.9.0        
 [21] goftest_1.2-3          later_1.4.1            spatstat.utils_3.1-3   irlba_2.3.5.1         
 [25] parallel_4.4.2         cluster_2.1.6          R6_2.6.1               ica_1.0-3             
 [29] stringi_1.8.4          RColorBrewer_1.1-3     spatstat.data_3.1-6    reticulate_1.41.0.1   
 [33] parallelly_1.42.0      spatstat.univar_3.1-2  lmtest_0.9-40          scattermore_1.2       
 [37] Rcpp_1.0.14            tensor_1.5             future.apply_1.11.3    zoo_1.8-13            
 [41] sctransform_0.4.1      timechange_0.3.0       httpuv_1.6.15          Matrix_1.7-1          
 [45] splines_4.4.2          igraph_2.1.4           tidyselect_1.2.1       rstudioapi_0.17.1     
 [49] abind_1.4-8            spatstat.random_3.3-3  codetools_0.2-20       miniUI_0.1.1.1        
 [53] spatstat.explore_3.4-2 listenv_0.9.1          lattice_0.22-6         plyr_1.8.9            
 [57] withr_3.0.2            shiny_1.10.0           ROCR_1.0-11            Rtsne_0.17            
 [61] future_1.34.0          fastDummies_1.7.5      survival_3.7-0         polyclip_1.10-7       
 [65] fitdistrplus_1.2-2     pillar_1.10.1          KernSmooth_2.23-24     plotly_4.10.4         
 [69] generics_0.1.3         RcppHNSW_0.6.0         hms_1.1.3              munsell_0.5.1         
 [73] scales_1.3.0           globals_0.16.3         xtable_1.8-4           glue_1.8.0            
 [77] lazyeval_0.2.2         tools_4.4.2            data.table_1.17.0      RSpectra_0.16-2       
 [81] RANN_2.6.2             dotCall64_1.2          cowplot_1.1.3          grid_4.4.2            
 [85] colorspace_2.1-1       nlme_3.1-166           cli_3.6.4              rappdirs_0.3.3        
 [89] spatstat.sparse_3.1-0  spam_2.11-1            viridisLite_0.4.2      uwot_0.2.3            
 [93] gtable_0.3.6           digest_0.6.37          progressr_0.15.1       ggrepel_0.9.6         
 [97] htmlwidgets_1.6.4      farver_2.1.2           htmltools_0.5.8.1      lifecycle_1.0.4       
[101] httr_1.4.7             mime_0.12              MASS_7.3-61

And that is the end of this tutorial!

This is a very simple workflow for preprocessing scRNAseq data with Seurat v5, as well as understanding Seurat v5 structure.

Before you go, you might want to check:

Squidtastic!

You made it till the end! Hope you found this post useful.

If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.

Otherwise, have a very nice day and… see you in the next one!

Squids don't care much for coffee,

but Laura loves a hot cup in the morning!

If you like my content, you might consider buying me a coffee.

You can also leave a comment or a 'like' in my posts or Youtube channel, knowing that they're helpful really motivates me to keep going:)

Cheers and have a 'squidtastic' day!

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *