SCTransform (Single-Cell Transform) is a normalization method primarily used in scRNA-seq data analysis. It was developed to address limitations in standard normalization approaches when dealing with single-cell data.
You can check how to apply SCTransform on your single-cell data in Seurat’s tutorial for SCTransform.
In this post, we will cover:
- what SCTransform is
- how it differs from standard normalisation and scaling
- how it works
So if you are ready… let’s dive in!
What is SCTransform?
SCTransform is an advanced normalization and transformation method specifically designed for single-cell RNA sequencing data. It is an alternative to traditional methods like log-normalization and scaling. The method is implemented in the Seurat package and uses a statistical model to perform normalization, accounting for both technical noise and biological variation in a more sophisticated way than simpler normalization techniques.
In a nutshell, SCTransform:
- uses a regularized negative binomial regression model to normalize gene expression data
- accounts for both sequencing depth (library size) and gene-specific effects simultaneously
- applies variance stabilization transformation to make the data more suitable for downstream analysis
- automatically removes technical variation while preserving biological variation
Squidtip
SCTransform has become particularly popular in the Seurat package for single-cell analysis, as it often produces more robust results than previous normalization methods, especially when dealing with complex or noisy single-cell datasets.
Why use SCTransform versus standard methods?
Before we dig into the differences between SCTransform and standard methods, let’s have a look at what standard methods are.
Typically, sequencing data is both normalised and scaled.
Standard normalization methods (like log-normalization):
Typically, we first normalize by library size (total UMI counts per cell) and then apply a log transformation to make the counts more “human-friendly” (as otherwise they would be really small numbers).
In scRNAseq, this standard normalisation doesn’t effectively handle the relationship between mean expression and variance. Moreover, it can struggle with zero-inflation (many genes have zero counts in many cells) which is very common in scRNAseq datasets.
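For reference, this is roughly what the standard approach looks like in Seurat: a minimal sketch using NormalizeData with its default "LogNormalize" method and scale factor.

# Standard log-normalization: for each cell, counts are divided by the cell's total
# counts, multiplied by a scale factor (10,000 by default), then log1p-transformed
seurat_object <- NormalizeData(seurat_object,
                               normalization.method = "LogNormalize",
                               scale.factor = 10000)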
Standard scaling methods (like z-score):
Scaling usually means centering the data around the mean and scaling by standard deviation. This way, we make sure that all genes are comparable, which is important for dimensionality reduction, for example. Using z-score scaling means each gene will have a mean expression of 0 across all cells and a standard deviation of 1.
This standard approach treats all features equally regardless of their expression characteristics and doesn’t account for the count-based nature of sequencing data.
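In Seurat this corresponds to ScaleData; here is a minimal sketch, with the underlying z-score written out as a comment:

# Identify variable features, then centre each of them at mean 0 and unit variance
seurat_object <- FindVariableFeatures(seurat_object)
seurat_object <- ScaleData(seurat_object)
# Conceptually, for each gene g and cell c:
# scaled[g, c] = (expr[g, c] - mean(expr[g, ])) / sd(expr[g, ])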
Squidtip
Normalization addresses differences between cells by adjusting for sequencing depth or other technical variations, ensuring that gene expression values are comparable across cells.
Scaling addresses differences between genes by adjusting for varying ranges in expression levels across genes, making sure that no single gene with very high expression dominates the analysis.
Ok, so what does SCTransform do differently?
SCTransform offers several advantages for scRNA-seq data:
- Better handles the mean-variance relationship in count data
- Effectively normalizes across cells with vastly different library sizes
- Improves detection of variable genes
- Reduces the need for explicit regression of technical covariates
- Preserves biological heterogeneity while removing technical noise
- Performs better for downstream tasks like clustering and differential expression
How does SCTransform work? (no maths explanation)
In single-cell RNA-seq data, there's a strong relationship between a gene's average expression and its variance: highly expressed genes naturally show more variability, which can skew downstream analysis.
This is the problem SCTransform tackles.
SCTransform uses a negative binomial regression where:
- Each gene count (Y) is modeled as a negative binomial distribution
- The expected count for a gene in a cell depends on:
  - The cell's sequencing depth
  - The gene's overall expression level
  - Gene-specific technical factors
The Math Steps:
- For each gene, it fits a model: log(expected count) = β₀ + β₁ × log(sequencing depth) + other factors, where β₀ captures the gene's overall expression level (a toy version of this per-gene fit is sketched after this list)
- It estimates parameters like the dispersion parameter (θ), which controls how much variance there is beyond what’s expected from the mean
- It uses a regularization approach where genes with similar expression levels share information to improve estimation
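As a toy illustration of the per-gene fit (not the actual sctransform implementation, which additionally regularizes the parameters across genes of similar abundance), one could fit a negative binomial regression for a single simulated gene against sequencing depth with MASS::glm.nb:

library(MASS)
set.seed(1)
# Simulated toy data: 100 cells with varying sequencing depth, counts for one gene
seq_depth   <- round(runif(100, 1000, 20000))                 # total UMIs per cell
gene_counts <- rnbinom(100, mu = seq_depth * 1e-4, size = 2)  # simulated counts
# Per-gene model: log(expected count) = beta0 + beta1 * log10(sequencing depth)
fit <- glm.nb(gene_counts ~ log10(seq_depth))
coef(fit)   # beta0 (intercept) and beta1 (depth effect)
fit$theta   # estimated dispersion parameter theta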
Squidtip
The SCTransform method applies a regularized negative binomial regression to model the gene expression data. For each gene, we get an estimated “real expression” – it’s a corrected gene expression value that takes into account factors like the cell’s sequencing depth, the gene’s overall expression level and gene-specific technical factors.
After fitting the model, SCTransform calculates Pearson residuals:
Wait, what are Pearson residuals?
Pearson residuals are a way to measure how much each data point deviates from what we’d expect based on our model. The basic formula:
Pearson residual = (Observed value – Expected value) / √(Expected variance)
What does this mean?
- If a gene has exactly the expression level we’d expect based on the model → residual = 0
- If a gene is expressed higher than expected → positive residual
- If a gene is expressed lower than expected → negative residual
So, if we go back to our scRNA-seq example, a residual is defined as:
Residual = (observed count – expected count) ÷ √(expected variance)
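For example, with made-up numbers: under the negative binomial model the expected variance is μ + μ²/θ, so if the model expects 4 counts of a gene in a cell with θ = 2 and we actually observe 10, the residual works out as follows:

observed <- 10
mu       <- 4    # expected count predicted by the model
theta    <- 2    # dispersion parameter
expected_variance <- mu + mu^2 / theta       # NB variance: 4 + 16/2 = 12
(observed - mu) / sqrt(expected_variance)    # ~1.73: higher than expected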
These residuals become our normalised values and they’re useful for several reasons:
- Standardized scale: All genes end up on the same scale regardless of their expression level
- Variance stabilization: Unlike raw counts, these residuals have roughly the same variance across all expression levels. The transformation ensures that genes with different expression levels can be compared fairly – the variance no longer depends on the mean expression level.
- Interpretation: Positive residuals mean higher-than-expected expression; negative means lower-than-expected. A residual of +2 means “this gene is expressed about 2 standard deviations higher than expected”.
So, in practice for single-cell data:
- The model predicts how many counts we expect for each gene in each cell based on technical factors
- The residual tells us whether actual expression is higher or lower than this technical baseline
- This helps separate biological signal (what we care about) from technical noise.
Think of it as “normalizing” the data while accounting for the unique characteristics of gene expression counts rather than just dividing by something like the total counts in each cell.
The Pearson residuals are useful for downstream processing (PCA, neighbor finding, dimensionality reduction, integration… but not differential expression). For differential gene expression analysis, the Seurat team recommends using corrected counts (stored in the counts layer of the SCT assay, as explained below). A vignette with this exact workflow is here.
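As a minimal sketch of that recommended workflow (assuming a recent Seurat version that provides PrepSCTFindMarkers; the cluster labels are purely illustrative):

# Re-correct counts across SCT models before running differential expression
seurat_object <- PrepSCTFindMarkers(seurat_object)
# Differential expression on the SCT assay, which draws on the corrected counts
markers <- FindMarkers(seurat_object, assay = "SCT",
                       ident.1 = "cluster1", ident.2 = "cluster2")  # illustrative idents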
To visualise gene expression, we use the scale.data layer of the SCT assay, except for visualisations of differential expression results such as FeaturePlot or VlnPlot, where it makes more sense to show corrected counts rather than Pearson residuals. See this post for more information!
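For example, a quick sketch of plotting a gene from the SCT assay (the gene name is just a placeholder):

DefaultAssay(seurat_object) <- "SCT"
# FeaturePlot and VlnPlot pull from the data layer (log1p of corrected counts) by default
FeaturePlot(seurat_object, features = "GENE_OF_INTEREST")
VlnPlot(seurat_object, features = "GENE_OF_INTEREST")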
How SCTransform is applied and stored in a Seurat object
You can run SCTransform on a Seurat object using:
seurat_object <- SCTransform(seurat_object)
What gets computed:
- The model fits parameters for each gene
- Pearson residuals are calculated for each gene in each cell
- Variance of each gene is estimated
Where is what?
The Pearson residuals are stored in a new “SCT” assay within the Seurat object. You can access this with seurat_object[["SCT"]].
The raw count data remains untouched in the “RNA” assay. By default, subsequent analyses (PCA, UMAP, clustering) will use the SCT assay.
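For instance, a typical post-SCTransform workflow might look like this (the number of dimensions is just illustrative):

seurat_object <- SCTransform(seurat_object)
# Downstream steps run on the SCT assay by default
seurat_object <- RunPCA(seurat_object)
seurat_object <- FindNeighbors(seurat_object, dims = 1:30)
seurat_object <- FindClusters(seurat_object)
seurat_object <- RunUMAP(seurat_object, dims = 1:30)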
The SCT assay will now have 3 layers:
seurat_object[["SCT"]]@counts
contains the corrected counts that have been adjusted for sequencing depth. Note that Seurat uses pearson residuals for all downstream tasks and not the corrected counts.seurat_object[["SCT"]]@data
containslog1p(corrected counts).
This is the layer that should be used for visualization and differential expression analysis.seurat_object[["SCT"]]@scale.data
contains scaled Pearson residuals (only for variable features), which are centered and scaled further for dimensional reduction.
Additionally:
Variable features selected by SCTransform are found in: seurat_object[["SCT"]]@var.features
Model parameters are stored within the SCT assay itself, e.g. seurat_object[["SCT"]]@SCTModel.list in recent Seurat versions.
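A minimal sketch of how one might access these pieces (using the slot-style accessors that match the v4-style notation above):

# Pull out each layer of the SCT assay
corrected_counts <- GetAssayData(seurat_object, assay = "SCT", slot = "counts")
log_counts       <- GetAssayData(seurat_object, assay = "SCT", slot = "data")
residuals        <- GetAssayData(seurat_object, assay = "SCT", slot = "scale.data")
# Variable features selected by SCTransform
head(VariableFeatures(seurat_object, assay = "SCT"))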
And that is the end of this tutorial!
In this post, I explained what SCTransform is and how it works, with no math. Hope you found it useful!
Squidtastic!
You made it till the end! Hope you found this post useful.
If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.
Otherwise, have a very nice day and… see you in the next one!