Comparing Gene Expression Across Integrated Single-Cell Datasets: NormalizeData(), SCTransform()

One of the most common questions in single-cell RNA sequencing analysis is deceptively straightforward: how many cells express a given gene, and are the expression levels truly comparable across datasets? If you have ever stared at your results wondering whether your normalization is hiding real biology — or whether you are genuinely comparing apples to apples — you are not alone. When comparing gene expression across integrated datasets, the devil is very much in the details of how you handle the math.

In this blogpost, we’ll cover NormalizeData() vs SCTransform() and how to compare gene expression across single-cell datasets.

So if you’re ready… let’s dive in!

Step One: is your workflow actually valid?

A typical workflow looks something like this: normalize each dataset individually, integrate them, then calculate mean expression and the proportion of expressing cells per dataset. This is a reasonable starting point, but it comes with one critical caveat that is easy to overlook.

When you run an integration workflow — such as Seurat’s IntegrateData() — the resulting “integrated” assay contains residuals or anchor-corrected values. This mathematically transformed space is designed specifically to remove batch effects so that cell types align correctly across datasets in low-dimensional space. It is excellent for clustering and UMAP visualization. It is not appropriate for reporting gene expression.

This is the golden rule of integrated single-cell analysis: never use the integrated assay for differential expression testing or for comparing mean expression levels. The values in that slot no longer reflect biological expression in any interpretable sense. Instead, always perform your expression comparisons on the RNA assay — using the log-normalized counts from NormalizeData() — or on the SCT assay if you used SCTransform. If you are already pulling values from the data slot of the RNA assay after normalization, your workflow is valid for a first-pass comparison, provided sequencing depths are not drastically different across your batches.

NormalizeData()

NormalizeData() applies a straightforward log-normalization: each cell’s counts are divided by its total count, multiplied by a scale factor (typically 10,000), and log-transformed. The appeal of this approach is its interpretability. If a cell has even a single count of your gene of interest, it will produce a non-zero normalized value. Nothing gets quietly erased.
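To make the arithmetic concrete, here is a minimal sketch of that formula in plain Python. It is not Seurat’s code: the function name log_normalize and the toy counts are made up for illustration, and it assumes the natural-log log1p transform that Seurat’s default LogNormalize method applies.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Log-normalize one cell: log1p(count / total_counts * scale_factor)."""
    total = sum(cell_counts)
    if total == 0:
        # An empty cell has nothing to normalize; return zeros.
        return [0.0 for _ in cell_counts]
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


# Toy cell with four genes (made-up counts); the third gene has one count.
cell = [120, 0, 1, 879]
normalized = log_normalize(cell)

# Zero counts stay exactly zero; a single count gives a non-zero value.
print([round(v, 3) for v in normalized])
```

The point to notice is that a zero stays exactly zero while a single count always maps to a positive value, which is why “value greater than zero” works as a detection rule on this assay.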

For comparing the proportion of cells expressing a gene across datasets, this transparency is genuinely useful. The detection threshold behaves consistently, the values are on an intuitive scale, and the method is well-validated for this kind of descriptive analysis. Its main weakness is that it does not fully account for the relationship between a gene’s mean expression and its variance, which means it can be less effective at removing technical noise driven by sequencing depth differences. If one dataset has substantially higher UMI counts per cell, apparent expression differences may be partly technical even after log-normalization.
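As a rough illustration of the descriptive summary discussed here, the sketch below computes the percentage of expressing cells and the mean expression per dataset from log-normalized values. The function name summarize_gene, the dataset labels, and the numbers are all hypothetical; in a real analysis the values would come from the data slot of the RNA assay.

```python
from statistics import mean


def summarize_gene(values_by_dataset, threshold=0.0):
    """Per-dataset detection rate and mean expression for one gene.

    values_by_dataset maps a dataset name to a non-empty list of
    normalized values, one value per cell.
    """
    summary = {}
    for name, values in values_by_dataset.items():
        expressing = [v for v in values if v > threshold]
        summary[name] = {
            "pct_expressing": 100 * len(expressing) / len(values),
            "mean_all_cells": mean(values),
            "mean_expressing_cells": mean(expressing) if expressing else 0.0,
        }
    return summary


# Made-up log-normalized values for one gene in two datasets.
toy = {
    "dataset_A": [0.0, 1.2, 0.0, 2.4, 0.9],
    "dataset_B": [0.0, 0.0, 0.0, 1.1, 0.0],
}
result = summarize_gene(toy)
for name, stats in result.items():
    print(name, stats)
```

Reporting the mean over all cells and the mean over expressing cells separately is a design choice worth keeping: the first mixes detection rate with magnitude, while the second describes only the cells where the transcript was seen.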

SCTransform()

SCTransform takes a more statistically rigorous approach. It fits a regularized negative binomial regression for each gene, modeling and removing the relationship between sequencing depth and expression. This makes it more robust for integration and generally better at separating biological signal from technical noise.

However, this comes with a consequence that surprises many analysts: SCT typically produces fewer cells classified as “expressing” a gene, and the associated expression levels appear lower. This is not a bug, and it is not necessarily a biological finding either. SCT models the count expected for each gene at each cell’s sequencing depth and stores three things in the SCT assay: Pearson residuals (the scale.data slot), depth-corrected counts (the counts slot), and the log1p of those corrected counts (the data slot). During that correction, very low counts that the model attributes to technical noise rather than true signal are shrunk toward zero. Under log-normalization, those same observations might comfortably pass a detection threshold. The difference you observe between NormalizeData() and SCT is therefore largely a reflection of how aggressively each method handles low counts, not a direct window into biology.
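To see why the same raw count can look different after depth modeling, here is a deliberately simplified sketch of a negative binomial Pearson residual. It is not the SCTransform implementation: the real method fits and regularizes the model parameters per gene, whereas this sketch assumes the expected count is simply proportional to depth and uses a fixed dispersion. The names nb_pearson_residual, gene_fraction, and theta, and all the numbers, are assumptions for illustration.

```python
import math


def nb_pearson_residual(count, mu, theta):
    """Pearson residual for a negative binomial with mean mu, dispersion theta."""
    variance = mu + mu ** 2 / theta
    return (count - mu) / math.sqrt(variance)


gene_fraction = 1e-4  # assumed share of a cell's UMIs coming from this gene
theta = 10.0          # assumed dispersion parameter

# The same observation (one UMI) in a shallow and in a deep cell.
shallow = nb_pearson_residual(1, gene_fraction * 2_000, theta)
deep = nb_pearson_residual(1, gene_fraction * 20_000, theta)

print("shallow cell residual:", round(shallow, 2))
print("deep cell residual:", round(deep, 2))
```

Under these assumptions the single count gives a positive residual in the shallow cell (more than expected at that depth) and a negative one in the deep cell (less than expected). That is the sense in which a count that passes a log-normalized detection threshold can carry little weight once depth is modeled.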

For the specific goal of reporting what proportion of cells express a gene, many researchers continue to prefer the log-normalized RNA assay precisely because it is less aggressive with the raw data. SCT is the stronger choice for clustering, dimensionality reduction, and differential expression, but its Pearson residuals are less intuitive for simple descriptive comparisons of expression levels.

Can you compare expression using SCT across datasets?

Yes, but consistency is non-negotiable. You cannot mix normalization methods and draw meaningful comparisons: comparing an SCT-normalized dataset to a log-normalized one is not a valid apples-to-apples analysis. If you want to use SCT for cross-dataset expression comparisons, the workflow needs to be consistent throughout: run SCTransform() on each dataset individually, integrate using the SCT-compatible pipeline (PrepSCTIntegration(), followed by FindIntegrationAnchors() and IntegrateData() with normalization.method = "SCT"), and then use the SCT assay for all downstream comparisons. Because each SCTransform() run corrects counts relative to its own dataset’s depth, recent Seurat versions also provide PrepSCTFindMarkers() to re-correct the counts to a common depth before differential expression on the combined object. As long as the same method is applied uniformly, SCT comparisons are valid, and may actually be more honest if your datasets differ substantially in sequencing depth, since the model corrects for depth bias more aggressively than log-normalization does.

In practice, running both methods and treating the SCT result as a sensitivity analysis is a defensible strategy. If your conclusions about which datasets show higher expression or more expressing cells hold up under both normalizations, they are on considerably stronger ground.
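One simple way to frame that sensitivity analysis is to ask whether the ordering of datasets is the same under both normalizations. The sketch below does just that; the helper names rank_datasets and rankings_agree are hypothetical, and the percentages are made-up placeholders rather than real results.

```python
def rank_datasets(values):
    """Return dataset names ordered from highest to lowest value."""
    return sorted(values, key=values.get, reverse=True)


def rankings_agree(lognorm_values, sct_values):
    """True if both normalizations order the datasets identically."""
    return rank_datasets(lognorm_values) == rank_datasets(sct_values)


# Made-up percent-expressing values for one gene under each method.
lognorm_pct = {"dataset_A": 42.0, "dataset_B": 18.5, "dataset_C": 30.1}
sct_pct = {"dataset_A": 35.2, "dataset_B": 12.0, "dataset_C": 24.8}

agree = rankings_agree(lognorm_pct, sct_pct)
print("ordering under log-normalization:", rank_datasets(lognorm_pct))
print("ordering under SCT:", rank_datasets(sct_pct))
print("same conclusion under both:", agree)
```

Comparing orderings rather than absolute values is intentional: the two methods sit on different scales, so the numbers themselves are not expected to match, only the direction of the conclusion.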

Choosing the right assay for the right question

A useful way to think about this is to match the assay to the biological question:

Clustering and UMAP: use the integrated assay. It removes batch effects so cell types align correctly, but the values themselves are not interpretable as expression.
Proportion of cells expressing a gene: use the RNA assay with log-normalized counts. This gives the most transparent and consistent reflection of whether a transcript was detected.
Expression magnitude and cross-dataset comparisons: either SCT or RNA can work. SCT handles depth bias more rigorously; RNA is easier to interpret. Choose one and apply it consistently.

A practical sanity check is to go back to the raw counts. When you are uncertain whether an observed difference in expression is biological or technical, the most grounding thing you can do is look at them directly. If dataset A has 500 counts of your gene of interest and dataset B has 10, with similar numbers of cells and similar total sequencing depth per cell, that difference is very likely biological. If the raw counts are similar but the normalized values diverge, your normalization parameters may be doing something unexpected. Similarly, if you observe lower expression or fewer expressing cells in certain datasets, the first question worth asking is whether those datasets also have lower median UMI counts or fewer detected genes per cell. A dataset with substantially lower sequencing depth will naturally show reduced detection, and this technical confound should be documented and accounted for before drawing biological conclusions.
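The check described above can be sketched as a small per-dataset summary of raw counts next to sequencing depth. Everything here is illustrative: raw_count_check is a hypothetical helper, and the counts and UMI totals are made up so that the two datasets differ in depth but not in the gene’s share of the library.

```python
from statistics import median


def raw_count_check(gene_counts, total_umis):
    """Summarize raw counts of one gene against depth for one dataset.

    gene_counts[i] is the raw count of the gene in cell i and
    total_umis[i] is that cell's total UMI count.
    """
    n_cells = len(gene_counts)
    detected = sum(1 for c in gene_counts if c > 0)
    return {
        "n_cells": n_cells,
        "gene_counts_total": sum(gene_counts),
        "median_umis_per_cell": median(total_umis),
        "pct_cells_detected": 100 * detected / n_cells,
        "gene_counts_per_10k_umis": 10_000 * sum(gene_counts) / sum(total_umis),
    }


# Made-up raw data: dataset B is sequenced about five times shallower.
check_a = raw_count_check([3, 0, 2, 5], [5000, 4000, 6000, 5000])
check_b = raw_count_check([1, 0, 0, 1], [1000, 1500, 500, 1000])

print("dataset_A", check_a)
print("dataset_B", check_b)
```

In this toy case the shallower dataset shows fewer cells with the gene detected, yet the gene accounts for the same share of all UMIs in both. That is the pattern you would expect from a depth confound rather than a biological difference.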

Conclusions

Comparing gene expression across integrated single-cell datasets is entirely achievable, but it requires deliberate choices at every step. Use NormalizeData() and the RNA assay when you want interpretable, consistent detection thresholds and transparent expression values – this is the recommended approach for reporting the proportion of expressing cells. Use SCT when you want more rigorous noise correction and are comfortable with a less intuitive scale. Always integrate using the appropriate pipeline for your normalization method, and never report expression from the integrated assay itself. And when in doubt, let the raw counts tell you whether the difference you are seeing is worth believing.

And that is the end of this blogpost!

In this post, we covered NormalizeData() vs SCTransform and what’s the best approach when comparing gene expression across datasets. Hope you found it useful!

Squidtastic!

You made it till the end! Hope you found this post useful.

If you have any questions, or if there are any more topics you would like to see here, leave me a comment down below.

Otherwise, have a very nice day and… see you in the next one!
