Cutting out the noise in dimensionality reduction

Deciding how many principal components (PCs) to retain when reducing the dimensionality of an omics dataset is a frequent challenge. This choice is often arbitrary, based on experience, or standard defaults. For example, Seurat (a widely used R toolkit for single-cell RNA-seq analysis) and Scanpy (a popular Python package for single-cell analysis) both default to 50 PCs. Rarely is there a theoretically grounded way to determine the optimal number of components for a given dataset.

A typical approach is to examine the cumulative variance explained by the leading PCs and pick a cutoff, or to look for an “elbow” in the scree plot (which shows the variance captured by each principal component). The problem is that PCs can capture biological signals, structured technical variation, and random noise. This makes it difficult to determine the dataset’s rank (the number of meaningful PCs) based on intuition alone. Overestimating the rank incorporates random noise and risks diluting key conclusions, while underestimating it can obscure critical differences between cell types.
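To make the cumulative-variance heuristic concrete, here is a minimal NumPy sketch. The function name, the 90% target, and the simulated matrix are my own illustrative assumptions, not anything taken from Seurat, Scanpy, or the paper. It applies the cutoff to a matrix of pure noise to show how many components the heuristic keeps even when there is no signal at all.

```python
import numpy as np

def n_components_for_variance(X, target=0.90):
    """Smallest number of PCs whose cumulative explained variance reaches `target`.

    X is a (cells x genes) matrix; `target` is an arbitrary, user-chosen cutoff.
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    explained = s**2 / np.sum(s**2)              # variance fraction per PC
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(s)), explained

# Hypothetical example: 500 "cells" x 100 "genes" of pure Gaussian noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal((500, 100))
k, explained = n_components_for_variance(noise, target=0.90)
print(f"PCs needed to reach 90% of the variance in pure noise: {k}")
```

Every component retained here is noise, which is exactly why a variance cutoff alone says little about the rank of the data.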

This is why I found the paper “Principled PCA Separates Signal from Noise in Omics Count Data” so interesting [1].

So, what do the authors mean by “principled” PCA? Or, conversely, what is “unprincipled” PCA? In omics, unprincipled PCA usually looks like this: normalize by read counts, log-transform the data, run PCA, and keep the first X components that together explain a set percentage of variance. The problem is that not all variance is biological, and it is difficult to tell components that still carry a meaningful signal (whether technical or biological) from those that reflect purely stochastic variation and are therefore meaningless.

It’s also important to note that PCA on a fully randomized dataset still yields PCs with unequal eigenvalues: some components explain more variance than others purely by chance. PCs like these are presumably what we end up keeping when we overestimate the rank of our data.

This brings us to the Biwhitened PCA approach described in the preprint. The authors ask: what would the eigenvalue spectrum (the set of variance values captured by each principal component) look like if the data were purely noise? To answer this, the data is first transformed to be homoscedastic (so that the noise variance across rows and columns is roughly uniform). Under these conditions, we can use concepts from random matrix theory, which predicts how variance behaves in a completely random dataset. The eigenvalues of a purely random matrix follow a predictable pattern called the Marchenko-Pastur distribution, which has a sharp upper edge. PCs whose eigenvalues exceed this noise threshold are retained, because they are likely to represent true biological signal (as well as technical variation) rather than random noise.
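The thresholding idea can be sketched in a few lines of NumPy. This is not the authors' implementation and does not use the bipca package. It skips the biwhitening step entirely and simply assumes the noise already has unit variance everywhere, which is the condition biwhitening is meant to create. The function names, matrix sizes, and signal strength are illustrative assumptions.

```python
import numpy as np

def mp_upper_edge(n_rows, n_cols, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur distribution for the eigenvalues of
    X @ X.T / n_cols, where X is (n_rows x n_cols) i.i.d. noise of variance sigma2."""
    gamma = n_rows / n_cols
    return sigma2 * (1.0 + np.sqrt(gamma)) ** 2

def estimate_rank(X, sigma2=1.0):
    """Count eigenvalues of X @ X.T / n_cols that lie above the noise edge."""
    n_rows, n_cols = X.shape
    eigvals = np.linalg.svd(X, compute_uv=False) ** 2 / n_cols
    return int(np.sum(eigvals > mp_upper_edge(n_rows, n_cols, sigma2)))

rng = np.random.default_rng(1)
p, n, true_rank = 200, 1000, 3                   # hypothetical sizes

# Pure noise with unit variance (assumed to be already homoscedastic).
noise = rng.standard_normal((p, n))

# A strong rank-3 signal built from orthonormal factors, added to the noise.
U = np.linalg.qr(rng.standard_normal((p, true_rank)))[0]
V = np.linalg.qr(rng.standard_normal((n, true_rank)))[0]
signal = 10.0 * np.sqrt(n) * (U @ V.T)

rank_noise = estimate_rank(noise)
rank_signal = estimate_rank(signal + noise)
print("noise edge:", mp_upper_edge(p, n))
print("estimated rank, noise only:", rank_noise)
print("estimated rank, signal + noise:", rank_signal)
```

In a finite matrix the largest noise eigenvalue fluctuates slightly around the theoretical edge, so an estimate can occasionally be off by one. Real count data also needs the rescaling step first, since raw counts are far from unit variance.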

This approach lets the authors estimate the rank of a dataset based on theory rather than heuristics or prior experience. As noted, the default in many tools is 50 PCs, but the actual rank can be below or above this default. The key advantage of principled PCA is that it removes random noise and allows researchers to focus on the meaningful signals.

Two notes of caution: first, this approach likely does not remove batch effects, which can systematically influence mitochondrial read counts, cell quality, or cell cycle states. These effects are “noise” for many biological questions, but they are still real signals and will likely survive this approach. In such cases, individual components should be examined, and their relevance carefully assessed. In some situations, it may make sense to exclude specific components.
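One simple way to examine individual components is to correlate each PC's scores with a known technical covariate. The sketch below uses simulated data and a hypothetical per-cell covariate (standing in for something like mitochondrial read fraction). The function name and all values are assumptions for illustration, not part of any published workflow.

```python
import numpy as np

def pc_covariate_correlations(X, covariate, n_pcs=10):
    """Pearson correlation between each of the first n_pcs PC scores and a covariate.

    X is (cells x genes); covariate has one value per cell.
    """
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_pcs] * s[:n_pcs]            # per-cell PC scores
    return np.array(
        [np.corrcoef(scores[:, k], covariate)[0, 1] for k in range(n_pcs)]
    )

# Simulated example: one technical covariate drives a single direction in gene space.
rng = np.random.default_rng(2)
n_cells, n_genes = 300, 50
covariate = rng.uniform(0.0, 1.0, n_cells)       # hypothetical technical covariate
loadings = rng.standard_normal(n_genes)
X = 5.0 * np.outer(covariate - covariate.mean(), loadings)
X += rng.standard_normal((n_cells, n_genes))     # unit-variance noise

corr = pc_covariate_correlations(X, covariate, n_pcs=5)
for k, r in enumerate(corr, start=1):
    print(f"PC{k}: r = {r:+.2f}")
```

A component that correlates strongly with a technical covariate is a real signal in the statistical sense, but it may still be one you choose to set aside for a given biological question.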

Second, while this approach could shape how principal components are calculated and chosen in the future, it is still a preprint and has not been fully vetted by the community. The authors’ reasoning seems sound, but this technique should be applied cautiously and validated further before widespread adoption.

For those who are interested, the corresponding GitHub repository is here: https://github.com/KlugerLab/bipca

Sean O’Toole

Reference:

1. Stanley, J. S. et al. Principled PCA separates signal from noise in omics count data. Preprint at link (2025).