When does 0 equal 0?


The power of droplet-based single-cell RNA sequencing, and of current-generation spatial transcriptomics techniques, lies in their ability to sample large numbers of cells. However, all current high-throughput transcriptomics technologies come with the caveat of high dropout rates: transcripts that are present in a cell frequently go undetected. Typically, this isn’t a massive problem, and the high-throughput nature of the various platforms often makes up for it.

However, in some cases, we would like to measure low-abundance transcripts with high accuracy. This is especially important when a low-abundance transcript is being investigated as a potential therapeutic target, and investigators need to know reliably where and when it is expressed in order to judge whether targeting that gene is feasible.

Why low-abundance transcripts can matter:

Genes that are expressed at low levels and suffer from high dropout rates may simply reflect transcriptional noise. However, they may also be necessary to subtype cells or to describe disease-relevant states accurately. In some cases, these genes are extremely relevant.

For example, SARS-CoV-2 tissue susceptibility depends on viral entry factors. However, ACE2, one such factor, is expressed at very low levels. Low- and high-read-depth datasets can differ by a factor of 20 in their estimates of ACE2-positive basal cells [1]. Margins this large mean that low-depth datasets can substantially underestimate which tissues and cells are most susceptible to SARS-CoV-2 infection [2].
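To make the depth effect concrete, here is a minimal sketch of why the fraction of positive cells for a lowly expressed gene depends so strongly on sequencing depth. It assumes a simple Poisson sampling model, and the capture rates and transcript numbers are hypothetical values chosen for illustration, not figures from the cited studies.

```python
import math

def detection_probability(mean_transcripts, capture_rate):
    """Probability of observing at least one molecule in a cell.

    Assumes Poisson sampling: observed counts ~ Poisson(mean_transcripts * capture_rate).
    """
    return 1.0 - math.exp(-mean_transcripts * capture_rate)

# Hypothetical numbers, for illustration only.
mean_transcripts = 2.0  # true mean copies per expressing cell
for capture_rate in (0.01, 0.20):
    p = detection_probability(mean_transcripts, capture_rate)
    print(f"capture rate {capture_rate:.2f}: {p:.1%} of expressing cells detected")
```

Under these assumptions, every cell expresses the gene, yet the apparent positive fraction changes by more than an order of magnitude between the two capture rates.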

So the question becomes: when is a 0 really a 0? Can we infer the presence of a transcript we cannot detect, and so better estimate disease susceptibility and identify potential therapeutic targets?

Strategies to deal with dropout:

Valyaeva et al. benchmarked several different approaches to this problem [2]. First, let’s describe the general way these approaches are set up.

Probabilistic inference

Probabilistic inference asks: given the observed levels of transcripts X and Y, when can we reasonably expect transcript Z to be present despite dropout? SAVER [3] and scImpute [4] are both packages that operate this way.

Smoothing approaches

Another approach involves grouping transcriptionally similar cells and smoothing expression values across them to account for zero entries. Examples of this approach are knn-smoothing [5] and MAGIC [6].
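The general smoothing idea can be sketched in a few lines. This is a deliberately simplified illustration that averages each cell with its nearest neighbours by Euclidean distance on raw counts; it is not the published knn-smoothing or MAGIC algorithm, and the function name `knn_smooth` and the toy matrix are made up for this example.

```python
import math

def knn_smooth(counts, k):
    """Average each cell's profile with its k nearest cells (Euclidean distance).

    counts: list of cells, each a list of per-gene counts.
    """
    smoothed = []
    for cell in counts:
        # Rank all cells (including the cell itself) by distance to this cell.
        ranked = sorted(counts, key=lambda other: math.dist(cell, other))
        neighbours = ranked[:k + 1]  # the cell itself plus its k nearest neighbours
        smoothed.append([
            sum(nb[g] for nb in neighbours) / len(neighbours)
            for g in range(len(cell))
        ])
    return smoothed

# Toy matrix: cells 0-2 are similar, cell 3 is a different cell type.
# The third gene dropped out in cell 1 (0 where its neighbours show 1-2 counts).
counts = [
    [10, 8, 1],
    [9, 9, 0],
    [11, 7, 2],
    [0, 1, 0],
]
smoothed = knn_smooth(counts, k=2)
print(smoothed[1])  # the dropout in cell 1 is filled in from cells 0 and 2
print(smoothed[3])  # cell 3 also gains signal, borrowed from unrelated cells
```

The last line shows the flip side of smoothing: a cell with no similar neighbours still receives non-zero values, which is one way a true zero can be lost.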

Machine learning

Finally, machine learning approaches can be used to generate a low-dimensional representation of the data, from which a denoised, non-sparse matrix is reconstructed. This is often done with some variation of an autoencoder, as in scVI [7], or with a low-rank matrix approximation, as in ALRA [8].
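A dense reconstruction does not have to give up zeros entirely. The sketch below illustrates the thresholding idea behind ALRA [8] for a single gene: values that belong to true zeros are assumed to scatter symmetrically around zero after reconstruction, so the magnitude of the most negative value serves as a cutoff. This is a simplification under stated assumptions; the reconstructed values are hypothetical, the low-rank step itself is omitted, and the published method uses a quantile rather than the strict minimum and rescales the surviving values.

```python
def threshold_gene(values):
    """Set to zero every entry that does not exceed the magnitude of the
    most negative entry for this gene."""
    cutoff = abs(min(min(values), 0.0))
    return [v if v > cutoff else 0.0 for v in values]

# Hypothetical dense reconstruction for one gene across eight cells:
# values near zero scatter on both sides, real expression sits well above.
reconstructed = [-0.12, 0.08, 0.05, -0.03, 1.40, 0.95, 0.10, 2.10]
print(threshold_gene(reconstructed))
```

Only the three clearly expressing cells keep a non-zero value; the small positive values are treated as noise around a biological zero.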

Benchmarking results:

Valyaeva et al. benchmarked these methods and others using simulated datasets and found that MAGIC, SAVER, scVI, DCA, and scBIG all tended to inflate the positivity rate of lowly expressed genes, since almost all genes were assigned non-zero estimates. This was further complicated by the fact that the authors did not identify a clear way to determine an appropriate cutoff for zero expression.
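The cutoff problem is easy to reproduce on toy data. The sketch below computes the fraction of positive cells for one gene before and after a hypothetical dense imputation; all values and cutoffs are invented for illustration and do not come from the benchmark.

```python
def positive_fraction(values, cutoff=0.0):
    """Fraction of cells whose value for a gene exceeds the cutoff."""
    return sum(v > cutoff for v in values) / len(values)

# Hypothetical values for one gene across ten cells.
raw_counts = [0, 0, 0, 1, 0, 0, 2, 0, 0, 0]
imputed = [0.02, 0.01, 0.03, 0.90, 0.02, 0.04, 1.70, 0.01, 0.05, 0.02]

print(positive_fraction(raw_counts))  # 2 of 10 cells have a count
print(positive_fraction(imputed))     # every cell now counts as "positive"
for cutoff in (0.03, 0.1, 0.5):
    print(cutoff, positive_fraction(imputed, cutoff))
```

With no cutoff, the imputed matrix reports the gene in every cell, and the answer then depends on which cutoff is picked, which is exactly why a principled threshold matters.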

In practice, this means that these approaches require parameter tuning and, in some cases, do not produce raw or corrected count matrices.

Without adjustments and supervision, MAGIC, SAVER, scVI, and scBIG are not suitable for accurately inferring the percentage of positive cells.

When the data were not entirely synthetic, but instead generated by down-sampling high-depth datasets, ALRA, knn-smoothing, and scImpute performed best at recovering information about lowly expressed genes. ALRA showed the best overall performance independent of sequencing depth.
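Down-sampling a high-depth dataset gives a ground truth to compare against, because the original counts are known. One common way to do this, shown here as a sketch, is binomial thinning: each observed molecule is kept independently with a fixed probability. Whether the benchmark used exactly this procedure is not stated here, and the matrix, `keep_fraction`, and helper names are hypothetical.

```python
import random

def downsample(counts, keep_fraction, seed=0):
    """Keep each observed molecule independently with probability keep_fraction."""
    rng = random.Random(seed)
    return [
        [sum(rng.random() < keep_fraction for _ in range(c)) for c in cell]
        for cell in counts
    ]

def zero_fraction(matrix):
    """Fraction of entries in the matrix that are exactly zero."""
    flat = [c for row in matrix for c in row]
    return sum(c == 0 for c in flat) / len(flat)

# Hypothetical high-depth counts: three cells by four genes.
high_depth = [
    [40, 3, 0, 12],
    [35, 0, 1, 9],
    [50, 2, 0, 15],
]
low_depth = downsample(high_depth, keep_fraction=0.1)
print(zero_fraction(high_depth), zero_fraction(low_depth))
```

Zeros in the high-depth matrix stay zero after thinning, while low counts are the first to disappear, so an imputation method can be scored on how well it restores the latter without touching the former.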

When ALRA was applied to another dataset containing matched low- and high-depth technologies with biological replicates, the algorithm's ability to recover lowly expressed genes was impressive. On the evidence of this benchmark, it appears to be the best method for imputing high-dropout genes in single-cell transcriptomics datasets.

Take-home message:

Overall, for imputing lowly expressed genes while preserving biological zeros, ALRA was clearly the highest performer in this benchmark.

Sean O’Toole

References:

[1] Valyaeva, A. A., Zharikova, A. A., Kasianov, A. S., Vassetzky, Y. S. & Sheval, E. V. Expression of SARS-CoV-2 entry factors in lung epithelial stem cells and its potential implications for COVID-19. Sci. Rep. 10, 17772 (2020).
[2] Valyaeva, A. A., Tochilkina, M. S. & Sheval, E. V. Evaluating imputation methods for accurate estimation of cell population fractions in single-cell RNA sequencing. NAR Genomics Bioinforma. 8, lqaf204 (2026).
[3] Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).
[4] Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).
[5] Wagner, F., Yan, Y. & Yanai, I. K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data. Preprint at https://doi.org/10.1101/217737 (2018).
[6] van Dijk, D. et al. Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell 174, 716–729.e27 (2018).
[7] Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
[8] Linderman, G. C. et al. Zero-preserving imputation of single-cell RNA-seq data. Nat. Commun. 13, 192 (2022).
