Understanding the limitations and promise of foundation models for drug target discovery
You’ve likely interacted with large language models and, while you may not be an expert in their architecture, you are skilled at evaluating their human-readable outputs.
Text generation by LLMs has already added value and boosted productivity. However, their greatest potential may lie in handling data that is not human-readable, such as hundreds of thousands of transcriptomes across numerous conditions. We cannot read such data like a novel or intuitively identify the best gene targets for a disease, but LLMs can potentially predict whether altering specific genes could shift a cell, organ, or person toward a healthier state.
This is where biological foundation models add value at the intersection of AI and biology. There is enormous promise, but pharma and biotech leaders must understand both their limitations and potential.
The current slate of perturbation prediction models
Let’s first clarify that there are several types of biologically relevant models [1]. Here, we focus on foundation models that predict how manipulating individual genes affects physiological outcomes.
Most of these models are based on transformer architectures like those behind ChatGPT and Gemini [2]. At a high level, they learn rich representations of a cell from its gene expression profile. The key utility is asking how a cell’s state shifts relative to a disease reference: does a genetic perturbation move the cell toward or away from a healthy phenotype?
The central assumption is that with enough training data, the right architecture, and a robust training regime, these representational shifts can meaningfully predict therapeutically relevant targets. Pre-training is crucial for generalization to new datasets and contexts.
Current examples include Geneformer [3], STATE [4], and scGPT [5]. Despite differences in encoding, training, and architecture [1], they share this core representational and perturbational approach.
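To make this representational logic concrete, here is a minimal sketch in Python. The `embed` function is a placeholder for any pretrained model's encoder, and the distance-based scoring rule is an illustrative assumption, not the actual API of Geneformer, STATE, or scGPT.

```python
import numpy as np

def embed(expression_profile: np.ndarray) -> np.ndarray:
    """Placeholder for a pretrained encoder that maps a gene expression
    profile to a dense embedding; swap in Geneformer, STATE, scGPT, etc."""
    raise NotImplementedError

def perturbation_score(diseased: np.ndarray,
                       perturbed: np.ndarray,
                       healthy_ref: np.ndarray) -> float:
    """Positive score: the in-silico perturbation moved the diseased
    cell's embedding closer to the healthy reference state."""
    z_healthy = embed(healthy_ref)
    dist_before = np.linalg.norm(embed(diseased) - z_healthy)
    dist_after = np.linalg.norm(embed(perturbed) - z_healthy)
    return float(dist_before - dist_after)
```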
Multi-gene perturbation and differential cell-type response prediction are strong use cases for foundation models
Before discussing the limitations of foundation models for perturbation prediction and target discovery, it is useful to highlight two of their most promising applications.
First, screening every gene in a cell line or animal model is expensive but feasible. However, once combinations of two or more genes are considered, the problem quickly becomes intractable. Models that can meaningfully narrow this search space could dramatically accelerate therapeutic development. This is especially important for multigenic diseases, where gently modulating multiple genes may be more effective than strongly targeting a single gene.
Second, these models can help predict how a specific cell type or tissue responds to a perturbation, and they can anticipate off-target effects, such as an AAV designed for the heart also affecting the liver or kidney. Anticipating such effects improves the safety and precision of candidate therapies.
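A quick back-of-the-envelope calculation makes the combinatorial explosion behind the first point concrete, assuming roughly 20,000 human protein-coding genes:

```python
from math import comb

n_genes = 20_000  # approximate number of human protein-coding genes
print(f"{comb(n_genes, 1):,}")  # 20,000 single-gene screens: costly but feasible
print(f"{comb(n_genes, 2):,}")  # ~200 million pairs: beyond any wet-lab screen
print(f"{comb(n_genes, 3):,}")  # ~1.3 trillion triples: computational triage only
```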
Foundation models have a long way to go
There is a lot of hope for these models, but are they actually reliable enough to support strong biological predictions? The current evidence is mixed.
One study found that, for double perturbations, predicted transcriptome-wide fold-changes did not outperform simply summing the effects of the individual perturbations, and that a linear model predicted unseen expression profiles better than the deep models [6]. However, this analysis focused on transcriptome reconstruction, which is not the same as predicting biological outcomes.
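For intuition, that additive baseline can be written in a couple of lines. A sketch with placeholder data, assuming log fold-change vectors are available for each single perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Log fold-changes vs. control for two single-gene perturbations,
# measured across 2,000 genes (placeholder random values here).
lfc_a = rng.normal(0.0, 1.0, 2_000)
lfc_b = rng.normal(0.0, 1.0, 2_000)

# The "no interaction" baseline: the double perturbation's effect is
# predicted as the sum of the two single-perturbation effects.
predicted_double = lfc_a + lfc_b
```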
Another study showed that predicting responses to unseen perturbations is harder than expected: models that appear accurate may be exploiting systematic variation in disease-versus-healthy datasets, and fine-tuning on small or homogeneous datasets can capture sample biases rather than true biology [7].
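A toy simulation, with entirely made-up numbers, shows how this failure mode arises: when a shared systematic shift dominates the perturbation-specific signal, even a model that only predicts the average response looks highly accurate.

```python
import numpy as np

rng = np.random.default_rng(1)
n_perturbations, n_genes = 50, 2_000

# A systematic shift shared by every perturbation (e.g. a generic stress
# response or disease-vs-healthy sample bias) plus small specific effects.
shared = rng.normal(0.0, 1.0, n_genes)
specific = rng.normal(0.0, 0.1, (n_perturbations, n_genes))
observed = shared + specific

# A "model" that simply predicts the mean response over training data
# still correlates ~0.99 with a held-out perturbation it never saw.
baseline = observed[:-1].mean(axis=0)
r = np.corrcoef(baseline, observed[-1])[0, 1]
print(f"correlation with held-out perturbation: {r:.2f}")
```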
A different study found that while foundation models excel at technical tasks like batch effect correction, simpler methods such as scVI and PCA often produce embeddings that better capture biologically meaningful perturbation signals and generalize more reliably [8].
However, prior benchmarks often use metrics that dilute meaningful signals, since only a subset of genes is informative for any given perturbation [9]. When metrics are calibrated to detect biologically relevant predictions, foundation models and deep learning approaches generally outperform linear baselines, suggesting that some of the earlier negative results reflect evaluation choices rather than modeling limitations.
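Again, a toy example with made-up numbers illustrates the dilution problem: a model that recovers the truly responsive genes well can still look mediocre when the metric averages over the whole transcriptome.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_informative = 20_000, 200

# True effect: zero for most genes, real signal for a small informative set.
true_effect = np.zeros(n_genes)
informative = rng.choice(n_genes, n_informative, replace=False)
true_effect[informative] = rng.normal(0.0, 2.0, n_informative)

# A model that recovers the signal, with per-gene measurement noise.
prediction = true_effect + rng.normal(0.0, 1.0, n_genes)

r_all = np.corrcoef(prediction, true_effect)[0, 1]
r_info = np.corrcoef(prediction[informative], true_effect[informative])[0, 1]
print(f"all genes:         r = {r_all:.2f}")   # ~0.20, looks unimpressive
print(f"informative genes: r = {r_info:.2f}")  # ~0.89, the signal that matters
```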
Data scale and processing remain challenges
These models depend on large amounts of data. The latest foundation models trained on transcriptomic data typically use over 100 million cells, often drawn from public resources like CellXGene, maintained by the Chan Zuckerberg Initiative. While this seems substantial, limitations appear when considering context: CellXGene (link) currently hosts 2,044 datasets, which works out to roughly 49,000 cells per dataset on average, and the diversity of tissues, diseases, and conditions is bounded by those datasets. Even training on 100 million cells may therefore not provide enough diversity for a model to fully capture the intended biological relationships.
Moreover, how data are standardized, weighted, and filtered during training remains an unresolved challenge.
Forming intuitions on current limitations
To build intuition about the strengths and weaknesses of biological foundation models, consider GPT-2 small, an early predecessor of ChatGPT with 124 million parameters. This is comparable to one version of Geneformer (Geneformer_v2_104M), which has 104 million parameters. While the exact amount of data GPT-2 small was trained on is not public, it is far less than that used for its successors, making it useful for illustrating issues related to data scale.
A fairer comparison would involve training on noisy or sparse text to reflect single-cell data. I do not do that here, which actually gives GPT-2 small an advantage, but the comparison remains instructive.
With that in mind, let us conduct a brief experiment, which you can try yourself (link to the app). For these examples, I selected GPT-2 small with default parameters and a maximum sequence length of 20.
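The linked app may be implemented differently, but a minimal sketch with the Hugging Face transformers library reproduces the setup (sampling is stochastic, so exact outputs will vary):

```python
from transformers import pipeline, set_seed

set_seed(42)  # sampled continuations differ across seeds
generator = pipeline("text-generation", model="gpt2")  # GPT-2 small, 124M params

prompts = [
    "The mitochondria are the powerhouse of the",
    "The mitochon___ is the pow__house of the",
    "The mitokondria is the powrhouze of the",
]
for prompt in prompts:
    out = generator(prompt, max_length=20, num_return_sequences=1)
    print(out[0]["generated_text"])
```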
If we give the following prompt:
“The mitochondria are the powerhouse of the”
GPT-2 small produces:
➜ “body’s energy production. They help to keep our bodies running”
This is a reasonable continuation, although most people would expect it to begin with “cell,” given how familiar the phrase “powerhouse of the cell” is to anyone who has taken a biology course.
Now, let us introduce sparsity:
“The mitochon___ is the pow__house of the”
GPT-2 small produces:
➜ “sun. It’s a place where you”
This result is clearly incorrect. Perhaps underscores are an unfair representation of sparsity, so instead we introduce simple misspellings:
“The mitokondria is the powrhouze of the”
GPT-2 small produces:
➜ “kyber-tree, and”
At this stage, the model fails decisively. A transformer with a parameter count comparable to current biological foundation models struggles when its input is sparse or noisy. Some of this behavior reflects tokenization choices, and this is not a rigorous evaluation or a direct claim about biological modeling. Still, it illustrates current limitations, and those limitations do not imply the approach is flawed: improved results are expected as data quality, scale, and modeling methods advance.
Today, these models are best viewed as prioritization tools
These examples and criticisms do not mean we should reject biological foundation models for perturbation simulation. They highlight current limitations, but these models remain informative for examining individual gene impacts, especially when the tissue or disease is well represented in the training data. Fine-tuning on disease-relevant datasets is still important.
It is best to view these models as tools for exploration and prioritization. Like single-cell or histological atlases, they can help researchers identify promising targets, provided outputs are validated using SNP, eQTL, or expression data.
These models are not magic, but they are valuable engines for hypothesis generation. As model quality improves and we refine evaluation methods, there is reason for optimism about their future utility.
References
1. Ahlmann-Eltze, C. et al. Representation learning of single-cell RNA-seq data. RNA. link (2026).
2. Vaswani, A. et al. Attention Is All You Need. Neural Information Processing Systems. link (2017).
3. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature. link (2023).
4. Adduri, A. K. et al. Predicting cellular responses to perturbation across diverse contexts with State. Preprint at link (2025).
5. Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods. link (2024).
6. Ahlmann-Eltze, C., Huber, W. & Anders, S. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nat. Methods. link (2025).
7. Viñas Torné, R. et al. Systema: a framework for evaluating genetic perturbation response prediction beyond systematic variation. Nat. Biotechnol. link (2025).
8. Bendidi, I. et al. Benchmarking Transcriptomics Foundation Models for Perturbation Analysis: one PCA still rules them all. Preprint at link (2024).
9. Miller, H. E., Mejia, G. M., Leblanc, F. J. A., Wang, B. & Swain, B. Deep Learning-Based Genetic Perturbation Models Do Outperform Uninformative Baselines on Well-Calibrated Metrics. Preprint at link (2025).
