Eight Ways Mechanistic Interpretability Can Drive Scientific Discovery in Biology
A framework for translating what's inside biological foundation models into end-product knowledge about how life works.
Biological foundation models are getting impressively good at prediction. Models like scGPT, Geneformer, Evo 2, and Pleiades can classify cell types, predict perturbation responses, detect disease from blood, and model DNA across all domains of life. But in biology, a prediction without a mechanism is just a correlation. And correlations, as every biologist knows, have a nasty habit of not replicating.
The question that matters for science is not "can the model predict?" but "what has the model learned that we haven't?" Mechanistic interpretability, the discipline of reverse-engineering what neural networks compute internally, offers a systematic way to answer this question. And the answer, increasingly, is: quite a lot.
Here I lay out a framework of eight concrete ways that mechanistic interpretability of biological foundation models can produce end-product scientific discoveries, meaning not just observations about model internals, but testable, model-independent hypotheses about biology that can stand on their own. Some of these are already demonstrated. Others are speculative but grounded in existing methodology. Together, they sketch a research program that may become central to biological discovery over the next decade.
1. Novel Biomarker Discovery via Feature Attribution
The pattern: Train or fine-tune a biological foundation model on a disease classification task, then use interpretability tools (sparse autoencoders, gradient attribution, probing) to identify which input features actually drive the model's decisions, and check whether those features were previously known to be relevant.
The demonstrated case: The clearest existing example is Goodfire and Prima Mente's recent work on Alzheimer's detection. They applied mechanistic interpretability to Pleiades, a 7-billion parameter epigenetic foundation model, to understand how it detects Alzheimer's disease from cell-free DNA in blood. Going in, the research team expected methylation patterns to dominate the classifier's decisions, given the existing AD literature. Instead, their SAE decomposition with gradient attribution revealed that approximately nine SAE features responsible for the majority of the classifier's performance were all strongly correlated with fragment length, a signal class (fragmentomics) that had been studied extensively in cancer detection but never systematically applied to Alzheimer's.
The critical step was distillation: they built a simple logistic regression model using only fragment-length features extracted directly from raw sequencing data, with no Pleiades embeddings involved at all. This model-independent classifier achieved 0.78 AUROC on an independent test cohort, while classifiers built on the literature-expected signals (methylation alone: 0.55, cell-type deconvolution alone: 0.54) barely generalized. The foundation model served as a hypothesis generation engine, and the final deliverable was a conventional biostatistical model that any lab could reproduce.
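The attribution-then-distillation loop can be sketched end to end. What follows is a toy numpy sketch, not the Pleiades pipeline: the SAE activations, gradients, and "fragment length" column are all synthetic, and the distilled classifier is a minimal hand-rolled logistic regression standing in for the biostatistical model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the pipeline's inputs: per-cell SAE feature
# activations and gradients of the disease logit w.r.t. those features.
# (All data here is fabricated; nothing comes from Pleiades.)
n_cells, n_feats = 500, 64
acts = rng.random((n_cells, n_feats))
grads = rng.normal(size=(n_cells, n_feats))

# Step 1 -- gradient-x-activation attribution: rank SAE features by how
# strongly they drive the classifier's decision.
attribution = np.abs(acts * grads).mean(axis=0)
top_features = np.argsort(attribution)[::-1][:9]

# Step 2 -- distillation: refit a model-free classifier on a raw covariate
# the top features pointed to (here a fake "fragment length" column).
frag_len = rng.normal(160.0, 20.0, size=n_cells)
labels = (frag_len < 155.0).astype(float)          # toy ground truth
x = (frag_len - frag_len.mean()) / frag_len.std()

w, b = 0.0, 0.0                                    # hand-rolled logistic regression
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.1 * np.mean((p - labels) * x)
    b -= 0.1 * np.mean(p - labels)

p = 1.0 / (1.0 + np.exp(-(w * x + b)))
acc = np.mean((p > 0.5) == labels)
```

The point of the sketch is the division of labor: the attribution step only generates the hypothesis about which raw covariate matters; the final classifier never touches model embeddings.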
Where this goes next: The same pattern applies naturally to single-cell transcriptomics. Consider fine-tuning scGPT or Geneformer on a disease classification task, say distinguishing ulcerative colitis from Crohn's disease using gut biopsy single-cell data, and then running the same SAE + gradient attribution pipeline. The model might reveal that its classifier relies heavily on gene programs in a rare immune cell subtype, or on metabolic pathway activations, that the field hadn't prioritized as diagnostic markers. The key requirement is that the interpretability finding must surprise you relative to prior knowledge; if the model simply rediscovers known markers, the contribution is validation rather than discovery, which is valuable but not transformative.
This approach extends naturally to treatment response prediction as well: if a model predicts which patients respond to a given therapy and interpretability reveals it keys on a specific transcriptomic signature in a specific cell type, that signature becomes a candidate companion diagnostic biomarker.
2. Gene Regulatory Network Refinement and Causal Edge Discovery
The pattern: Extract interaction graphs from model internals (attention patterns, SAE feature co-activation, activation patching) and compare them against known GRN databases. Where the model-derived network contains strong edges not present in existing databases, those become candidate novel regulatory relationships.
What we know so far: Our systematic evaluation at Biodyn showed that attention patterns in scGPT and Geneformer capture co-expression structure with layer-specific organization (protein-protein interactions in early layers, transcriptional regulation in later layers), but this structure provides no incremental value for perturbation prediction beyond trivial gene-level baselines. In other words, attention encodes biological structure, but not the causal regulatory logic that would constitute genuine GRN discovery.
However, our subsequent SAE atlas work revealed that the residual stream encodes far richer biological organization than attention alone: 29-59% of SAE features annotate to Gene Ontology, KEGG, Reactome, STRING, or TRRUST terms, and features organize into 141 co-activation modules in Geneformer. The negative result on attention sharpens the question of where regulatory knowledge does reside, and the SAE atlas provides a much more promising substrate for GRN extraction.
The concrete end-product: Suppose you train SAEs on foundation model representations computed over Perturb-seq data, where cells have been subjected to CRISPR-mediated gene knockouts. If you find features that activate specifically when a particular gene is perturbed and that encode the downstream transcriptomic consequences, you can then do causal tracing (activation patching) to discover how the model routes information from the perturbed gene through intermediate representations. If one of those intermediate nodes maps onto a transcription factor not currently annotated in GRN databases as regulating the downstream targets, that node becomes a hypothesis for a novel regulatory link. The end-product is a refined gene regulatory network where new edges come with mechanistic evidence from the model's internal circuitry, plus quantified confidence from causal intervention tests.
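The causal tracing step can be illustrated with a toy stand-in. Assuming a tiny random two-layer network in place of the foundation model (the weights are arbitrary and encode no real regulatory biology), activation patching copies one hidden unit's activation from the "perturbed" run into the "clean" run and scores how much of the knockout effect that single node transmits:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny random two-layer net stands in for the foundation model.
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 1))

def forward(x, patch=None):
    """Run the net; optionally overwrite one hidden unit (activation patching)."""
    h = np.tanh(x @ W1)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[:, idx] = value
    return h @ W2

x_clean = rng.normal(size=(1, 8))
x_pert = x_clean.copy()
x_pert[0, 3] = 0.0                      # "knock out" input gene 3

h_pert = np.tanh(x_pert @ W1)           # hidden activations under the knockout
full_effect = (forward(x_pert) - forward(x_clean))[0, 0]

# Patch each hidden unit's perturbed activation into the clean run and
# score how much of the knockout effect flows through that single node.
effects = [(forward(x_clean, patch=(i, h_pert[0, i])) - forward(x_clean))[0, 0]
           for i in range(8)]
key_node = int(np.argmax(np.abs(effects)))
```

In the real setting, the hidden node with the dominant patched effect is the intermediate representation you would try to map onto a transcription factor.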
3. Cell State Taxonomy Discovery via Representation Geometry
The pattern: Study the manifold structure of model embeddings, identify clusters, trajectories, and branching points, and check whether these correspond to known cell states or reveal new ones.
Existing precedents: Goodfire's phylogeny manifold work on Evo 2 demonstrated that geodesic distances along a curved manifold in the model's embedding space correlate linearly with evolutionary branch lengths across 2,400+ bacterial species, constituting one of the most complex natural manifolds yet found in a foundation model. The Alzheimer's paper showed a similar geometric finding at a different scale: fragment lengths form a U-shaped manifold in Pleiades' embeddings with curvature peaks at biologically meaningful values (147 bp for the nucleosome core, 167 bp with linker DNA), suggesting the model devotes additional representational capacity to biologically privileged lengths. Our own hematopoietic manifold extraction work and topological hypothesis screening across scGPT and Geneformer confirmed that these models learn genuine geometric structure, with persistent homology significant in 11 of 12 transformer layers and cross-model geometric alignment between independently trained architectures.
The end-product discovery: Consider running SAE decomposition on Geneformer embeddings across a large multi-tissue atlas. If you find an SAE feature that activates strongly for a specific subpopulation of cells that doesn't cleanly map onto any existing cell type annotation, but which is characterized by a coherent gene program (and which appears consistently across donors and tissues), you may have discovered a novel cell state. The interpretability tools give you both the existence claim (the feature) and the characterization (which genes define it, where in the manifold it sits, what its neighbors are).
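A minimal sketch of the novelty check, on synthetic data: plant a feature that fires on cells drawn from several existing annotations, then test whether its active set nests inside any one known label. The 0.8 purity threshold is an arbitrary illustrative choice, not a published criterion.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic atlas: 1000 cells with one of five existing cell-type labels,
# plus the per-cell activation of a single SAE feature.
n_cells = 1000
labels = rng.integers(0, 5, size=n_cells)
feature_acts = rng.random(n_cells)

# Plant a candidate novel state: the feature fires on 80 cells drawn at
# random across the existing labels, so it nests cleanly inside none of them.
novel = rng.choice(n_cells, size=80, replace=False)
feature_acts[novel] += 2.0

active = feature_acts > 1.5
counts = np.bincount(labels[active], minlength=5)
max_label_fraction = counts.max() / counts.sum()

# If the active set is spread across annotations rather than nested in one,
# flag it as a candidate new cell state (0.8 purity cutoff is illustrative).
is_candidate_novel_state = max_label_fraction < 0.8
```

A real analysis would additionally demand donor- and tissue-level consistency before making the existence claim, as described above.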
The especially interesting case is when the manifold geometry reveals transition states between known cell types that are biologically meaningful. If there's a bottleneck in the differentiation manifold that maps onto a specific, previously under-characterized progenitor population, the geometry itself is telling you something about the topology of differentiation that standard clustering approaches would miss entirely.
4. Drug Target Prioritization via Perturbation Circuit Analysis
The pattern: Use circuit-level interpretability on perturbation-trained models to identify which genes, when perturbed, produce the largest and most specific effects on disease-relevant model features, and rank these as drug target candidates.
Existing validation: Geneformer's original paper already demonstrated a version of this: in silico perturbation with zero-shot learning identified TEAD4, a novel transcription factor in cardiomyocytes that was experimentally validated to be critical for contractile force generation, and in silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that were experimentally validated to improve contractility in an iPSC disease model. However, this approach used the model as a black box; interpretability can make it substantially more powerful.
What interpretability adds: Suppose you've used SAEs to identify a feature that corresponds to a disease-associated inflammatory program. You could then systematically trace, for each possible gene perturbation, how much it shifts the activation of that disease feature. The genes whose perturbation most effectively suppresses the disease feature, while minimally activating other features representing potential side effects or toxicity programs, become top drug target candidates. You're doing in silico target discovery by searching for perturbations that selectively modulate specific model-internal circuits.
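The selectivity ranking can be sketched with a hypothetical effect matrix; the numbers are synthetic and the scoring rule (suppression minus mean off-target disturbance) is one illustrative choice among many.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical effect matrix: effects[g, f] = change in SAE feature f's
# activation when gene g is knocked out in silico (synthetic numbers).
n_genes, n_feats = 200, 50
effects = rng.normal(scale=0.1, size=(n_genes, n_feats))
disease_feat = 7
effects[42, disease_feat] = -2.0   # gene 42 strongly suppresses the disease feature

# Score each knockout: suppression of the disease feature minus mean
# off-target disturbance across all other features.
suppression = -effects[:, disease_feat]
off_target = np.abs(np.delete(effects, disease_feat, axis=1)).mean(axis=1)
score = suppression - off_target
ranked = np.argsort(score)[::-1]   # top drug-target candidates first
```

In practice the off-target term would be restricted to features annotated as toxicity or side-effect programs rather than all remaining features.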
What makes this different from just running the model as a black-box predictor of perturbation outcomes is that the interpretability layer gives you a mechanistic story. You can say not just "knocking out gene X reduces the disease score" but "knocking out gene X disrupts the circuit through which the model represents the NF-κB inflammatory cascade, specifically by eliminating the upstream activation of a feature encoding IL-6/STAT3 pathway co-activation." This mechanistic story is exactly what a drug discovery team needs to assess biological plausibility before investing in experimental validation.
5. Cross-Species and Cross-Tissue Mechanism Transfer
The pattern: Find features or circuits in a biological foundation model that are conserved across species or tissues, and use this to transfer biological knowledge from well-studied systems to poorly-studied ones.
Existing evidence for conservation: Our topological hypothesis screening work found strong cross-model geometric alignment between scGPT and Geneformer despite independent training on different data with different architectures, showing that the two models converge on similar geometric organization of gene relationships. This is analogous to independently constructed maps of a city agreeing on the positions of landmarks: strong evidence that the landmarks (biological relationships) are real features of the territory and not artifacts of the cartographer. The Evo 2 phylogeny manifold provides a complementary form of evidence: the model has implicitly learned what's conserved and what diverges across species by encoding evolutionary distances geometrically.
The end-product discovery: Consider identifying a circuit in Geneformer that represents the epithelial-to-mesenchymal transition (EMT) in human lung cancer cells, characterized by a specific set of SAE features and their causal interactions. If you find the same features (or highly analogous ones) activating in mouse model data or in a different tissue (pancreatic cancer), you've established that the model has learned a conserved mechanism, which tells you that findings from one system are likely to transfer to the other.
Conversely, and this is the more novel discovery, if the circuit is partially conserved but diverges at a specific point (say, a different transcription factor takes over a regulatory role in one tissue versus another), that divergence point is itself a discovery about tissue-specific biology. The model's internals become a systematic way to map conservation and divergence across biological contexts, something that would be extraordinarily labor-intensive to do experimentally.
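One simple way to operationalize the conservation test is best-match cosine similarity between SAE decoder dictionaries from two contexts. The sketch below uses synthetic dictionaries with one deliberately diverged feature; the 0.5 cutoff is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic SAE decoder dictionaries from two contexts (say, two tissues),
# expressed in a shared 100-dimensional gene space. Context B is a noisy
# copy of A except for one deliberately diverged feature.
d, k = 100, 30
dict_a = rng.normal(size=(k, d))
dict_b = dict_a + rng.normal(scale=0.1, size=(k, d))
dict_b[5] = rng.normal(size=d)     # feature 5 diverges between contexts

def unit(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Best-match cosine similarity for each feature in context A against all
# features in context B; low values mark divergence points.
sims = (unit(dict_a) @ unit(dict_b).T).max(axis=1)
diverged = np.where(sims < 0.5)[0]
```

The diverged features, not the conserved ones, are where the tissue-specific biology described above would be hiding.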
6. Compact Algorithm Extraction
The pattern: Identify a circuit in a biological foundation model that performs a specific biological computation, reverse-engineer the algorithm the circuit implements, and export that algorithm as a standalone, human-interpretable procedure that can be run independently of the model.
This is arguably the highest-ceiling category in the framework, and it maps onto what I've argued in The Map Inside the Machine regarding extracting "compact biological algorithms" from foundation model internals. Unlike the other categories, which produce hypotheses or statistical models, this one produces computational objects that encode biological logic directly.
The demonstrated case: Our hematopoietic manifold extraction work provides what we believe is the first published instance of a biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We showed that scGPT internally encodes a compact (~8-10 dimensional) hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel (616 anchors, 564,253 cells) and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduced a general three-stage extraction method: direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering and competitive cell-type classification. Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. The exported operator compresses from three pooled attention heads to a single head without statistically significant loss, and further to a rank-64 surrogate.
The key point is that the end-product here is not a claim about the model but rather a standalone, lightweight computational object (a low-rank attention operator with interpretable biological factors) that can be applied to new data without scGPT, that outperforms standard methods on multiple benchmarks, and whose internal structure maps onto known hematopoietic biology. The model served as a source of compressed biological knowledge, and interpretability was the extraction tool.
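The rank-compression step at the end of this pipeline can be sketched with plain truncated SVD. The "operator" below is a synthetic low-rank-plus-noise matrix standing in for a pooled attention export; nothing here is taken from scGPT weights, and the rank is chosen by construction rather than by ablation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in for a pooled attention operator: a rank-4 signal
# plus small noise (no real scGPT weights are involved).
d = 128
core = rng.normal(size=(d, 4)) @ rng.normal(size=(4, d))
operator = core + rng.normal(scale=0.01, size=(d, d))

# Compress to a rank-r surrogate via truncated SVD and measure fidelity.
U, S, Vt = np.linalg.svd(operator)
r = 4
surrogate = (U[:, :r] * S[:r]) @ Vt[:r]
rel_err = np.linalg.norm(operator - surrogate) / np.linalg.norm(operator)
```

If the relative error stays small at low rank, the surrogate is the exportable object: a small matrix that can be applied to new data without the parent model.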
Where this goes further: The Goodfire/Prima Mente fragment-length distillation for Alzheimer's is structurally the same idea at a simpler level: extracting the decision rule the model uses and exporting it as a model-free logistic regression. The next frontier is to apply this extraction methodology to more algorithmically complex computations that models may have learned: scoring functions for protein-protein interactions that weight biophysical properties in novel combinations, say, or quantitative cell fate decision rules that function as something akin to a Waddington landscape, giving developmental biologists a mechanistically grounded model of differentiation that goes beyond the qualitative metaphors currently available.
7. Artifact Detection and Ground Truth Correction
The pattern: Use interpretability to discover that a model's high performance on a biological task is driven by confounders or batch effects rather than real biology, thereby correcting the scientific record.
This is a "negative" discovery, but potentially just as valuable as a positive one.
Existing evidence that this matters: Our evaluation bias work showed that mapping and candidate-set choices dominate GRN benchmark metrics and can cause misleading ranking reversals; published accuracy metrics for gene regulatory network extraction may therefore be substantially inflated by benchmark artifacts rather than reflecting genuine recovery of biological signal. The finding that trivial gene-level baselines outperform attention-derived edges (AUROC 0.81-0.88 versus 0.70) is itself a form of artifact detection at the benchmark level.
The end-product discovery: If you apply SAE decomposition to a cell-type classifier and discover that the top features driving classification encode sequencing platform artifacts (10x versus Smart-seq2 signatures) rather than genuine cell-type markers, you've identified a specific case where published accuracy metrics are inflated. More importantly, by removing those artifact features and examining what remains, you get a cleaner picture of what the real biological signal is. This kind of systematic deconfounding, guided by the model's own internal representations, could correct significant portions of the published biological literature that rely on machine learning classifiers without understanding what those classifiers actually learned.
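A toy version of the deconfounding check, with synthetic features in which one column encodes a platform batch effect correlated with the label: find the feature that most drives the label, then ask whether it tracks the technical covariate rather than biology. Thresholds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic feature matrix: ten "biology" columns of noise plus one column
# encoding a sequencing-platform batch effect confounded with the label.
n = 600
platform = rng.integers(0, 2, size=n)      # 0 = platform A, 1 = platform B
label = platform.copy()
label[:100] = 1 - label[:100]              # label mostly, not fully, tracks platform
bio = rng.normal(size=(n, 10))
artifact = platform[:, None] + rng.normal(scale=0.1, size=(n, 1))
feats = np.hstack([bio, artifact])

# Which feature most drives the label?
corrs_label = [abs(np.corrcoef(feats[:, j], label)[0, 1])
               for j in range(feats.shape[1])]
top = int(np.argmax(corrs_label))

# Does that top feature track the technical covariate?
corr_platform = abs(np.corrcoef(feats[:, top], platform)[0, 1])
is_artifact_driven = corr_platform > 0.9   # illustrative threshold
```

The real pipeline would run this check on SAE features rather than raw columns, but the logic is the same: the most predictive feature is interrogated against technical metadata before any biological claim is made.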
The Goodfire Alzheimer's paper acknowledged this limitation honestly: fragment length might be predictive because it tracks real pathology, or because it correlates with confounders. Interpretability doesn't automatically resolve this ambiguity, but it surfaces the question in a way that black-box prediction never can.
8. Manifold-Guided Experimental Design
The pattern: Use the geometric structure of model representations to identify "gaps" or "sparse regions" in biological knowledge that should be prioritized for experimental investigation.
How the geometry becomes actionable: If the hematopoietic differentiation manifold in scGPT has a region of high curvature or low data density between two known cell states, that region represents a biological transition that is poorly characterized by existing data. You can use the model to predict what the transcriptomic profile of cells in that region should look like (by interpolating on the manifold), and then design FACS sorting strategies to specifically isolate cells matching that predicted profile. The model's internal geometry becomes a map that tells experimentalists where to look next.
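A minimal numpy sketch of the gap-finding idea, using linear interpolation between two synthetic state centroids as a crude stand-in for geodesic interpolation on the learned manifold:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two known cell-state centroids in a synthetic 32-dimensional embedding
# space, with observed cells clustered near the endpoints only.
d = 32
state_a = rng.normal(size=d)
state_b = rng.normal(size=d)
cells = np.vstack([state_a + 0.1 * rng.normal(size=(50, d)),
                   state_b + 0.1 * rng.normal(size=(50, d))])

# Linear interpolation along the chord as a crude stand-in for geodesic
# interpolation on the learned manifold.
ts = np.linspace(0.0, 1.0, 5)
path = np.array([(1 - t) * state_a + t * state_b for t in ts])

# The interpolated point farthest from any observed cell marks the
# under-sampled transition: the region to target experimentally.
dists = np.linalg.norm(path[:, None, :] - cells[None, :, :], axis=2).min(axis=1)
gap_point = int(np.argmax(dists))
```

The profile at the gap point is what a FACS sorting strategy would be designed to capture; on a genuinely curved manifold the interpolation would follow geodesics rather than the straight chord used here.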
The curvature analysis in the Goodfire Alzheimer's paper (finding biologically privileged fragment lengths at curvature peaks around 147 bp and 167 bp) is exactly this kind of reasoning: the geometry of the representation itself pointed to specific values worth investigating, and those values turned out to correspond to known nucleosome biology and to be informative for disease classification.
The broader vision: At scale, this becomes a systematic experimental prioritization engine. Instead of exploring biological parameter space blindly, you let the model's learned geometry tell you where the interesting transitions, boundaries, and anomalies are, and you direct experimental effort toward those regions. The model doesn't replace experiments; it tells you which experiments are most likely to be informative.
The Recurring Structure
Looking across all eight categories, there is a recurring structural pattern: the interpretability step converts an opaque predictive success into a transparent, testable claim about biology that can stand independently of the model. The end-product is never "the model says X" but rather "the model's internals suggest hypothesis X, which we can test with a model-free approach." The Alzheimer's logistic regression distillation is the cleanest example of this principle, but it applies everywhere: the model is a hypothesis generation engine, and interpretability is the interface through which you extract and validate those hypotheses.
What makes this fundamentally different from just using model predictions as hypotheses (which any black-box model can do) is that mechanistic understanding gives you structured hypotheses rather than point predictions. When you know that the model routes information through a specific circuit, you get a whole family of testable claims: about the intermediate nodes, about the edges, about what happens when you ablate each component. Each of those claims is a potential scientific discovery, and the causal intervention toolkit (patching, ablation, feature steering) provides a built-in mechanism for testing them before you ever go to the wet lab.
There is a vision of the future here that I find exciting on many dimensions: one in which the next generation of biological breakthroughs comes not from training ever-larger models and treating them as oracles, but from opening those models up and reading the biology they've written in their weights. The models have already seen more data than any human researcher ever will. The question is whether we can learn to read what they've learned.
Ihor Kendiukhov is the founder of Biodyn, a mechanistic interpretability research organization focused on biological foundation models. The interactive SAE feature atlases for Geneformer and scGPT are publicly available.