Open problems

We have curated a set of problems that we believe are key to advancing from narrow to general virtual cells.

We think about and work on these problems regularly, but the opportunity space is much larger than what we can support through formal collaborations alone.

That is why we decided to publish them, so that more of us can keep pushing on them.

If you can demonstrate meaningful progress (even at small scale) on any of the problems below, we are happy to support your work with:

access to internal, curated datasets
technical feedback and regular discussion with researchers who have industry experience
help validating results in real-world settings

This comes with no fees, but also no funding and no IP transfer.

Your research stays your own; we just help accelerate promising directions.

What counts as progress?

Clear, measurable improvement over a strong baseline (what an experienced computational biologist would build) in a generalization setting that maps well to the real-world application.

PDX effect prediction from in vitro data

Patient-derived xenograft (PDX) models are the dominant preclinical systems in oncology. However, the biological and experimental conditions of PDX studies differ substantially from those of in vitro experiments. Simply aligning the PDX and in vitro transcriptomes with standard alignment methods often removes not only protocol-specific batch effects, but also biologically meaningful differences between the two microenvironments. We believe modeling these differences is necessary to go beyond the current limits of the field.

Progress is improved PDX growth-rate prediction performance in a treatment-exclusive setting:

all PDX models and in vitro data can appear in training
but the evaluated treatment must not appear in any PDX training data, in any modality

This models a scenario where a pharma company wants to determine which PDX models to prioritize for a new molecule that already appears promising in vitro.
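
A minimal sketch of the treatment-exclusive split described above may help make the rule concrete. The record schema (`system`, `model`, `treatment`, `modality`) and the function name `treatment_exclusive_split` are assumptions made for illustration, not an existing data format or API. In vitro data for the held-out treatment stays in training. PDX data for that treatment is removed from training in every modality, and only its growth-rate records are used for evaluation.

```python
from typing import Dict, List, Tuple

# Hypothetical record schema, assumed for illustration only.
Record = Dict[str, str]


def treatment_exclusive_split(
    records: List[Record], held_out: str
) -> Tuple[List[Record], List[Record]]:
    """Split records so the held-out treatment never appears in PDX training data."""
    train: List[Record] = []
    test: List[Record] = []
    for r in records:
        is_held_out_pdx = r["system"] == "pdx" and r["treatment"] == held_out
        if not is_held_out_pdx:
            # All in vitro data and all other PDX data may be used for training.
            train.append(r)
        elif r["modality"] == "growth_rate":
            # Evaluation target: PDX growth rate under the held-out treatment.
            test.append(r)
        # Other PDX modalities for the held-out treatment are dropped entirely.
    return train, test


records = [
    {"system": "in_vitro", "model": "line_A", "treatment": "drug_X", "modality": "viability"},
    {"system": "in_vitro", "model": "line_A", "treatment": "drug_Y", "modality": "viability"},
    {"system": "pdx", "model": "pdx_1", "treatment": "drug_X", "modality": "growth_rate"},
    {"system": "pdx", "model": "pdx_1", "treatment": "drug_Y", "modality": "growth_rate"},
    {"system": "pdx", "model": "pdx_1", "treatment": "drug_Y", "modality": "transcriptome"},
    {"system": "pdx", "model": "pdx_2", "treatment": "drug_X", "modality": "growth_rate"},
    {"system": "pdx", "model": "pdx_2", "treatment": "drug_Y", "modality": "growth_rate"},
]

train, test = treatment_exclusive_split(records, held_out="drug_Y")
```

In this toy example both PDX models still appear in training through `drug_X`, while `drug_Y` is seen during training only in vitro.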

Multimodal synergy

One would expect additional data modalities to improve biological prediction performance.

A richer representation of cell state should, in principle, provide a better understanding of the hidden biological context in which the experiments were performed.

In practice, models trained on multimodal data can reconstruct missing modalities, but only where at least one measurement is available.

While this is useful, the true value of measuring multiple modalities would be in improving prediction performance beyond the contexts where any measurements exist.

Given the large number of possible modality combinations, there are likely many genuinely synergistic combinations that remain unexplored.

Progress is a measurable improvement in in vitro phenotype prediction performance in cell-line-exclusive (CEX) or perturbation-exclusive (PEX) splits compared to the strongest single-modality baseline.
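
The two split types can be sketched as follows, assuming that CEX means no evaluated cell line appears in training and PEX means no evaluated perturbation appears in training. The field names `cell_line` and `perturbation` and the helper `exclusive_split` are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical record schema, assumed for illustration only.
Record = Dict[str, str]


def exclusive_split(
    records: List[Record], key: str, held_out: Set[str]
) -> Tuple[List[Record], List[Record]]:
    """Hold out every record whose `key` value is in `held_out`."""
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test


records = [
    {"cell_line": "line_A", "perturbation": "drug_X"},
    {"cell_line": "line_A", "perturbation": "drug_Y"},
    {"cell_line": "line_B", "perturbation": "drug_X"},
    {"cell_line": "line_B", "perturbation": "drug_Y"},
]

# CEX: the evaluated cell line is never seen in training.
cex_train, cex_test = exclusive_split(records, "cell_line", {"line_B"})

# PEX: the evaluated perturbation is never seen in training.
pex_train, pex_test = exclusive_split(records, "perturbation", {"drug_Y"})
```

Under this reading, the multimodal model and the single-modality baselines would be evaluated on exactly the same held-out records, so that any gain can be attributed to the added modalities rather than to the split.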

Better pretraining

Pretraining on native biological data is intuitively appealing, but so far it has not delivered substantial gains in perturbation prediction.

No currently published approach has convincingly demonstrated that large-scale native pretraining significantly improves perturbation prediction performance.

This may indicate either that native single-cell RNA-Seq is not the right data type for pretraining, or that the current pretraining objectives are too weak.

We're happy to support any approach demonstrating a robust improvement on in vitro cell-line-exclusive (CEX) or perturbation-exclusive (PEX) phenotype prediction using native biological data.

RNA-Seq to phenotype prediction

In an ideal setting, abundant public RNA-Seq datasets would act as a common biological substrate connecting many different downstream assays.

In practice, public Drug-Seq or Perturb-Seq datasets currently provide surprisingly limited improvements for phenotype prediction.

We are happy to support anyone demonstrating robust improvement in zero-shot (cell- or treatment-exclusive) phenotype prediction using unrelated RNA-Seq datasets.

Biological flows

Over short time horizons, the biological landscape of a cell (as defined by its DNA) can be treated as approximately time invariant. That creates a potentially powerful but largely unexplored regularization opportunity. Could iterative or flow-based architectures improve perturbation prediction performance? Can such systems connect experimental protocols with different treatment times?

As usual, progress is a measurable improvement in in vitro cell-line-exclusive (CEX) or perturbation-exclusive (PEX) phenotype prediction performance compared to a non-iterative baseline trained on the same data.
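
To illustrate what time invariance could buy, here is a toy sketch of an iterative model in which the same update rule is reused at every step. Everything in it is an assumption made for illustration: the linear relaxation dynamics, the names `make_step` and `rollout`, and the parameters `decay` and `drive`. In a real system the step would presumably be a learned function of cell state and perturbation.

```python
from typing import Callable, List

Vector = List[float]
Step = Callable[[Vector, float], Vector]


def make_step(decay: float, drive: Vector) -> Step:
    """Build a time-invariant Euler step (toy dynamics, assumed for illustration)."""

    def step(state: Vector, dt: float) -> Vector:
        # The update depends only on the current state, never on absolute time.
        return [x + dt * decay * (d - x) for x, d in zip(state, drive)]

    return step


def rollout(step: Step, state: Vector, hours: float, dt: float = 1.0) -> Vector:
    """Apply the same step repeatedly to cover a given treatment time."""
    for _ in range(round(hours / dt)):
        state = step(state, dt)
    return state


step = make_step(decay=0.1, drive=[1.0, -0.5])
x0 = [0.0, 0.0]

# A 24 h protocol and a 6 h protocol followed by 18 h share the same dynamics.
direct = rollout(step, x0, hours=24)
chained = rollout(step, rollout(step, x0, hours=6), hours=18)
```

Because the step function is shared across time, data from a 6-hour protocol and a 24-hour protocol constrain the same parameters, which is one way such a model might connect experiments with different treatment times.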