Tabula Bio: towards a future free of disease (& looking for collaborators)

Many of the most devastating human diseases are heritable. Whether an individual develops schizophrenia, obesity, diabetes, or autism depends more on their genes (and epi-genome) than it does on any other factor. However, current genetic models do not explain all of this heritability. Twin studies show that schizophrenia, for example, is 80% heritable (over a broad cohort of Americans), but our best genetic model only explains ~9% of variance in cases. To be fair, we selected a dramatic example here (models for other diseases perform far better). Still, the gap between heritability and prediction stands in the way of personalized genetic medicine.

We are launching Tabula Bio to close this gap. We have a three-part thesis on how to approach this.

  1. The path forward is machine learning. The human genome is staggeringly complex. In the 20 years since the Human Gnome Project, much progress has been made, but we are still entirely short of a mechanistic, bottom-up model that would allow anything like disease prediction. Instead, we have to rely on statistical modeling. And statistical methods are winning over expert systems across domains. Expert-system chess AIs have fallen to less-opinionated ML, syntax-aware NLP models were left in the dust by LLMs, and more recently constraint-based robotics is being replaced by pixel-to-control machine learning. We are betting on the same trend extending to biology.

  2. The core problem is limited data and large genomes. The human genome contains 3 billion base pairs, while labeled biobank datasets (genomes and disease diagnoses) are numbered in the hundreds of thousands. Complex models thus hopelessly overfit and fail to generalize. Additionally, the human genome is highly repetitive and much of it likely has no relation to phenotype. Because of these problems, we can’t simply train high-parameter black-box models (or treat DNA like language in a language model).

  3. Given 1 and 2, novel ML architectures will be required. This is consistent with other breakthroughs in AI. Different problems require different inductive biases. An ML architecture for disease prediction should:

    1. Make use of unlabeled genomics data (human and non-human) as well as homogeneous biobank data. Much genetic coding is conserved across species. Ignoring this is leaving data on the table.

    2. Include priors from human expert research. It would be wonderful to start from a blank slate (we named the company Tabula for a reason). But, for example, we know that DNA is a 3D molecule and that the distance (in 3D space) between genes and regulatory sequences matters. An architecture needs to start from this premise.

    3. Include epigenetic data. We know it’s part of the story (maybe a large part).

We’re interested in probabilistic programming as a method to build such a model.

This is an ambitious idea, and we’re just getting started. But if we look into the future, to a world where humans have closed the heritability gap and personalized genetic medicine has eradicated great swaths of disease, it’s hard to imagine we did not get there via an effort like this.

Our team is currently Ammon Bartram, who was previously cofounder of Triplebyte and Socialcam, and Michael Poon, who has spent the past several years working on polygenic screening, an early employee at Twitch, and studied CS at MIT. We’re honored to have Michael Seibel and Emmett Shear among our initial investors.

Please reach out to us at team@tabulabio.com to help us harden our hypotheses or collaborate together on this mission! We’re especially interested in talking with ML engineers interested in genomics and bioinformaticians interested in ML.