Walking through our Tahoe-100M Manuscript

Authored by
Johnny Yu
Released on
February 25, 2025
Authored by
Johnny Yu
Released on
February 25, 2025
Summary

Along with the release of our Tahoe-100M dataset, we also released a manuscript, describing the science and engineering as well as the story behind our work. In this post, we would like to give you an overview of that.

Tahoe is the largest publicly available single-cell dataset that measures the effect of 1200 genes on 50 cell line models. When we started to work on scalable measurements of drug response many years ago, we never thought this scale was remotely possible! There is enough data in here to keep a cancer lab going for a few generations!

But what gets me excited is how this data will supercharge our AI modeling efforts. And by our, I mean collectively the whole field.

AI has begun to transform biology. From proteins to DNA, we have built capable foundation models that leverage unsupervised learning on large collections of public data. Other domains in biology, however, remain beyond our grasp, largely because they are data poor domains.

AI models of the cell, which we have come to know as “virtual cell”, is our next holy grail as we chart a path for bringing capable AI models to biology. Some of the greatest minds of the field have built new NN architectures to tackle this problem. Read; https://www.cell.com/cell/fulltext/S0092-8674(24)01332-1

It has become clear, however, that we are held back by data. And we are excited to be in a position to reshape the data landscape in one fell swoop. At 100M cellular measurements, Tahoe-100M increases the number of cells measured post-perturbation by 50 times. It is not just more single cell data… It is perturbational data!

This was all made possible by the Mosaic platform! What is Mosaic? Tahoe's (formerly Vevo) CSO, Johnny Yu,  took his work in our lab, and scaled it in every dimension… Mosaic brings a highly diversified, exquisitely optimized, and optimally balanced “cell village” approach to perturbation data collection.

With 50 cell lines in each Mosaic pool, we measure how various cell lines, with varying baseline genetics and gene expression, respond to various treatments. And using a cell village means we eliminate batch effects and allow extreme parallelization.

Despite all that, we benefited immensely from other technologies: the scaling of single cell data and ultra-high throughput sequencing. The Parse Bioscience GigaLab platform and  sequencing platform helped Tahoe-100M to achieve an unprecedented scale.

After all this, we chose to “open-source” Tahoe-100M hot off the press. Because we believe in the transformative capabilities of AI, and we think Tahoe-100M removes a key bottleneck in getting us closer to cell biology’s GPT moment. This will benefit us all...

This is accessible today! Visit our HuggingFace to download: https://huggingface.co/datasets/tahoebio/Tahoe-100M