In simple terms, what is C2S-Scale, how does it ‘read’ the language of individual cells, and why do you consider it a breakthrough in single-cell analysis?
C2S-Scale is a family of large language models (LLMs) built on Google’s Gemma-2 architecture. Think of it as a specialised AI model that we’ve taught to understand the language of biology, in the form of gene expression inside cells. We do this by taking the complex gene activity inside a single cell, measured by a technique called single-cell RNA sequencing (scRNA-seq), and translating it into a simple “cell sentence”: a list of the most active genes, ordered by their activity.
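For readers who want the mechanics, here is a minimal Python sketch of that translation: rank a cell’s genes by expression and keep the top names. The function name and the top_k cutoff are illustrative choices, not the project’s actual API.

```python
import numpy as np

def to_cell_sentence(expression, gene_names, top_k=100):
    """Rank genes by expression and join the top-k names into a 'cell sentence'."""
    order = np.argsort(expression)[::-1]  # most-expressed genes first
    top = [gene_names[i] for i in order[:top_k] if expression[i] > 0]
    return " ".join(top)

# Toy example: four genes measured in one cell
genes = ["CD3D", "GAPDH", "MS4A1", "ACTB"]
counts = np.array([120.0, 85.0, 0.0, 43.0])
print(to_cell_sentence(counts, genes, top_k=3))  # -> "CD3D GAPDH ACTB"
```

Ordering by activity is what lets a text model treat a transcriptome like a sentence, with the most telling “words” first.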
The model “reads” these sentences across millions of cells and learns the patterns of gene expression that define what a cell is and what it’s doing. The paradigm shift is that this approach bridges the gap between raw genomic data and human language, and allows LLMs to perform complex tasks on cells in natural language.
C2S-Scale generated a new hypothesis about cancer cell behavior, which you then confirmed in living cells. Can you explain that hypothesis?
Our immune system is constantly looking for unhealthy or diseased cells, but cancer cells are often good at hiding. We asked our model to find drugs that could make cancer cells more “visible” to the immune system by acting as a conditional amplifier: increasing antigen presentation in cancer cells, but only in the presence of low levels of interferon (a key immune signaling protein).
Our model predicted that a drug called silmitasertib would significantly boost antigen presentation in the immune-context-positive setting. This prediction serves as a promising hypothesis that now requires rigorous validation through research and clinical trials.
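As a rough illustration of what such an in-silico screen could look like, the sketch below scores each candidate drug by how much it boosts predicted antigen presentation with low-dose interferon versus on its own. The `predict` callable is a hypothetical stand-in for querying the model, not C2S-Scale’s real interface.

```python
def conditional_amplifier_score(drug, predict, baseline_threshold=0.1):
    """Score a drug by how much it boosts predicted antigen presentation
    only in the presence of low-dose interferon (IFN)."""
    effect_alone = predict(drug, interferon=False)     # drug by itself
    effect_with_ifn = predict(drug, interferon=True)   # drug + low-dose IFN
    # A conditional amplifier does little alone but synergises with IFN
    if effect_alone < baseline_threshold:
        return effect_with_ifn - effect_alone
    return 0.0

def screen(drugs, model_predict, top_n=10):
    """Rank a virtual drug library by conditional-amplifier score."""
    scored = [(d, conditional_amplifier_score(d, model_predict)) for d in drugs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

In spirit, this is the filter described above: a useful candidate does little on its own but synergises with the interferon context.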
Single-cell RNA sequencing lets scientists peek inside individual cells, but the data is massive and complicated. How does C2S-Scale make sense of all that information and understand what’s happening inside a cell?
The key is in its training. Before we asked it to do a complex task like drug screening, we put C2S-Scale through a rigorous pre-training phase. We trained it on a massive dataset of over 50 million cells from public repositories like the Human Cell Atlas, covering a wide range of human and mouse tissues, diseases, and conditions.
During this pre-training, we gave it a series of fundamental tasks, like predicting a cell’s type based on its “cell sentence,” identifying its tissue of origin, or even generating a realistic new cell from scratch. By mastering these foundational tasks, the model learns the fundamental patterns of gene expression. This biological intuition is what allows it to make sense of new, complex information and perform sophisticated reasoning in later stages.
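To make those tasks concrete, here is a hypothetical sketch of how they could be posed as prompt/completion pairs for a language model; the actual C2S-Scale templates may differ.

```python
# Illustrative pre-training examples for the tasks described above;
# prompts, labels, and formats are assumptions, not the real templates.
cell_sentence = "CD3D CD3E TRAC IL7R GAPDH"

tasks = [
    {   # cell-type prediction
        "prompt": f"Cell sentence: {cell_sentence}. What is the cell type?",
        "completion": "T cell",
    },
    {   # tissue-of-origin prediction
        "prompt": f"Cell sentence: {cell_sentence}. Which tissue is this cell from?",
        "completion": "peripheral blood",
    },
    {   # conditional generation of a realistic new cell
        "prompt": "Generate a cell sentence for a human T cell.",
        "completion": cell_sentence,
    },
]

for task in tasks:
    print(task["prompt"], "->", task["completion"])
```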
This model has 27 billion parameters, which is huge. Why does the scale of the AI matter when it comes to discovering new biology?
Scale is critical because biology is unimaginably complex. A large model, like our 27 billion-parameter C2S-Scale, has a greater capacity to learn and remember the countless subtle relationships between genes, cells, and tissues. There’s a well-known phenomenon in AI called “scaling laws”, where larger models don’t just get incrementally better; they often develop entirely new, emergent capabilities that smaller models lack. For a problem as vast as understanding life at the cellular level, that massive scale is essential to uncover genuinely new biological insights.
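For context, the “scaling laws” mentioned here are usually summarised as a power law (Kaplan et al., 2020), in which test loss falls predictably as parameter count grows. The generic form is shown below; the fitted constants for C2S-Scale itself are not public.

```latex
% Generic neural scaling law (Kaplan et al., 2020); constants are fitted per model family.
% L(N): test loss, N: parameter count, N_c and \alpha_N: empirical constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```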
The model predicted that a drug called silmitasertib could make certain cancer cells more visible to the immune system, but only under very specific conditions.
How did you test this in actual cells, and how did you confirm that the AI’s prediction really works in the lab?
To validate the AI’s prediction, we took it to the lab. We used human neuroendocrine cancer cell lines that the model had never seen before, and set up a controlled experiment with two scenarios: cells treated with silmitasertib alone, and cells treated with a low dose of the immune signal (interferon) along with silmitasertib.
The results confirmed the AI’s prediction. The drug by itself had no effect on the cells’ visibility markers. But when we combined it with low levels of interferon signaling, we saw a marked and significant increase in the molecules that make cancer cells visible to the immune system. It was a clear demonstration of the synergy the model had predicted, moving an AI-generated hypothesis from the computer to a real biological outcome.
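A toy version of that comparison, with numbers invented purely for illustration (not the study’s data), might look like this: compute fold changes of a visibility marker over untreated cells, then check whether the combination exceeds the sum of the individual effects.

```python
import statistics

# Hypothetical marker readouts (e.g., surface MHC-I, arbitrary units);
# all values below are made up to illustrate the analysis, not real results.
readouts = {
    "untreated":     [1.00, 0.95, 1.05],
    "silmitasertib": [1.02, 0.98, 1.01],  # drug alone: ~no change
    "low_dose_IFN":  [1.40, 1.35, 1.45],
    "IFN + drug":    [3.10, 2.90, 3.20],  # combination: marked increase
}

baseline = statistics.mean(readouts["untreated"])
for condition, values in readouts.items():
    fold = statistics.mean(values) / baseline
    print(f"{condition:>15}: {fold:.2f}x over untreated")

# Simple Bliss-style synergy check: is the combined effect larger than
# the sum of the individual effects over baseline?
excess = (statistics.mean(readouts["IFN + drug"]) - baseline) - (
    (statistics.mean(readouts["silmitasertib"]) - baseline)
    + (statistics.mean(readouts["low_dose_IFN"]) - baseline)
)
print(f"excess over additive: {excess:.2f} (positive suggests synergy)")
```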
It’s important to note the limitations of this validation: these experiments were conducted in vitro, not in a living organism. Furthermore, this was observed in a specific neuroendocrine cancer cell line. While these results are highly promising, significant further research and clinical trials would be required to understand if this effect translates into a safe and effective therapy for patients.
If C2S-Scale can find ways to make cancer cells more visible to the immune system, what does that mean for developing new treatments or speeding up drug discovery?
Traditional drug discovery involves physically screening thousands of compounds in a lab, which is incredibly slow, expensive, and often misses the mark. C2S-Scale allows us to perform these massive screening experiments in silico — inside the computer — at a scale and speed that would be impossible in the real world. This shows AI can be a powerful accelerator for science.
This doesn’t replace scientists, but it empowers them. It allows us to rapidly identify and prioritise the most promising and often non-obvious drug candidates. By narrowing the search space, AI can help researchers focus their lab experiments where they’re most likely to succeed, dramatically shortening the timeline from an initial idea to a potential new therapy.
AI can connect different sources of knowledge to come up with new ideas. In this case, C2S-Scale didn’t just look at cell data; it also read the biological context around it. How does it combine all that information to generate something new?
This gets to the heart of our multimodal approach. During its training, C2S-Scale wasn’t just fed raw cell sentences. It saw them alongside the human-generated context they came from — things like scientific annotations, tissue and disease labels, and even summaries from the research papers where the data was published.
By being trained on this rich mixture of biological data and natural language simultaneously, the model learns to connect the dots. It understands that a certain pattern of genes is not just a list, but corresponds to a “T-cell in a kidney from a patient with this disease,” as described in a scientific abstract. This ability to bridge the world of cellular data with the world of human knowledge is what allows it to generate novel hypotheses.
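A hypothetical sketch of one such multimodal training record: the field names and the template below are assumptions for illustration, not the actual C2S-Scale data format.

```python
# Illustrative record pairing a cell sentence with its human-generated context.
record = {
    "cell_sentence": "CD3D CD3E TRAC IL7R GZMB",
    "cell_type": "T cell",
    "tissue": "kidney",
    "disease": "lupus nephritis",
    "abstract": "We profiled immune infiltrates in kidney biopsies ...",
}

def to_training_text(r):
    """Interleave natural-language context with the cell sentence so the
    model learns to connect the two during training."""
    return (
        f"Study summary: {r['abstract']}\n"
        f"Tissue: {r['tissue']}. Disease: {r['disease']}.\n"
        f"Cell sentence: {r['cell_sentence']}\n"
        f"Cell type: {r['cell_type']}"
    )

print(to_training_text(record))
```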