
Monthly AI Seminar Synopsis: Multimodal Generative Models for Biomedical Imaging Data

Seminar Summary for HBHI Workgroup on AI and Healthcare, Featuring Dr. Serena Yeung-Levy

At the May HBHI Workgroup on AI and Healthcare seminar, Dr. Serena Yeung-Levy of Stanford University gave a wide-ranging and concrete tour of multimodal generative models for biomedical imaging. Her talk, "Multimodal Generative Models: Extracting Scientific Insights from Biomedical Imaging Data," asked how far today's models can go beyond labeling images toward something closer to scientific reasoning. The session was moderated by Dr. Tinglong Dai and Dr. Risa Wolf, with discussion by Dr. Hadi Kharrazi of Johns Hopkins.

A central tension ran through the session. Biomedical images are everywhere, from cells and tissue to radiology and surgical video, but much of their meaning depends on expert context. A model that can "see" an image in a broad sense may still miss the small feature that matters. Dr. Yeung-Levy's talk kept returning to that gap between image recognition and expert perception.

Speaker Bio: Dr. Serena Yeung-Levy is an Assistant Professor of Biomedical Data Science at Stanford University. Her research focuses on artificial intelligence and machine learning methods that enable new capabilities in biomedicine and healthcare. She leads the Medical AI and Computer Vision Lab at Stanford, is affiliated with the Stanford Artificial Intelligence Laboratory, the Clinical Excellence Research Center, and the Center for Artificial Intelligence in Medicine & Imaging, and is a Chan Zuckerberg Biohub Investigator.

The problem: seeing like a scientist

Dr. Yeung-Levy began by framing biomedical imaging as a natural test bed for multimodal AI. The aim is not just automated interpretation. The harder goal is to connect visual evidence with language, prior knowledge, and scientific decision-making. In practice, that means asking whether a model can help a researcher compare image sets, identify abnormalities, generate hypotheses, propose experiments, or troubleshoot image acquisition.

That is the motivation behind MicroVQA, a benchmark for visual question answering in microscopy-based scientific research. The benchmark grew out of interviews with biology researchers about what would actually be useful in their work. Those conversations led to three broad capability areas: expert visual understanding, hypothesis generation, and experiment proposal.

MicroVQA and the perception gap

MicroVQA includes more than 1,000 research-level visual question-answering items. Many were converted into multiple-choice questions through a careful process that combined model assistance, clinician-informed exam-writing guidance, and repeated checks for shortcuts. Dr. Yeung-Levy made a point that many educators in the room could recognize immediately: writing a good test question is hard. Distractors matter. Language bias matters. A question that can be answered without looking at the image is not really testing visual reasoning.
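The last point, that a question answerable without the image is not testing visual reasoning, lends itself to a simple check. The sketch below is illustrative only and is not the MicroVQA team's procedure: it assumes each multiple-choice item has already been answered by a model given the question text alone, and it compares that image-blind accuracy with chance. The names `BlindResult` and `shortcut_report` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BlindResult:
    """One multiple-choice item answered by a model that never saw the image."""
    question_id: str
    n_options: int       # number of answer choices offered
    correct_index: int   # index of the keyed answer
    blind_choice: int    # index picked from the question text alone


def shortcut_report(results):
    """Compare image-blind accuracy with chance and list suspect items."""
    if not results:
        raise ValueError("need at least one result")
    # Expected accuracy of uniform random guessing, averaged over items.
    chance = sum(1.0 / r.n_options for r in results) / len(results)
    # Items answered correctly without the image are candidates for review.
    suspects = [r.question_id for r in results
                if r.blind_choice == r.correct_index]
    blind_accuracy = len(suspects) / len(results)
    return {
        "chance": chance,
        "blind_accuracy": blind_accuracy,
        "excess_over_chance": blind_accuracy - chance,
        "suspect_ids": suspects,
    }


if __name__ == "__main__":
    # Made-up demonstration data, not benchmark results.
    demo = [
        BlindResult("q1", 4, 0, 0),
        BlindResult("q2", 4, 2, 1),
        BlindResult("q3", 4, 3, 3),
        BlindResult("q4", 4, 1, 0),
    ]
    print(shortcut_report(demo))
```

A blind accuracy well above chance suggests that distractors or wording are leaking the answer. Any single item on the suspect list may simply be a lucky guess, so the list is a starting point for human review rather than a verdict.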

The results were encouraging but not simply reassuring. Newer frontier models are improving, especially as reasoning models become stronger. Still, error analysis showed that a common failure mode is insufficient expert perception. The model may land in the right neighborhood, yet miss the feature that changes the answer. Dr. Yeung-Levy described this as a kind of blurry vision problem: the language reasoning may be impressive, but the visual representation can still be too coarse for expert biomedical work.

BIOMEDICA and the data problem

The next part of the talk turned to data. Dr. Yeung-Levy introduced BIOMEDICA, an open biomedical vision-language resource built from roughly 6 million open-access scientific articles in PubMed Central. The dataset contains 24 million images and captions, 30 million figure references, associated metadata, and annotated concepts.

The point is not scale alone. Biomedical images rarely come with perfect text. A figure caption may be useful but incomplete; relevant context may sit elsewhere in the article; and the relationship between image and text is often weaker than a model developer would like. Dr. Yeung-Levy described the work needed to turn this kind of literature-derived resource into something genuinely useful: concept labeling, filtering, balancing, better caption generation, longer-context encoders, and retrieval-augmented systems. Her group has already used BIOMEDICA to train vision-language embedding models and visual question-answering models, and to support biomedical guideline question answering through RAG-style systems.

A harder visual domain: surgery

Dr. Yeung-Levy also discussed work evaluating vision-language models in surgical AI. Surgery is a demanding setting for visual models: the data often come as long videos, important cues can be subtle, systematic data capture is less mature than in many EHR-linked domains, and surgical images can look very different from the data these models see during broad pretraining.

The findings were mixed in an instructive way. General frontier models can recognize some basic surgical scene elements, such as tools or anatomy. But task-specific classifiers still perform better on many precise detection and segmentation tasks. Where foundation models looked more promising was out-of-domain generalization. When models are tested on data that differ from the training distribution, the gap between specialized classifiers and broader foundation models can narrow.

That matters for real deployment. Health systems rarely get perfectly matched data forever.

CellFlex and virtual-cell modeling

In the final portion of the talk, Dr. Yeung-Levy shifted from vision-language reasoning to image generation for biological simulation. She presented CellFlex, a flow-matching approach for simulating how cell morphology changes in response to chemical or genetic perturbations. The broader vision is virtual-cell modeling: learning how a cell state changes under intervention, with possible applications in basic science, experimentation, and drug development.

CellFlex treats the task as distribution-to-distribution mapping, from control cell images to perturbed cell images. This makes flow matching a natural fit because the goal is not to generate an image from noise, but to map one biologically meaningful distribution to another. One especially useful feature is that researchers can visualize interpolations through the learned velocity field, which may eventually make these transformations more biologically interpretable. Dr. Yeung-Levy also described ongoing work to scale this approach to millions of images, study scaling laws, and jointly model images with transcriptomics.
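For readers who want the idea in symbols, the distribution-to-distribution framing can be written in the generic flow-matching form below. This is a textbook-style sketch under stated assumptions, not necessarily the exact CellFlex objective: it assumes a straight-line path between a control image x_0 and a perturbed image x_1, a perturbation label c, and a learned velocity field v_θ. All of these symbols are introduced here for illustration.

```latex
% Assumed notation: x_0 ~ p_control, x_1 ~ p_perturbed, c = perturbation label,
% v_\theta = learned velocity field. Generic flow matching with a linear path.
\begin{align}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1], \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\rVert^2, \\
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\theta(x_t, t, c), \qquad x_{t=0} = x_0 .
\end{align}
```

Integrating the last equation from t = 0 to t = 1 carries a control image to a simulated perturbed image. Stopping at intermediate values of t gives the kind of interpolation mentioned above. Because the starting point is a real control image rather than noise, the trajectory stays tied to a biologically meaningful source distribution.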

Discussant Spotlight: Dr. Hadi Kharrazi on trust, bias, and human bottlenecks

As discussant, Dr. Hadi Kharrazi pushed the conversation toward implementation. He noted that the work spanned cell-level simulation, pathology, imaging, surgery, and even scientific reasoning itself. Then he asked the hard question: if these models become larger black boxes, and if humans cannot check everything they produce, how can health systems use them without making humans the bottleneck again?

Dr. Yeung-Levy's answer was careful. Better benchmarks are one part of the answer because they allow faster, repeated, large-scale evaluation as models change. Uncertainty quantification is another. Models need better ways to surface what they do and do not know, and researchers need better ways to measure uncertainty that comes from incomplete data, model design, and generation. She was also candid that this remains an open problem. No one has solved it.

Dr. Kharrazi then connected MicroVQA's use of Bloom's taxonomy to education. If AI systems become very strong at lower levels of the taxonomy, should educators still spend time teaching those skills? Dr. Yeung-Levy said foundational knowledge still matters, but the balance may shift. Students may need less training optimized for memorization and more training on how to understand AI's strengths, limits, and trust boundaries. Dr. Kharrazi added that students are already using ChatGPT in class, which makes the question feel less like a future scenario and more like the current semester.

His final question turned to radiology. A radiologist neighbor had asked when AI would put him out of a job. Dr. Yeung-Levy acknowledged that AI is already strong in some types of radiology image analysis and may reduce parts of the workload, especially as imaging volume grows. But she also emphasized the long tail of radiology tasks where data are limited and performance remains difficult. One more hopeful possibility is that AI could help radiologists move back toward a more integrated role in care teams, bringing their visual and disease-understanding expertise into care planning rather than spending so much of their time on image interpretation alone.

From the Floor: lossy images, longer thinking, and what models actually need

The audience questions sharpened the theme of what information a model should receive. Dr. Stuart Ray asked whether biological modeling should avoid lossy intermediate representations when possible, drawing on experience with lens-free holographic imaging of cells where machine learning performed better on raw holograms than on reconstructed images. Dr. Yeung-Levy agreed with the general principle. Deep learning often does better when intermediate lossy steps can be removed and the model can learn end to end. The practical catch is data: when data are limited, adding structure can still be necessary.

Dr. Antonio Trujillo raised an economist's question about image recognition: what is the tradeoff between accuracy and time? Some settings require rapid recognition, such as autonomous vehicles. Others may allow the model to think longer if that improves accuracy. Dr. Yeung-Levy connected this to the rise of deeper reasoning models. Longer reasoning can improve performance, including on MicroVQA. And as the field begins to exhaust easy sources of internet-scale data, more attention is turning to a different question: how do we get more value out of each data point by reasoning longer and better?

Looking Ahead

Dr. Wolf closed by noting that this was a fitting way to wrap up the virtual portion of this year's series. The discussion pointed directly to where the field is going next: AI that does more with visual information, but still has to earn trust in scientific and clinical settings. She and Dr. Dai also shared two upcoming opportunities in the chat: the June 4 in-person HBHI-AI event at the DSAI Mt. Washington Campus and the June 11 Learning Health Systems Symposium.

After several years of virtual programming, we are especially excited that the June 4 gathering will be our first in-person event in this series. We hope to see many of you there.

Thank you again for being part of this community!