Seminar Summary for HBHI Workgroup on AI and Healthcare
At this HBHI Workgroup on AI and Healthcare seminar, the conversation focused on a single, consequential question: how should health systems evaluate clinical AI in ways that truly guide patient care? The session, moderated by Tinglong Dai, featured Dr. Michael Oberst, Assistant Professor of Computer Science at Johns Hopkins, who laid out a practical, evidence-first approach to evaluation. He framed automated testing, human-in-the-loop assessment, and randomized trials as complementary tools, each suited to a distinct stage in the journey from promising model to trustworthy clinical aid. Throughout, he returned to one anchor: methodology and transparency determine whether findings travel from benchmarks to the bedside.
Speaker Bio: Dr. Michael Oberst is an Assistant Professor of Computer Science at Johns Hopkins. His research centers on rigorous and efficient evaluation of AI and machine learning systems in healthcare, with particular attention to statistical uncertainty, human oversight, and the tradeoffs between evidence quality and feasibility. He collaborates with clinicians to design evaluations that support credible, reproducible claims about model performance and clinical impact.
The Case for Rigorous Evaluation
Dr. Oberst opened by treating evaluation as a clinical science rather than a leaderboard contest. Static benchmarks can signal potential, yet real clinical environments present shifting data, confounding factors, and feedback effects that can mislead simple accuracy measures. He argued for prespecified protocols, for explicit accounting of statistical uncertainty, and for evaluation pathways that match the stakes of the decision at hand. Automated evaluations help screen many ideas quickly. Human-in-the-loop studies probe usefulness and safety in context. Randomized trials quantify impact when the question and resources call for them. The aim is not to elevate a single method, but to align method to decision so that evidence carries weight beyond a test set.
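As a concrete illustration of what explicit accounting of statistical uncertainty can look like in practice, the sketch below (not code from the seminar) compares two models on the same test items using a paired bootstrap interval for the accuracy gap, rather than a raw leaderboard difference; the data and the paired_accuracy_diff_ci helper are hypothetical.

```python
# Minimal sketch: paired bootstrap CI for an accuracy difference on a shared test set.
import numpy as np

def paired_accuracy_diff_ci(correct_a, correct_b, n_boot=10_000, alpha=0.05, seed=0):
    """correct_a, correct_b: 0/1 arrays of per-item correctness for two models."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    n = len(a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample test items with replacement
        diffs[i] = a[idx].mean() - b[idx].mean()
    point = a.mean() - b.mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return point, (lo, hi)

# Hypothetical per-item correctness for two models on the same 200 questions.
rng = np.random.default_rng(1)
model_a = rng.binomial(1, 0.72, size=200)
model_b = rng.binomial(1, 0.70, size=200)
print(paired_accuracy_diff_ci(model_a, model_b))  # a 2-point gap whose interval may include 0
```

On a few hundred questions, an interval like this often straddles zero, which is the kind of uncertainty a leaderboard delta alone conceals.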
What Happens When Medical Models Face Their Base Versions
A central section reviewed findings from studies that compared medical domain-adaptive language models with the strong base models they seek to surpass. The headline was cautionary. Once prompt sensitivity and statistical uncertainty were handled with care, specialized medical variants did not consistently outperform their base counterparts on clinical question answering. In many direct comparisons, medical models lost more often than they won, with clearer gains largely in settings where supervised fine-tuning was targeted to medical QA. Dr. Oberst noted that small choices in prompting, sampling, and scoring can flip apparent winners and losers. He pointed to efforts like LLaVA-Med to illustrate the drive to combine general-purpose capability with medical training, while emphasizing that the right test is whether any model changes clinical decision making for the better.
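To make the prompt-sensitivity point concrete, here is a minimal sketch, assuming a generic score_fn interface that is not drawn from the studies discussed, of scoring a model under several prompt templates and reporting the spread so that a comparison cannot hinge on one favorable template.

```python
# Minimal sketch: evaluate a model under several prompt templates and report the spread.
import numpy as np

def accuracy_by_template(score_fn, templates, questions, gold_answers):
    """score_fn(template, question) -> predicted answer; returns one accuracy per template."""
    accs = []
    for template in templates:
        preds = [score_fn(template, q) for q in questions]
        accs.append(np.mean([p == g for p, g in zip(preds, gold_answers)]))
    return np.array(accs)

# Report min / median / max across templates instead of a single favorable number, e.g.:
# accs = accuracy_by_template(medical_model, TEMPLATES, questions, gold_answers)
# print(accs.min(), np.median(accs), accs.max())
```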
From Methods to Practice: Proxies, PPI, and PPI++
Because high-quality human labels are expensive, the seminar turned to how to evaluate responsibly without breaking budgets. Dr. Oberst explained prediction-powered inference and the enhanced approach PPI++, which use proxy labels at scale while estimating and correcting proxy bias. He demonstrated a calculator that helps teams plan studies by linking sample size and expected proxy-to-truth correlation to the precision they can achieve. When proxies correlate strongly with ground truth, PPI++ can reduce cost and speed learning while preserving valid inference. When correlation is weak, the right move is to invest in better human judgments rather than lean on proxies that cannot support credible conclusions.
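A minimal sketch of prediction-powered inference with power tuning (PPI++) for estimating a single performance metric appears below. It is a textbook-style reconstruction, not the speaker's code or calculator, and the variable names are illustrative: a small set with human labels and proxy labels, plus a large set with proxy labels only.

```python
# Minimal sketch of PPI++ mean estimation from a small human-labeled set and many proxy labels.
import numpy as np
from scipy import stats

def ppi_pp_mean(y_small, proxy_small, proxy_large, alpha=0.05):
    """y_small: human labels on n items; proxy_small: proxy labels on the same n items;
    proxy_large: proxy labels on N additional items that have no human label."""
    y, f, f_big = (np.asarray(v, dtype=float) for v in (y_small, proxy_small, proxy_large))
    n, N = len(y), len(f_big)

    # Power-tuning weight: how much the proxy deserves to be trusted.
    lam = np.cov(y, f, ddof=1)[0, 1] / (np.var(f, ddof=1) * (1 + n / N))

    # Proxy mean on the large set, corrected by the labeled-set "rectifier".
    theta = lam * f_big.mean() + (y - lam * f).mean()

    # Plug-in variance and a normal-approximation confidence interval.
    var = lam**2 * np.var(f_big, ddof=1) / N + np.var(y - lam * f, ddof=1) / n
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return theta, (theta - half, theta + half)
```

Roughly speaking, when the proxy-only set is much larger than the labeled set and the proxy correlates with the human label at about rho, the labeled-set contribution to the variance shrinks by a factor near 1 - rho^2; a planning calculator can turn that relationship into sample-size and precision targets.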
Discussant Spotlight: Dr. Esther Oh on Representation and Study Design
As discussant, Dr. Esther Oh brought the conversation back to the people represented in the data and to the obligations that follow from that reality. She pressed on the evaluation of AI for older adults and for minoritized patients, and asked how study plans will address known representation gaps rather than assume they wash out in aggregate. Her framing made equity a methodological requirement. If a system is intended for diverse populations, subgroup performance must be measured intentionally, with prespecified analyses and stable criteria. She also argued that evaluation should mirror the care pathway where AI is introduced. If a tool is placed in primary care or triage, outcome measures should reflect those settings and the decisions clinicians must make there, not only abstract accuracy.
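As one way to operationalize that requirement, the sketch below (illustrative, not from the discussion) reports accuracy for each prespecified subgroup with an exact interval, so small or underrepresented groups surface as wide intervals rather than disappearing into the aggregate; the helper and grouping variables are placeholders.

```python
# Minimal sketch: per-subgroup accuracy with exact (Clopper-Pearson) intervals.
import numpy as np
from scipy.stats import beta

def subgroup_accuracy(correct, group, prespecified_groups, alpha=0.05):
    """correct: 0/1 per case; group: subgroup label per case."""
    correct, group = np.asarray(correct), np.asarray(group)
    report = {}
    for g in prespecified_groups:          # iterate the prespecified list, even if a group is small
        mask = group == g
        n, k = int(mask.sum()), int(correct[mask].sum())
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        report[g] = {"n": n, "accuracy": k / n if n else float("nan"), "ci": (lo, hi)}
    return report
```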
From the Floor: Monitoring, Costs, and Clinical Disagreement
Audience questions carried the discussion from design to deployment. Dr. Stuart Ray focused on the infrastructure that health systems need for post-deployment monitoring, pointing to the risk of drift when context changes. Dr. Kadija Ferryman examined the cost assumptions that determine whether AI tools are adopted and the challenge of defining ground truth when clinicians legitimately disagree. Dr. Oberst emphasized careful data curation, live monitoring capable of detecting performance shifts, and well-defined rubrics that specify how disagreement is adjudicated so that standards do not move from case to case.
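One simple shape such monitoring infrastructure can take, sketched here under assumptions rather than described in the seminar, is a rolling window over adjudicated cases with a prespecified alert threshold; the class name and thresholds below are hypothetical.

```python
# Minimal sketch: a rolling-window performance monitor with a prespecified alert threshold.
from collections import deque

class RollingPerformanceMonitor:
    def __init__(self, window: int = 500, baseline: float = 0.85, tolerance: float = 0.05):
        self.cases = deque(maxlen=window)   # most recent adjudicated cases (1 = correct)
        self.floor = baseline - tolerance   # prespecified control limit

    def update(self, correct: bool) -> bool:
        """Record one adjudicated case; return True if the monitor should raise an alert."""
        self.cases.append(1.0 if correct else 0.0)
        if len(self.cases) < self.cases.maxlen:
            return False                    # wait for a full window before alerting
        return sum(self.cases) / len(self.cases) < self.floor
```

The adjudicated labels feeding such a monitor are exactly where the well-defined rubrics matter, since a shifting standard for "correct" would masquerade as drift.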