Principles to guide clinical AI readiness and move from benchmarks to real-world evaluation
Tej Azad, Harlan M. Krumholz, Suchi Saria
We propose straightforward principles to foster an evaluation-forward operating system that can transform the adoption of clinical artificial intelligence from a leap of faith into a stepwise, trust-building process.
Clinical artificial intelligence (AI) evaluation resembles standardized testing more than bedside medicine. Most studies emphasize retrospective accuracy on curated datasets or short vignettes, with limited measurement of workflow fit, adoption, safety guardrails or downstream care impacts. Recent generative AI systems report striking benchmark gains (refs. 1,2), but the evaluations rarely match the intended use, and reproducible gains in clinical process and outcomes remain scarce. Integrative reviews (ref. 3) confirm the disconnect, finding few pragmatic, randomized assessments. Endpoints are often off-target, reporting of who used the tool, when and how is sparse, and monitoring for drift is minimal. Drift is defined here as a degradation in real-world model performance over time as the patient population, data capture or practice patterns shift away from those represented in the data on which the model was trained.
The result is a credibility gap. Clinicians see high-profile performance gains in artificial environments (that is, leaderboard wins), but not evidence that the tools help patients, within existing workflows, at an acceptable level of risk. What they need are tools that work in real clinical environments, not leaderboard winners. To close this gap, we must complement current benchmarks with measures of success at three levels: task performance, real-world use and patient outcomes. To justify the investment, we need evidence that AI systems can improve patient-centered outcomes and healthcare economics.
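As an illustration of the drift monitoring mentioned above, the following is a minimal sketch rather than a method proposed in this article. It assumes that labeled outcomes become available for each monitoring window, uses plain accuracy as the performance measure and flags any window in which accuracy falls more than a chosen tolerance below a baseline. The names `WindowResult`, `accuracy` and `monitor_drift`, and the 0.05 default tolerance, are hypothetical choices made only for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class WindowResult:
    """Performance summary for one monitoring window (hypothetical structure)."""
    label: str
    accuracy: float
    drifted: bool


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the observed outcomes."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def monitor_drift(
    baseline_accuracy: float,
    windows: Sequence[Tuple[str, Sequence[int], Sequence[int]]],
    tolerance: float = 0.05,  # assumed threshold; a real deployment would justify this value
) -> List[WindowResult]:
    """Flag each window whose accuracy is more than `tolerance` below the baseline."""
    results = []
    for label, y_true, y_pred in windows:
        acc = accuracy(y_true, y_pred)
        results.append(WindowResult(label, acc, acc < baseline_accuracy - tolerance))
    return results
```

In practice, the performance measure, window length and tolerance would depend on the intended use of the tool; accuracy is used here only to keep the sketch short.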
Citation
Azad, T.D., Krumholz, H.M. & Saria, S. Principles to guide clinical AI readiness and move from benchmarks to real-world evaluation. Nat Med (2026). https://doi.org/10.1038/s41591-025-04198-1