Risk stratification at prediabetes onset and association with diabetes outcomes using EHR data
Ritu Agarwal, Guodong (Gordon) Gao, Junjie Luo, Di Hu, Rui Han, Diyang Lyu, Nestoras Mathioudakis, Jehan El-Bayoumi, Nawar Shara
Abstract
Prediabetes (PD) can progress to type 2 diabetes (T2D), but individual risk varies widely. Few studies have rigorously characterized subgroups at the point of PD onset. Using electronic health records (EHRs), we developed a machine learning approach to stratify patients with PD and analyze their risk of progression to T2D. We defined PD onset based on strict HbA1c criteria and excluded patients with missing follow-ups or atypical clinical events, yielding a high-fidelity cohort of 14,436 patients from an initial pool of 74,054 (2017–2023, MedStar Health). An XGBoost model using routine features, including HbA1c, BMI, blood pressure, lipids, ALT, medication history, and lifestyle factors, was trained on 2018–2020 data and tested on 2021–2022 patients, achieving an AUC of 81.6%. Risk scores enabled subtyping into high-, medium-, and low-risk groups with distinct progression trajectories. Stratification patterns remained consistent in future cohorts. This approach supports earlier, personalized intervention and diabetes risk prediction using real-world EHR data.
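The evaluation design summarized in the abstract (train on earlier onset years, test on later ones, then group patients by predicted risk) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes risk scores have already been produced by some model (the paper uses XGBoost, which is not trained here), and the `Patient` record, the function names, and the cut-offs `low_cut=0.2` and `high_cut=0.5` are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence, Set, Tuple


@dataclass
class Patient:
    onset_year: int     # calendar year of prediabetes onset
    risk_score: float   # predicted progression probability (assumed given)
    progressed: bool    # whether T2D was observed during follow-up


def temporal_split(
    patients: Iterable[Patient], train_years: Set[int], test_years: Set[int]
) -> Tuple[List[Patient], List[Patient]]:
    """Split by onset year so the test set is strictly later than training."""
    patients = list(patients)
    train = [p for p in patients if p.onset_year in train_years]
    test = [p for p in patients if p.onset_year in test_years]
    return train, test


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Pairwise AUC: chance that a random progressor outranks a random
    non-progressor, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_group(score: float, low_cut: float = 0.2, high_cut: float = 0.5) -> str:
    """Map a risk score to a group; both cut-offs are hypothetical."""
    if score >= high_cut:
        return "high"
    if score >= low_cut:
        return "medium"
    return "low"


def progression_rate_by_group(patients: Iterable[Patient]) -> Dict[str, float]:
    """Observed progression rate within each risk group."""
    counts: Dict[str, Tuple[int, int]] = {}
    for p in patients:
        group = risk_group(p.risk_score)
        n, k = counts.get(group, (0, 0))
        counts[group] = (n + 1, k + int(p.progressed))
    return {group: k / n for group, (n, k) in counts.items()}
```

With the years from the abstract, the split would be called as `temporal_split(patients, {2018, 2019, 2020}, {2021, 2022})`. Comparing `progression_rate_by_group` across the resulting test set and any later cohort is one simple way to check whether the ordering of the risk groups holds over time.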
Citation: Luo, J., Hu, D., Han, R. et al. Risk stratification at prediabetes onset and association with diabetes outcomes using EHR data. npj Metab Health Dis 3, 48 (2025). https://doi.org/10.1038/s44324-025-00091-0