Monthly AI Seminar Synopsis: Advancing Personal Health with Foundation Models (Xin Liu, PhD, Google Research)
Moderated by Tinglong Dai and Risa Wolf
Seminar Summary for the HBHI Workgroup on AI and Healthcare
At this HBHI Workgroup on AI and Healthcare seminar, the conversation centered on how foundation models can transform personal health by turning continuous sensor streams into reliable, actionable guidance. The session, moderated by Dr. Risa Wolf of the Johns Hopkins School of Medicine and Dr. Tinglong Dai of the Johns Hopkins Carey Business School, featured Dr. Xin Liu of Google Research presenting “Advancing Personal Health with Foundation Models,” a tour of recent work that links multimodal sensing with large-scale reasoning for sleep, fitness, and everyday wellness decisions. Valerie Smothers served as discussant and situated the technical agenda in the governance and data-trust realities that determine whether such systems can be used responsibly inside a health system.
Speaker Bio: Dr. Xin Liu is a Senior Research Scientist at Google Research whose work integrates AI, wearable sensing, and reasoning to enable personalized health technologies. His recent projects include the Personal Health Large Language Model and Personal Health Agent, the Large Sensor Model trained on multi-million-hour wearable datasets, SensorLM for language-aligned signal understanding, and RADAR, a benchmark for data-aware reasoning on imperfect tabular data.
The Case for Foundation Models in Personal Health
Dr. Liu opened with what wearables can measure today, from sleep and activity to heart rate variability, respiration, and temperature. He framed the question patients actually ask—how to sleep better—and walked through the steps an effective system must perform: check data availability, compute meaningful aggregates, spot anomalies, contextualize within broader health factors, compare to relevant norms, and deliver recommendations that can be followed. That sequence set up why a foundation model approach is useful for problems that mix time-series signals with text and domain knowledge.
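A minimal sketch of that sequence, assuming a simple pandas series of nightly sleep durations, can make the steps concrete. All function and variable names below are illustrative, not drawn from the talk:

```python
import pandas as pd

def sleep_coaching_pipeline(sleep_hours: pd.Series) -> str:
    """Hypothetical sketch of the steps outlined in the talk, not Google's code."""
    # 1. Check data availability before computing anything.
    if sleep_hours.dropna().size < 7:
        return "Not enough recent sleep data to give reliable guidance."

    # 2. Compute meaningful aggregates.
    mean_sleep = sleep_hours.mean()
    # 3. Spot anomalies (here: nights more than 2 SD from the user's own mean).
    anomalies = sleep_hours[(sleep_hours - mean_sleep).abs() > 2 * sleep_hours.std()]

    # 4-5. Contextualize and compare to a relevant norm (7-9 h for most adults).
    below_norm = mean_sleep < 7.0

    # 6. Deliver a recommendation that can be followed.
    msg = f"Average sleep over the window: {mean_sleep:.1f} h."
    if below_norm:
        msg += " Consider a consistent bedtime to move toward 7-9 hours."
    if not anomalies.empty:
        msg += f" {anomalies.size} unusual night(s) detected; worth reviewing."
    return msg

# Example: two weeks of nightly sleep durations in hours.
nights = pd.Series([6.2, 6.8, 7.1, 5.0, 6.5, 6.9, 7.3,
                    6.0, 6.4, 9.8, 6.7, 6.6, 6.3, 7.0])
print(sleep_coaching_pipeline(nights))
```

The point of the sequence is that each step gates the next: without the availability check, every downstream aggregate and recommendation inherits silent data gaps.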
What the Evidence Shows: PH-LLM and the Personal Health Agent
The Personal Health LLM was presented as a fine-tuned model for sleep and fitness coaching, evaluated on curated case studies with expert ratings. A companion line of work introduced a Personal Health Agent that uses tool-augmented reasoning to analyze wearable data, generate code when needed, and iteratively answer open health queries. He presented results covering numerical correctness on objective tasks and human-rated reasoning quality on open-ended questions. Dr. Liu’s team also raised a practical question about when fine-tuning is necessary and when an agentic approach on a strong base model is sufficient.
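The agentic pattern described, analyzing data, generating code when needed, and iterating toward an answer, can be sketched as a plain tool-calling loop. Everything below (the model client, tool registry, JSON action convention, and step budget) is a generic illustration, not the Personal Health Agent's actual interface:

```python
import json

def run_health_agent(query: str, llm, tools: dict, max_steps: int = 5) -> str:
    """Generic tool-augmented reasoning loop; not Google's implementation.

    `llm` is assumed to return JSON like {"action": "run_code", "input": "..."}
    or {"action": "final_answer", "input": "..."}, a common agent convention.
    """
    transcript = [f"User question: {query}"]
    for _ in range(max_steps):
        step = json.loads(llm("\n".join(transcript)))
        if step["action"] == "final_answer":
            return step["input"]
        # Execute the requested tool (e.g., a sandboxed code runner over the
        # user's wearable data) and feed the observation back to the model.
        observation = tools[step["action"]](step["input"])
        transcript.append(f"Tool {step['action']} returned: {observation}")
    return "Could not answer within the step budget."
```

Framed this way, the fine-tuning-versus-agent question becomes concrete: a strong base model may only need better tools and observations, while a weaker one may need its weights adapted to the domain.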
Scaling Wearable Models: The Large Sensor Model
Dr. Liu then turned to a foundation model trained directly on wearable sensor data at scale. He described a dataset built from well over one hundred thousand participants and tens of millions of hours of signals, alongside systematic scaling experiments across compute, data, and parameters. The figures emphasized that larger models can be more sample-efficient and that, for a fixed data budget, the total hours of recorded data often matter more than the number of distinct subjects for downstream discriminative tasks.
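Scaling experiments of this kind are typically summarized by fitting saturating power laws of error against data volume. As a hedged illustration of that analysis style (the functional form is the standard one; the numbers are invented, not the paper's):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(hours, a, b, c):
    # error = a * hours^(-b) + c : error falls with data, saturating at c.
    return a * np.power(hours, -b) + c

# Hypothetical (hours of sensor data, downstream error) pairs.
hours = np.array([1e4, 1e5, 1e6, 1e7, 4e7])
error = np.array([0.42, 0.31, 0.24, 0.20, 0.19])

(a, b, c), _ = curve_fit(power_law, hours, error, p0=(1.0, 0.2, 0.1), maxfev=10000)
print(f"fit: error ~ {a:.2f} * hours^(-{b:.2f}) + {c:.2f}")
```

Fits like this are what allow claims such as "larger models are more sample-efficient" to be stated quantitatively rather than read off a single run.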
Bridging Sensors and Language: SensorLM
To connect continuous signals with natural-language insights, Dr. Liu’s team introduced SensorLM, which “learns the language” of wearable sensors. The examples showed how statistical, structural, and semantic narratives can be generated over a day of data to explain patterns such as activity episodes or shifts in heart-rate dynamics. That bridge between time-series representation and text enables personalized, comprehensible feedback.
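A common recipe for this kind of alignment is CLIP-style contrastive training over paired (sensor window, caption) batches. The sketch below shows that generic recipe only; SensorLM's actual objective and architecture may differ:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(sensor_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style loss over paired (sensor window, caption) embeddings.

    Generic recipe for language-aligned signal understanding; offered as an
    assumption about the approach, not SensorLM's published training code.
    """
    s = F.normalize(sensor_emb, dim=-1)   # (batch, dim)
    t = F.normalize(text_emb, dim=-1)     # (batch, dim)
    logits = s @ t.T / temperature        # pairwise similarities
    targets = torch.arange(s.size(0))     # i-th sensor window matches i-th caption
    # Symmetric cross-entropy: sensor-to-text and text-to-sensor.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Once signals and text share an embedding space, generating statistical, structural, and semantic narratives over a day of data becomes a retrieval-and-decoding problem rather than a bespoke rules pipeline.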
Strengthening Base-Model Reasoning: RADAR for Imperfect Tables
Because much of personal health data arrives as imperfect tables, Dr. Liu highlighted RADAR, a benchmark that programmatically introduces missingness, outliers, inconsistent logic, and bad values to test data-aware reasoning. The slides showed RADAR used with Gemini, framing a path to improve tabular reasoning without relying only on end-to-end task-specific fine-tuning.
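As a hedged illustration of what "programmatically introducing" such artifacts can look like (column names, rates, and artifact choices below are made up; RADAR's actual perturbation code is not reproduced here):

```python
import numpy as np
import pandas as pd

def perturb_table(df: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    """Inject artifact types like those named in the talk into a clean table."""
    out = df.copy()
    numeric = out.select_dtypes("number").columns
    # Missingness: blank out ~5% of numeric cells.
    mask = rng.random(out[numeric].shape) < 0.05
    out[numeric] = out[numeric].mask(mask)
    # Outliers: inflate one random cell per numeric column by 10x.
    for col in numeric:
        out.loc[rng.integers(len(out)), col] *= 10
    # Bad values: physiologically impossible entries (e.g., negative heart rate).
    if "heart_rate" in out.columns:
        out.loc[rng.integers(len(out)), "heart_rate"] = -1
    return out

rng = np.random.default_rng(0)
clean = pd.DataFrame({"heart_rate": [62.0, 71.0, 68.0, 75.0],
                      "steps": [4200.0, 8100.0, 5600.0, 9900.0]})
print(perturb_table(clean, rng))
```

Because the perturbations are generated rather than hand-labeled, a benchmark built this way can score whether a model notices the corruption before reasoning over the table.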
Discussant Spotlight: Valerie Smothers on Governance and Deployment
As discussant, Valerie Smothers placed the work in the context of Johns Hopkins’ Office of the Data Trust. She focused on how policies, processes, and infrastructure must evolve if personal health agents and sensor-scale models are to be used responsibly. Her remarks tied model design to privacy and safety guardrails, emphasized the need for auditable evaluation artifacts, and underscored workforce readiness for tools that will surface health guidance directly to patients and clinicians.
From the Floor: Safety, Privacy, and Generalization
The Q&A featured sustained, detailed questioning from the audience, and the exchange helped translate research claims into operational choices.
Dr. Stuart Ray pressed on clinical significance rather than statistical novelty. He pointed to evidence from follow-ons to the Large Sensor Model and asked how much improvement actually matters at the bedside if sensitivity and specificity cluster near eighty percent. He noted that gains can look fractional once measurement error and data quality are taken seriously, and he asked how the team plans to adjudicate whether a given score is worth adopting in practice.
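A quick worked example makes the concern concrete. At a 5% prevalence (an illustrative figure, not one from the talk), sensitivity and specificity of 0.80 imply a positive predictive value near 17%, and a two-point specificity gain moves it only to about 19%:

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    # Bayes' rule: P(condition | positive test).
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

print(f"{ppv(0.80, 0.80, 0.05):.3f}")  # ~0.174
print(f"{ppv(0.80, 0.82, 0.05):.3f}")  # ~0.190
```

Seen through this lens, a "fractional" benchmark gain may or may not change how often a clinician acts on a positive result, which is exactly the adjudication question Dr. Ray raised.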
Dr. Gordon Gao asked about the tension between research and product realities at a large technology company. He raised the question of how research agendas are protected when market turbulence and competitive pressure are high. He also asked how the group thinks about disclosure, reproducibility, and governance when moving from a research prototype to a product-adjacent system that people could come to depend on for health guidance.
Junjie Luo focused on generalization beyond the wearables featured in the slides. He asked how the modeling pipeline would extend to continuous glucose monitoring and other clinical-grade signals, what the failure modes might be when sampling is irregular or sensors drop data, and how the models would avoid spurious correlations when fusing biomedical time series with lifestyle logs.
Additional questions from the audience returned to deployment. These questions probed how to preserve privacy while analyzing continuous streams, how to monitor systems that learn from changing behavior, and how to compare models across devices, firmware updates, and subpopulations. Dr. Liu’s responses emphasized careful data curation, evaluations that include subgroup analyses by design, and live monitoring capable of detecting drift before it affects users.
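Live drift monitoring of the kind described is often implemented as a distributional comparison between a reference window and recent data. A minimal sketch using a two-sample Kolmogorov-Smirnov test follows; the threshold, window sizes, and signal are illustrative assumptions, not details from the talk:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference: np.ndarray, recent: np.ndarray,
                alpha: float = 0.01) -> bool:
    """Flag drift when recent data diverges from the reference distribution."""
    stat, p_value = ks_2samp(reference, recent)
    return p_value < alpha

rng = np.random.default_rng(1)
baseline_hr = rng.normal(65, 8, size=5000)   # reference resting heart rate
shifted_hr = rng.normal(70, 8, size=500)     # recent window after a shift
print(drift_alert(baseline_hr, shifted_hr))  # True: the distribution has moved
```

Running such checks per device model, firmware version, and subpopulation is one way to operationalize the cross-device comparability the audience asked about.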
Looking Ahead
The session closed with thanks from the moderators and a reminder about the next seminar in the series.