Insight Hub
2026.03.31
The Deployment Gap in Clinical AI

By Henry J. Haiser, Operating Partner, Corundum Systems Biology

At this year’s HPP Global conference in Abu Dhabi, much of the discussion focused on the promise of AI in healthcare: foundation models, digital twins, and multimodal data collection. But some of the most valuable sessions took a different angle: what happens when you hand control of a high-performing AI model to clinicians?

Across several talks, a consistent theme emerged. Building a model that works in a research setting is one challenge. Deploying it in a way that measurably improves patient care is a much harder one.

Here, we highlight a few of the most instructive examples shared at the conference.

The Messy Reality of Real-World Data

Anna Goldenberg from the University of Toronto presented one of the conference’s starkest case studies. Her team developed a model to predict cardiac arrest in pediatric ICU patients using continuous physiological signals recorded every five seconds1. Across six years of retrospective data, the model correctly distinguished patients at risk for cardiac arrest more than 90% of the time. It looked, by standard metrics, like a success.

Then the model was deployed. On real clinical data from 2020 to 2023, with 360 patients across 42 beds, it achieved almost no precision or recall. The causes turned out to be real-world noise the training data had not captured: ventilator blockages mimicking cardiac arrest signals, bed transfers creating unexpected signal patterns, and definitions of cardiac arrest that shifted as clinicians became more involved.

Goldenberg’s team eventually recovered performance by carefully redefining outcomes with clinical partners and retraining the model. The fix involved shifting from cardiac arrest prediction to the broader, more actionable problem of patient deterioration.

The lesson is that a model’s initial performance on a curated dataset may tell you very little about how it will behave in the clinic. The messiness of clinical environments, with shifting definitions, missing data, and unexpected artifacts, is a central challenge for clinical AI deployment.
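Part of this gap is arithmetic that is easy to overlook. Below is a minimal sketch, with illustrative numbers rather than figures from Goldenberg’s talk, of how low event prevalence and deployment noise can crush precision even when retrospective discrimination looks strong.

```python
# Illustrative only: how low event prevalence and deployment noise can
# erode precision even when retrospective discrimination looks strong.
# All numbers below are assumptions for the sketch, not figures from the talk.

def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    tp = sensitivity * prevalence              # true-positive rate in the population
    fp = (1 - specificity) * (1 - prevalence)  # false-positive rate in the population
    return tp / (tp + fp)

# A model that distinguishes at-risk patients "more than 90% of the time"...
sens, spec = 0.90, 0.90

# ...still has modest precision if only ~1% of monitored windows
# actually precede a cardiac arrest:
print(f"precision at 1% prevalence: {precision(sens, spec, 0.01):.1%}")  # ~8.3%

# Deployment artifacts that mimic the event (ventilator blockages, bed
# transfers) act like a drop in effective specificity:
print(f"precision at spec=0.70:     {precision(sens, 0.70, 0.01):.1%}")  # ~2.9%
```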

The Reliance Problem

Jenna Wiens from the University of Michigan brought a complementary perspective. Her group has achieved something rare in academic AI research: multiple models running at the bedside, including systems for predicting healthcare-associated infections and clinical deterioration used by Michigan’s rapid response team.

Wiens focused on the challenge of how clinicians interact with AI outputs. In a study of over 450 clinicians across 12 U.S. states, her team tested whether AI assistance improved the diagnosis of acute respiratory failure, a condition with an initial misdiagnosis rate of up to 30%2. The AI model itself performed on par with or better than physicians who had access to the full patient record. But when clinicians worked with the model, accuracy rose only modestly, from 73% to 81%. This suggests a ceiling on the value of simply showing clinicians more information. Interface and workflow design may matter as much as the model’s accuracy.

The flip side was even more troubling. When clinicians were given an inaccurate model, their performance dropped to 62%. They did not ignore poor predictions; they over-relied on them. This creates a difficult design problem. AI must be useful enough that clinicians engage with it, but not so authoritative that it overrides independent judgment.
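One way to see why over-reliance is so costly is a toy deference model: assume clinicians defer to the AI on some fraction of cases and judge the rest on their own. The deference rate and model accuracies below are assumptions chosen to bracket the study’s reported numbers, not parameters from Wiens’ work.

```python
# Toy deference model (an illustration, not an analysis from the study):
# clinicians defer to the model on a fraction `deference` of cases and
# use their own judgment on the rest.

def team_accuracy(clinician_acc: float, model_acc: float, deference: float) -> float:
    return deference * model_acc + (1 - deference) * clinician_acc

clinician = 0.73  # unassisted accuracy reported in the vignette study

# With a strong model, deference helps:
print(f"{team_accuracy(clinician, 0.90, 0.5):.1%}")  # 81.5%, close to the observed 81%

# The same habit applied to a systematically biased model (assumed ~50%
# accurate here) drags the team below the clinician's solo baseline:
print(f"{team_accuracy(clinician, 0.50, 0.5):.1%}")  # 61.5%, near the observed 62%
```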

Wiens also presented a pointed evaluation of Epic’s widely adopted sepsis prediction model3. Studying over 77,000 patients at Michigan Medicine, her team found that the model performed well retrospectively but performed worse than random chance at predicting sepsis before clinical diagnosis. Foundation models trained on clinical data may compound this problem, she argued, by learning to predict clinician actions rather than providing independent insight that improves clinical performance.
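The methodological crux of that evaluation is worth making concrete: scoring predictions only on data available before clinicians had begun to act. A rough sketch of that censoring step, using hypothetical column names rather than the study’s actual pipeline, might look like this.

```python
# Sketch: evaluate an early-warning model only on prediction windows made
# BEFORE the first sign of clinical suspicion (e.g., a blood culture or
# antibiotic order). Column names here are hypothetical, not the study's.
import pandas as pd
from sklearn.metrics import roc_auc_score

def pre_recognition_auroc(preds: pd.DataFrame) -> float:
    """AUROC restricted to windows that precede any clinical action."""
    early = preds[
        preds["first_suspicion_time"].isna()                      # never suspected, or
        | (preds["pred_time"] < preds["first_suspicion_time"])    # before suspicion
    ]
    return roc_auc_score(early["label"], early["score"])

# Evaluating over *all* windows can flatter a model, because post-suspicion
# windows leak clinician actions (orders, treatments) into its inputs;
# restricting to pre-suspicion windows removes that leakage.
```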

Bright Spots in Deployment

Not every clinical AI deployment story was cautionary. The ones that worked tended to share a common trait: they were designed around existing clinical workflows rather than asking clinicians to change their behavior. Nigam Shah from Stanford described the rollout of ChatEHR, a natural language interface that allows physicians to query electronic health records conversationally. Since launching, the tool has attracted 1,500 physician users across 23,000 sessions. Shah noted it is one of the rare health IT deployments where physicians actively request access and continue using the tool weeks after adoption.

Yossi Matias from Google Research highlighted ARDA, a diabetic retinopathy screening program deployed in Thailand and India. The system delivers a diagnosis in roughly two minutes, before the patient leaves the clinic, replacing a process that previously took weeks. Over one million screenings have been completed, and governments have committed to six million free screenings over the coming decade. The program’s success hinged not just on model accuracy but on thoughtful integration into existing clinical workflows.

Looking Ahead

The talks at HPP Global painted a nuanced picture. The field has moved well beyond the question of whether machine learning models can match physician performance on specific tasks; in many cases, they clearly can. The harder questions are now operational. How do you handle shifting outcome definitions? How do you calibrate clinician trust? How do you avoid models that simply mirror the patterns of existing clinical behavior?

Though the examples from the conference focused on clinical AI broadly, a recent systematic review of LLMs in medicine tells a similar story4. The authors identified over 4,600 peer-reviewed studies evaluating large language models in clinical medicine since 2022, yet only around 1,000 used real patient data, and just 19 were prospective randomized trials. LLMs outperformed humans more often on synthetic tasks than on real clinical data, reinforcing a theme heard repeatedly at HPP Global: strong benchmark performance does not reliably translate to clinical impact.

The recurring lessons from the conference were clear: train and evaluate models on real-world clinical data, optimize for objectives that reflect genuine clinical utility, and design AI products with a clear understanding of how humans interact with them, since even an accurate model can add little, and a flawed one can actively degrade care, when the interface encourages over-reliance. The teams making progress on these fronts have invested years in the work, and the results are beginning to show.

About Henry J. Haiser

Henry Haiser is an operating partner at Corundum Systems Biology, where he leads scientific diligence and supports portfolio company development. He brings over a decade of experience in pharmaceutical drug discovery at the Novartis Institutes for BioMedical Research and Takeda Pharmaceuticals, spanning pre-clinical and clinical-stage programs. Henry has worked at the intersection of the gut microbiome and therapeutics for over fifteen years. He holds a PhD in microbiology from McMaster University and completed postdoctoral research in systems biology at Harvard University.

Works Cited

  1. Tonekaboni S, Mazwi M, Laussen P, Eytan D, Greer R, Goodfellow SD, Goodwin A, Brudno M, Goldenberg A. Prediction of Cardiac Arrest from Physiological Signals in the Pediatric ICU. Proceedings of Machine Learning Research 85:534–550, 2018.
  2. Jabbour S, Fouhey D, Shepard S, Valley TS, Kazerooni EA, Banovic N, Wiens J, Sjoding MW. Measuring the Impact of AI in the Diagnosis of Hospitalized Patients: A Randomized Clinical Vignette Survey Study. JAMA 330(23):2275, 2023.
  3. Kamran F, Tjandra D, Heiler A, Virzi J, Singh K, King JE, Valley TS, Wiens J. Evaluation of Sepsis Prediction Models before Onset of Treatment. NEJM AI 1(3), 2024.
  4. Chen SF, Alyakin A, Seas A, Yang E, Choi JJ, Lee JV, Chen AL, Warman PI, Bitolas RT, Steele RJ, Alber DA, Oermann EK. LLM-assisted systematic review of large language models in clinical medicine. Nature Medicine, 2026. doi:10.1038/s41591-026-04229-5.