How The ARISE Network Is Rethinking Clinical AI
You’ve seen the headlines: AI aces the medical boards. AI outperforms expert physicians. But what does this actually mean? And how do we evaluate technology that’s advancing faster than we can fully make sense of it?
The AI Research and Science Evaluation (ARISE) Healthcare Network was formed to help answer these questions. Spanning multiple medical centers and led by physicians at Harvard and Stanford with diverse and complementary backgrounds, ARISE is trying to understand what AI systems can do in medicine and how we can evaluate and explain their performance.
They are working to define what holds up in real-world medicine, what we mean by clinical reasoning, how clinicians and AI should work together, when either may perform better alone, and how we might recognize if AI approaches “medical superintelligence.”
The Physician Data Scientist And AI Magic Tricks
Arthur C. Clarke famously wrote, “Any sufficiently advanced technology is indistinguishable from magic.”
Decades later, many people see AI as magic. So, who better to spot a magic trick than Jonathan H. Chen, a physician, data scientist, and performing magician?
Chen’s path is not typical. He started college at 13 and worked as a software engineer before returning to school to earn an MD and a PhD in computer science, and then trained in internal medicine. Since joining the Stanford faculty in 2017, he’s been evaluating how AI applies to medical problems.
He points out that the first rule of magic is (mis)directing the audience to look where you want them to look. So, when LLMs like ChatGPT arrived, he knew to look in the other direction to understand what they’re doing, where they fail, and how clinicians might use them.
A core theme of his research is understanding how physicians and AI can best work together.
In late 2024, his team made headlines after finding that, on diagnostic reasoning tasks, LLMs alone outperformed both physicians using AI and physicians working alone. This ran counter to the long-held “fundamental theorem” of informatics that physicians plus AI will outperform either alone.
Part of the explanation was timing. It was still early, and many physicians used LLMs like search engines.
So, in a follow-up trial, the team tested a customized LLM tailored for clinical collaboration that taught clinicians in real time how to use it. This time, physician-plus-AI outperformed physicians alone, while matching, but still not surpassing, AI alone in diagnostic reasoning.
The group later reported similar results in another study on management reasoning tasks. Through a new ARPA-H grant, ARISE is now building a “flight simulator” for medicine to study and improve how clinicians and AI work together.
Taken together, these findings raise a deeper question: if AI alone sometimes outperforms physicians working with AI on reasoning tasks, what exactly are we measuring when we talk about “clinical reasoning” in the first place?
The Physician Historian And The Nature of Reasoning
Adam Rodman, a fast-talking and even faster-thinking Harvard internist, medical historian, and clinical educator, has spent the past two decades studying clinical reasoning and decision-making.
The first thing he will tell you is that none of this is new. Technology has always changed what it means to be a doctor. Think of the stethoscope, anesthesia, penicillin, MRI, and the electronic health record.
What’s different this time is that AI moves up the cognitive stack, shifting knowledge and even thinking to machines. Yet there’s also a long history behind this work, and clinical reasoning may not be what medicine portrays it to be.
Rodman points out that modern ideas about both clinical reasoning and AI surprisingly share common roots in World War II-era signal detection theory, which gave rise to frameworks such as sensitivity, specificity, and ROC curves.
Building on this tradition, pioneers such as Robert Ledley and Lee Lusted argued in a landmark 1959 article that medical decision-making could be understood through logic, probability, and value theory. Their work laid the groundwork for a series of computerized diagnostic tools like INTERNIST-1 and Isabel that sought to model the clinical reasoning of expert physicians.
Rodman believes that computer science, in turn, shaped how medical schools teach clinical reasoning, using frameworks such as Fagan’s nomogram, pretest probabilities, and rule-based heuristics. While these approaches are useful for teaching and assessing trainees, he believes they may not fully capture how experts actually practice.
In his words, “We train doctors in ways that reflect the appearance of expertise, based on cognitive models and computer-era abstractions, rather than how real experts behave, which is often fast, intuitive, and non-linear. And we are now building AI systems that mimic that same abstraction.”
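The textbook frameworks mentioned above come down to a small amount of arithmetic. Fagan’s nomogram is a graphical shortcut for Bayes’ rule in odds form: convert the pretest probability to odds, multiply by the test’s likelihood ratio, and convert back. The Python sketch below illustrates that calculation only; the function names and the example numbers (a test with 90% sensitivity and 80% specificity, and a 20% pretest probability) are made up for illustration and do not come from ARISE or any study described here.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must be strictly between 0 and 1")
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pretest_probability: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form, the calculation Fagan's nomogram performs graphically."""
    if not 0.0 < pretest_probability < 1.0:
        raise ValueError("pretest probability must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio cannot be negative")
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the arithmetic.
    lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
    after_positive = post_test_probability(0.20, lr_pos)
    after_negative = post_test_probability(0.20, lr_neg)
    print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
    print(f"After a positive result: {after_positive:.1%}")
    print(f"After a negative result: {after_negative:.1%}")
```

With these illustrative inputs, a positive result moves a 20% suspicion to roughly 53%, and a negative result moves it to roughly 3%. This is the tidy, stepwise model Rodman describes as useful for teaching, and it is exactly the kind of abstraction he suggests may not capture how experts reason at the bedside.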
These same traditions shaped how medical AI systems came to be evaluated, often using complex clinical case vignettes drawn from the New England Journal of Medicine clinicopathological case conference series.
Following this tradition, when LLMs emerged, Rodman and colleagues were the first to report that GPT-4 provided the correct diagnosis in its differential in two-thirds of these challenging cases. Still, he was quick to admit that studies like his are limited by saturated benchmarks and a lack of physician comparators.
So, Rodman and the ARISE team went a few steps further in a set of experiments recently published in Science. They found that OpenAI’s o1 reasoning model outperformed physicians across multiple historical clinical reasoning tasks.
More notably, o1 performed as well as or better than two Harvard internists in generating differential diagnoses based on EHR data for 76 real-world emergency cases.
While the study captured widespread attention, Rodman sees this as an incremental step on a much longer journey.
“What we need now,” he told me, “are prospective clinical trials in real-world patient care settings.”
Of course, diagnosis is just one aspect of clinical reasoning. And clinical reasoning is just one domain in which medical AI is being developed. As AI systems begin to perform differently across tasks—and sometimes outperform physicians—how should we evaluate them in ways that actually matter?
The Physician Bridge Builder And Communicating Science
Ethan Goh is a thoughtful, mild-mannered physician with a diverse range of experience beyond his years. After starting his career as a hospitalist in Europe and Asia, he served as a policymaker in Singapore, an advisor to the UK National Health Service, and an executive at a digital health startup before joining Stanford as a postdoctoral fellow in informatics.
Now serving as ARISE’s Executive Director, he draws on his diverse background to connect AI development, academic research, and real-world clinical care.
Goh sees ARISE’s main role as understanding and clearly explaining what AI can do in healthcare.
Traditionally, AI was evaluated using medical exam questions, such as those on the USMLE. Yet while these exams assess knowledge, they do not reflect real-world practice, where clinicians iteratively gather information from patients who often present differently than textbook descriptions. And a machine that performs well on a standardized knowledge test will not necessarily provide good clinical care.
The field is now moving toward simulations that more closely mirror clinical practice, often using rubrics rather than single correct answers.
Still, even these newer benchmarks typically lack context and focus on isolated cognition rather than actual clinical work.
Because medicine is not a single task and “doctoring” is not a single function, Goh argues that benchmarks must be framed around precise tasks such as triage, diagnosis, treatment, and communication, each with different thresholds for AI readiness.
Accordingly, ARISE introduced the Medical AI Superintelligence Test (MAST), which combines multiple domains of clinical competence—including diagnosis, management, reasoning, safety, and agentic workflow use—all benchmarked against realistic physician baselines and incorporating physicians working with AI, not just models alone.
As Goh explained, “Our goal is to open-source benchmarks and constantly index the latest models to find out where they are strong or weak on key clinical tasks, rather than the industry relying on its own limited benchmarks.”
One MAST component benchmark is NOHARM, which quantifies how often an LLM makes potentially harmful recommendations. Recently, the ARISE team reported that even top models generate potentially harmful advice in up to 22% of cases, typically due to errors of omission. Still, the best models outperformed generalist physicians on safety, and ensembles of models made fewer errors than individual models.
Another MAST component is MedAgentBench, which assesses models’ ability to independently perform 300 patient-specific, clinically relevant tasks—like ordering medications and aggregating test results—in a realistic FHIR-based EHR setting.
In mid-2025, the ARISE team reported that the best-performing model achieved a 70% success rate, with most failures clustering around tasks requiring three or more steps. However, just six months later, Anthropic announced that its Opus 4.6 model achieved a 92% success rate, underscoring how quickly these capabilities are advancing and how quickly benchmarks themselves may become outdated.
In response, ARISE developed PhysicianBench, a new benchmark designed to evaluate how well AI agents complete multi-step medical consultation and execution tasks in realistic EHR settings.
Yet AI may quickly outpace this benchmark, too. If these trends continue and AI approaches superintelligence—which ARISE defines as outperforming top clinicians across a range of clinically meaningful tasks under real-world conditions—evaluating AI based on concordance with physician experts will break down, just as experts in the game of Go were confounded by AlphaGo’s winning Move 37.
This will force a shift to real-world randomized controlled trials with hard clinical outcomes.
Chen, Rodman, and Goh each believe we are on the verge of a fundamental shift in what it means to be a doctor. Whether they are right or wrong, AI is forcing medicine to reconsider some of its deepest assumptions about clinical reasoning and expertise, human-machine interaction, and how we define good care.
In the process, AI is pushing us to think more carefully about what physicians do, where AI helps, and how and when the two should work together. These questions are no longer theoretical.