Table of Contents
- Introduction
- Overview of the Study
- What Is o1-preview and How Does It Differ from GPT-4?
- History of AI in Medical Diagnostics
- Study Design: Five Experiments Testing o1-preview
- NEJM Clinicopathologic Conferences (CPCs)
- NEJM Healer Diagnostic Cases
- Grey Matters Management Cases
- Landmark Diagnostic Cases
- Diagnostic Probabilistic Reasoning Cases
- Key Findings
- Differential Diagnosis Accuracy
- Management Reasoning Superiority
- Medical Reasoning and R-IDEA Scores
- Probabilistic Reasoning Limitations
- Triage “Cannot-Miss” Diagnoses
- Analysis: Why Does o1-preview Excel?
- Implications for Clinical Practice
- Remaining Challenges
- Memorization and Data Leakage
- Probabilistic Reasoning Shortcomings
- Real-World Workflows and Integration
- Looking Ahead: The Future of AI in Medicine
- Conclusion
- References
1. Introduction
Within the past few years, the landscape of medical diagnostics has experienced a rapid transformation, largely fueled by breakthroughs in artificial intelligence (AI). Previously, AI-based systems were considered experimental—tools that might one day help overworked clinicians. However, that day may be arriving faster than expected.
A new study by a team of researchers from Harvard Medical School and Stanford University suggests that OpenAI’s newest large language model (LLM), known as o1-preview, might already be outperforming doctors in diagnosing complicated, real-world medical cases. This is not just about besting laypeople or medical students on a simple quiz: in some instances, o1-preview showed performance that exceeded that of practicing physicians, especially in scenarios requiring complex reasoning and management decisions.
Such findings have broad implications. Diagnostic errors and delays cause significant patient harm worldwide, and any technology that can reliably minimize or even prevent misdiagnoses could save lives and reduce healthcare costs, potentially revolutionizing the entire field of clinical medicine. But as with any new technology, there are also pitfalls, ethical concerns, and limitations that must be addressed.
In this article, we’ll explore the specifics of the o1-preview model, how the study was conducted, and why these findings matter. We’ll draw upon multiple references to ensure our analysis is grounded in the current literature and to highlight the existing research that paved the way for these breakthroughs.
2. Overview of the Study
The research team—led by authors from Beth Israel Deaconess Medical Center, Harvard Medical School, Stanford University, University of Minnesota Medical School, and more—performed a comprehensive evaluation of OpenAI’s o1-preview model across five different medical reasoning tasks:
- Generating differential diagnoses on NEJM Clinicopathologic Conferences (CPCs)
- Presentation of reasoning in NEJM Healer Diagnostic Cases
- Management (treatment) reasoning for complex, multi-specialty scenarios
- Landmark diagnostic challenges from a previously published gold-standard study
- Diagnostic probabilistic reasoning tasks
Importantly, the outputs from these tasks were adjudicated by physician experts using validated psychometric instruments, such as the Bond Score for differential diagnoses and the R-IDEA scale for evaluating clinical reasoning. The large sample sizes and the diversity of test formats, ranging from text-based hypothetical cases to real NEJM CPC scenarios, make this study a compelling piece of evidence.
In nearly all tasks except one (probabilistic reasoning), o1-preview either matched or significantly outperformed practicing physicians and earlier-generation language models such as GPT-4.
For direct access to the study and supplemental materials, see: https://arxiv.org/pdf/2412.10849
3. What Is o1-preview and How Does It Differ from GPT-4?
Most readers are already familiar with GPT-4, the general-purpose large language model released by OpenAI that gained global attention for its strong performance on knowledge tests, coding tasks, and yes, even medical licensing exams. But o1-preview is a specialized iteration built with the capacity for so-called “native chain-of-thought” reasoning.
Unlike GPT-4, which provides chain-of-thought reasoning only when prompted to do so, o1-preview expands its own reasoning steps internally at run time. In effect, it takes more “mental” steps under the hood to analyze a complex question before finalizing its response. Preliminary evidence suggests that this leads to better performance on tasks requiring multi-step logic, higher-order problem solving, or the integration of multiple data streams.
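To make the distinction concrete, the snippet below sketches how the two styles of model might be queried through OpenAI’s Python SDK. The case text and prompt wording are placeholders of our own, not the prompts used in the study; only the model identifiers and the chat-completions call are standard API usage.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical, abbreviated case text for illustration only.
case_text = "A 54-year-old presents with fever, weight loss, and a new heart murmur..."

# With a general-purpose model, the step-by-step reasoning must be requested in the prompt.
gpt4_response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Think step by step, then give a ranked differential diagnosis.\n\n{case_text}",
    }],
)

# A reasoning model such as o1-preview expands its intermediate steps internally,
# so the prompt can simply state the clinical task.
o1_response = client.chat.completions.create(
    model="o1-preview",
    messages=[{
        "role": "user",
        "content": f"Give a ranked differential diagnosis.\n\n{case_text}",
    }],
)

print(gpt4_response.choices[0].message.content)
print(o1_response.choices[0].message.content)
```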
According to OpenAI’s official system card and a blog post describing the approach (Introducing OpenAI o1), o1-preview’s architecture focuses on dynamic expansions of context, advanced constraint satisfaction, and “reasoning buffers” that let it parse more nuanced requests. The study by Kanjee et al., referenced in the results, is among the first to evaluate the model on realistic, difficult medical diagnoses.
4. History of AI in Medical Diagnostics
Artificial intelligence support tools aren’t entirely new in medicine. Some notable landmarks:
- 1950s and 1960s:
- Emergence of naive Bayesian calculators and linear regression modeling for diagnostic probability.
- Early attempts to systematically measure disease prevalence using electronic calculators.
- [Reference: Brodman et al. (1952). J. Clin. Psychol.; Ledley & Lusted (1959). Science.]
- 1970s:
- The MYCIN system, developed at Stanford, harnessed rule-based approaches to help with infectious disease management.
- [Reference: Shortliffe (1977). Proc. Annu. Symp. Comput. Appl. Med. Care.]
- 1980s – Early 2000s:
- Expansion of differential diagnosis generators like Isabel and early “expert systems.”
- The notion that AI might replicate some aspects of clinical reasoning.
- [Reference: Ing et al. (2023). Ophthal. Plast. Reconstr. Surg.]
- 2012 – Present:
- The deep learning revolution, leading to neural networks surpassing classical approaches.
- GPT-based models soared in popularity, culminating in GPT-4 and specialized variants.
- [References: GPT-4 was widely explored in medical contexts in JAMA and Nat. Med. articles by multiple research groups.]
While these older methods had successes, many were stymied by the complexity of real-world clinical practice and by an inability to handle the open-endedness of differential diagnosis. Large language models (LLMs) brought a new wave of capabilities, as they could parse free-text clinical notes and produce reasoned, narrative responses. Yet concerns remained regarding data leakage, memorization, and the potential for misinformation.
5. Study Design: Five Experiments Testing o1-preview
The comprehensive study that pitted o1-preview against medical professionals (and GPT-4) involved five different experimental setups, each capturing a unique dimension of clinical reasoning:
5.1 NEJM Clinicopathologic Conferences (CPCs)
- Data Source: The New England Journal of Medicine (NEJM) regularly publishes Clinicopathologic Conferences: real, challenging cases that have been used to test differential diagnosis skills since the 1950s.
- Method: The authors selected 143 NEJM CPCs from 2021 to September 2024. The model was prompted to provide a differential diagnosis, then asked which diagnostic tests it would order next.
- Scoring:
- Bond Score (0–5) to assess whether the correct diagnosis appears in the suggested list and how precisely it’s labeled.
- A 0–2 Likert scale for how “helpful” or “accurate” the recommended next diagnostic steps were, as compared to the case’s actual management.
5.2 NEJM Healer Diagnostic Cases
- Data Source: 20 medical cases from the NEJM Healer curriculum, specifically curated to assess clinical reasoning.
- Method: The model was asked to produce a “problem representation,” a prioritized differential diagnosis, and a justification at four data-acquisition stages (triage, review of systems, physical exam, diagnostic tests).
- Scoring:
- R-IDEA: A validated 10-point scale covering the core elements of documented diagnostic reasoning (interpretive summary, differential diagnosis, explanation of the lead diagnosis, and alternative diagnoses). A perfect R-IDEA score indicates truly outstanding clinical reasoning documentation.
5.3 Grey Matters Management Cases
- Data Source: Five complex cases originally employed in a separate study. Derived from real patients, the cases involve multi-step management decisions and were designed by 25 physician experts from a range of subspecialties.
- Method: The model was tested on next-step management, requiring knowledge of specialized guidelines, emergent interventions, or uncommon treatment protocols.
- Scoring:
- Each case was scored out of 100 points, based on a detailed rubric assessing correctness, comprehensiveness, and appropriateness of the management strategy.
5.4 Landmark Diagnostic Cases
- Data Source: Six clinical vignettes from a landmark 1994 study by Berner and colleagues that originally evaluated four computer-based diagnostic systems. These cases have been deliberately kept out of the public domain to minimize the chance that a model has memorized them.
- Method: The model was asked to identify up to three differential diagnoses, factors favoring or opposing each diagnosis, a single final diagnosis, and next diagnostic steps.
- Scoring:
- A more elaborate rubric, again aggregated into a 0–100 scale for final comparison.
5.5 Diagnostic Probabilistic Reasoning Cases
- Data Source: Five primary care vignettes from a previous large-scale study of 553 clinicians. Physicians were asked to estimate pretest and post-test probabilities for given clinical scenarios.
- Method: o1-preview was prompted to do exactly the same, using 100 model outputs per scenario.
- Scoring:
- The difference between the model’s estimates and the reference “true” range from the medical literature was captured by mean absolute error (MAE) and mean absolute percentage error (MAPE); a minimal sketch of these metrics follows this list.
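As a quick illustration of how such estimates can be scored, here is a minimal sketch of MAE and MAPE for a batch of sampled probability estimates. The numbers are synthetic, and for simplicity the reference is a single value rather than the literature-derived range used in the study.

```python
import numpy as np

def mae_mape(estimates, reference):
    """MAE and MAPE of probability estimates (in percent) against a single reference value."""
    estimates = np.asarray(estimates, dtype=float)
    abs_err = np.abs(estimates - reference)
    mae = abs_err.mean()
    mape = (abs_err / reference).mean() * 100.0
    return mae, mape

# Hypothetical example: 100 sampled pretest-probability estimates (in %)
# for one vignette, compared against an illustrative literature value of 4%.
rng = np.random.default_rng(0)
samples = rng.normal(loc=10.0, scale=3.0, size=100).clip(0, 100)
print(mae_mape(samples, reference=4.0))
```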
6. Key Findings
6.1 Differential Diagnosis Accuracy
o1-preview included the correct diagnosis in its list of differentials 78.3% of the time overall on NEJM CPCs. On 70 overlapping CPCs that had been previously tested on GPT-4, it performed at 88.6% (vs. GPT-4’s 72.9%). A Bond Score of 5/5 (exact match of the final diagnosis) was achieved in 52% of all cases.
Physician raters showed substantial agreement on scoring (κ = 0.66), indicating that the Bond Scores assigned to the model’s differential lists were consistent across raters. Figure 1 in the original publication compares o1-preview to older differential diagnosis generators and shows a trajectory of rapid improvement in LLMs over the last few years.
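For readers unfamiliar with the κ statistic, the toy example below shows how inter-rater agreement on Bond Scores could be computed with scikit-learn. The ratings are invented for illustration and are not the study’s data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Bond Scores (0-5) assigned by two physician raters to ten cases.
# Kappa measures agreement between raters beyond chance; it says nothing about
# whether the model's differentials were themselves correct.
rater_a = [5, 4, 5, 2, 0, 3, 5, 4, 1, 5]
rater_b = [5, 4, 4, 2, 0, 3, 5, 5, 1, 5]

print(round(cohen_kappa_score(rater_a, rater_b), 2))
```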
6.2 Management Reasoning Superiority
In the Grey Matters Management Cases, o1-preview dramatically outperformed practicing clinicians with conventional resources. The study used a 100-point scoring system:
- o1-preview: 86% median score
- GPT-4 alone: 42% median score
- Physicians with GPT-4: 41% median score
- Physicians with conventional resources: 34% median score
This difference was highly statistically significant (p < 0.001). Notably, these cases were deliberately designed by 25 specialists to be challenging and to require advanced medical knowledge. “Humans appropriately struggled,” said Dr. Adam Rodman on social media, noting that even experienced clinicians rarely scored beyond 60–70%. o1-preview, meanwhile, soared well above that threshold.
6.3 Medical Reasoning and R-IDEA Scores
The NEJM Healer set of 20 clinical vignettes measured how well participants documented reasoning. On the R-IDEA scale (0–10), the results were:
- o1-preview: Perfect scores in 78 out of 80 total responses
- GPT-4: Perfect scores in 47 out of 80
- Attending physicians: Perfect in 28 out of 80
- Resident physicians: Perfect in 16 out of 80
The difference between o1-preview and human physicians was statistically significant (p < 0.0001). This suggests o1-preview not only arrives at correct diagnoses, but also explains its reasoning in a manner consistent with high-quality clinical documentation standards.
One measure of interest was “cannot-miss” diagnoses, i.e., those that, if overlooked, could lead to catastrophic patient harm (e.g., subarachnoid hemorrhage). The median proportion of “cannot-miss” diagnoses identified by o1-preview was 0.92 (IQR 0.62–1.0), comparable to GPT-4 and marginally higher than some human groups, although the differences were not statistically significant.
6.4 Probabilistic Reasoning Limitations
When asked to estimate pre- and post-test probabilities, o1-preview did not significantly improve over GPT-4. It often overestimated the likelihood of pneumonia or missed the lower boundary in very rare events. This might reflect the inherent difficulty of specifying probabilities for open-text generative models or could be a sign that it simply “thinks in text” rather than in numeric intervals.
Interestingly, for a stress test scenario evaluating coronary artery disease, o1-preview’s distribution was closer to the target range than GPT-4 or even the human median. However, the difference overall was not as striking as in other tasks, reinforcing that purely numeric calibration of risk remains a challenge.
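For context, the arithmetic these vignettes implicitly test is the odds form of Bayes’ theorem: post-test odds equal pretest odds multiplied by the likelihood ratio. The sketch below walks through one hypothetical update; the numbers are illustrative and not drawn from the study.

```python
def posttest_probability(pretest_prob, likelihood_ratio):
    """Convert a pretest probability to a post-test probability via the
    odds form of Bayes' theorem: post-test odds = pretest odds * LR."""
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)

# Hypothetical numbers for illustration only: a 10% pretest probability and a
# positive test with a likelihood ratio of 6 yields a post-test probability of 40%.
print(round(posttest_probability(0.10, 6.0), 2))  # 0.4
```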
6.5 Triage “Cannot-Miss” Diagnoses
Although o1-preview performed strongly in enumerating “cannot-miss” diagnoses in the NEJM Healer cases, it didn’t represent a statistically significant leap over GPT-4. That said, both LLMs typically outperformed the median physician baseline. This pattern suggests that LLMs may be robust in ensuring broad coverage of serious possibilities, but the real breakthroughs appear in their capacity to integrate complex data and propose next steps.
7. Analysis: Why Does o1-preview Excel?
There are a few hypotheses:
- Enhanced Chain-of-Thought
- By internally generating a longer chain of reasoning at run-time, the model can better handle multi-step inferential tasks.
- Reference: Introducing OpenAI o1-preview. (2024)
- Large-Scale Training with Real Medical Data
- Earlier models may have leaned heavily on general-domain knowledge. By refining the data pipeline and using specialized prompts, o1-preview appears more adept at bridging basic science with clinical presentations.
- Reference: Nori et al. From Medprompt to o1: Exploration of run-time strategies. (2024)
- Reduced Hallucinations
- The study did not systematically measure hallucinations (fabricated facts), but the improved performance might reflect better guardrails.
- Vast Knowledge Integration
- The difference in performance for complex management cases suggests that o1-preview effectively synthesizes guidelines, standards of care, and pathophysiologic principles in ways that can eclipse even resourceful clinicians.
8. Implications for Clinical Practice
If validated in further real-world trials, an AI system that outperforms experienced physicians in diagnosing and managing complex cases could lead to fewer diagnostic errors, improved patient safety, and cost savings due to less “trial-and-error” testing. Potential use cases might include:
- Decision Support in Hospital Wards: Real-time assistance for hospitalists, guiding the next diagnostic test or management step.
- Primary Care Triage: Especially valuable in telemedicine, where a physician cannot examine a patient in person.
- Rural or Underserved Areas: Could reduce the gap in specialist availability by providing advanced reasoning to generalists.
Still, the study’s authors and many domain experts (e.g., Morgan et al. (2021) and Ratwani et al. (2024)) caution that these models may be at risk of unanticipated failure modes. Integration must be accompanied by robust monitoring systems, clinical trials, and clarifications about accountability, ethical considerations, and potential biases.
9. Remaining Challenges
9.1 Memorization and Data Leakage
Despite the strong results, the study notes that some NEJM CPCs may have been present in the large textual corpus used to train or fine-tune o1-preview. Although performance fell only slightly on more recent cases, it remains possible that the model had partially memorized some of the material.
Researchers’ approach: They compared performance before and after October 2023—the approximate cutoff date for the model’s training. No major difference was observed, but continued vigilance is needed.
9.2 Probabilistic Reasoning Shortcomings
As shown in the Diagnostic Probabilistic Reasoning Cases, o1-preview tends to overestimate or mis-calibrate pretest probabilities. This underlines how these LLMs, for all their textual abilities, may not excel at numerical tasks without specialized alignment or calibration.
From a practical standpoint, a miscalibrated model could distort crucial risk-benefit judgments in borderline or emergent clinical scenarios.
9.3 Real-World Workflows and Integration
The environment of a real hospital or clinic is chaotic, with constant interruptions, incomplete data, and organizational constraints. The question is whether integrating an advanced LLM-based tool like o1-preview might hamper existing workflows, generate alert fatigue, or require time-consuming oversight.
Clinical environments typically revolve around:
- Electronic Health Record (EHR) Systems: Integration requires stable, standardized APIs (e.g., FHIR) and sophisticated user-interface design (see the sketch after this list).
- Liability and Accountability: Practitioners remain legally responsible for decisions. If o1-preview suggests a certain management path that leads to harm, who is accountable?
- Data Privacy: PHI (Protected Health Information) must remain secure and comply with regulations (e.g., HIPAA in the U.S.).
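To give a rough sense of what EHR integration involves at the API level, the sketch below pulls a single Patient resource from a FHIR server. The base URL, token, and patient ID are placeholders; a production deployment would authenticate via SMART on FHIR and handle consent, auditing, and error cases.

```python
import requests

# Hypothetical FHIR endpoint, patient ID, and token; a real deployment would use
# the hospital's EHR FHIR base URL and an OAuth2 (SMART on FHIR) access token.
FHIR_BASE = "https://example-ehr.org/fhir"
PATIENT_ID = "12345"
TOKEN = "replace-with-access-token"

resp = requests.get(
    f"{FHIR_BASE}/Patient/{PATIENT_ID}",
    headers={
        "Accept": "application/fhir+json",
        "Authorization": f"Bearer {TOKEN}",
    },
    timeout=10,
)
resp.raise_for_status()
patient = resp.json()

# A Patient resource carries demographics that could be summarized into a prompt
# for an LLM-based decision-support tool.
print(patient.get("birthDate"), [n.get("family") for n in patient.get("name", [])])
```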
10. Looking Ahead: The Future of AI in Medicine
AI in medicine is at an inflection point. The o1-preview model’s “superhuman” performance in certain tasks, especially after only a few years of incremental improvements from GPT-2 to GPT-4, signals the potential for even faster leaps as new generative or foundation models come online. Some experts anticipate multi-modal expansions that incorporate:
- Medical Imaging (X-rays, CT scans, MRIs)
- Pathology Slides
- Genomic Data
Early moves in that direction include work by Buckley et al. (2024), who explored how large language models can interpret text and radiological images together. The synergy of combining textual clinical notes with diagnostic images might yield an even more nuanced AI assistant, further challenging the boundaries of what was once strictly a physician’s domain.
11. Conclusion
The new study’s results are, by any measure, remarkable: OpenAI’s o1-preview outperforms practicing doctors on tricky diagnoses in multiple test sets. From NEJM CPCs to specialized management vignettes, the AI’s ability to produce a robust differential diagnosis, design advanced test and treatment plans, and articulate its clinical reasoning in a coherent, exam-ready format suggests we’re close to an era where AI is no longer just a reference tool—it might become a frontline partner in patient care.
Still, it’s critical to keep a balanced view. More robust and pragmatic benchmarks must be developed—ones that are less susceptible to memorization or semantic pattern recognition. Additionally, thorough clinical trials are needed to confirm that these systems reduce diagnostic errors in real practice, rather than introducing new forms of bias or confusion.
As physician and data scientist Dr. Jonathan Chen from Stanford puts it, “We’re seeing new levels of performance that challenge assumptions about the function of LLM-based systems. Yet we must carefully deploy these models with the same rigor we apply to any new drug or device, ensuring that the patient benefits come first.”
In the meantime, for clinicians who have felt overwhelmed by the complexity of modern medicine, these early findings on o1-preview offer a glimpse of a future where technology might lighten the cognitive load, improve the quality of care, and help doctors focus on what they do best: compassionate, hands-on patient care.