Robots Atlas
May 4, 2026 · 5 min read · AI diagnostics · Healthcare AI · Harvard

AI Outdiagnoses Doctors in Harvard Emergency Room Study

A team of researchers from Harvard Medical School and Beth Israel Deaconess Medical Center has published in Science the results of a controlled experiment comparing OpenAI models o1 and 4o against internal medicine physicians diagnosing patients in the emergency department. At the most critical juncture — initial triage, with the least data and highest time pressure — model o1 produced an accurate or near-accurate diagnosis in 67% of cases, while the two physicians scored 55% and 50%, respectively.

Key Takeaways

  • Study published in Science (May 2026), covering 76 real emergency cases from Beth Israel in Boston.
  • OpenAI o1 achieved 67% accuracy at initial triage vs. 55% and 50% for two internal medicine physicians.
  • AI and physicians worked on identical, unmodified data from electronic patient records — no preprocessing.
  • Researchers do not claim AI is ready for autonomous clinical decisions; they call for urgent prospective trials.
  • Comparison was made against internists, not ER specialists — a significant methodological limitation.

The Experiment with Real Cases

The study, published May 3, 2026 in Science, led by Arjun Manrai from the AI Lab at Harvard Medical School and Adam Rodman from Beth Israel Deaconess Medical Center, differs from prior AI-in-medicine tests on one critical point: the data was not curated. Models received exactly the same information available in the electronic medical system at each diagnostic checkpoint — from initial presentation through final diagnosis.

Researchers identified three moments in the histories of 76 patients: initial triage (least data), an intermediate point, and final diagnosis. Each diagnosis was scored by independent internal medicine physicians who did not know whether they were evaluating a human or machine diagnosis.

At initial triage, model o1 identified "an accurate or near-accurate diagnosis" in 67% of cases; the internal medicine physicians scored 55% and 50%. Model 4o performed comparably to the physicians or slightly below, depending on the diagnostic checkpoint.

"We tested the AI model on essentially every benchmark available and it outperformed both previous models and the physician baseline" — Arjun Manrai, Harvard Medical School

Key Methodological Limitations

The study drew rapid responses from the medical community. Kristen Panthagani, an ER physician, published a blog analysis pointing out a core issue: the comparison was against internists, not emergency medicine specialists. For ER specialists, this may be a fundamental distinction — like comparing an AI to a dermatologist on a neurosurgical task.

"My primary goal in the emergency department is not to guess the final diagnosis. My goal is to determine whether the patient has a life-threatening condition" — Kristen Panthagani, ER physician

Rodman acknowledged in a conversation with The Guardian that "there are no formal accountability frameworks right now" around AI diagnoses, and that patients still "want humans to guide them through life-and-death decisions."

The researchers also note that AI worked exclusively with text data — EHR entries, lab results, symptom descriptions. Models did not analyze imaging, ECG results, or direct clinical observations. As the authors themselves state, "existing research suggests current large language models have limitations in reasoning from non-text data."

Context: AI in Medical Diagnosis in 2026

This is not the first study suggesting that large language models achieve results comparable to physicians in diagnostic tasks. However, prior tests relied on synthetic data or specially prepared question sets (e.g., medical licensing exams, MedQA benchmark). The Beth Israel study stands out for using real, unmodified patient records.

Google DeepMind announced around the same time an "AI co-clinician" project — a model designed to assist, not replace, physicians. The Harvard study feeds into an ongoing debate: should AI aspire to clinical diagnosis at all, or should it serve as a screening assistant or a tool for underserved communities with limited access to specialists?

The study's authors explicitly lean toward the second scenario. The key phrase in their conclusions is an "urgent need for prospective clinical trials," not deployment of AI to the ER tomorrow.

Why This Matters

The Harvard Medical School study is one of the few experiments conducted on real emergency department data — without preprocessing, with diagnosis sources unknown to external evaluators. The result in which model o1 outperforms internal medicine physicians at initial triage is not proof of AI's clinical readiness, but a signal demanding serious methodological follow-through: prospective, multi-center trials involving the right specialists (ER physicians, not internists).

More important than the raw numbers is what the study reveals structurally: AI operates without fatigue, without anchoring bias at first contact, and with full access to the entire patient record history — something physicians during triage often review only superficially. If the model genuinely leverages these advantages — and if this is confirmed in settings with chronic physician shortages, such as rural hospitals or developing countries — the scale of implications extends far beyond the specific numbers of a single study.

Equally important is the accountability debate. Rodman rightly notes: there are no legal or ethical frameworks today for situations where AI participates in a clinical decision. This is a question that the healthcare sector, regulators, and patients themselves will need to resolve — before AI is actually placed at the bedside.

What's Next?

  • Authors call for urgent prospective clinical trials with live patients — with AI models in the role of diagnostic assistant, not autonomous diagnostician.
  • Study results will likely be discussed at medical conferences in 2026, potentially accelerating pilot projects at select academic hospitals.
  • The regulatory question remains open: the FDA (US) and EMA (EU) do not yet have a certification pathway for AI models as clinical-grade diagnostic tools.
