
In emergency room trials, Harvard study finds AI outperformed doctors with 67% accuracy

Artificial intelligence (AI) and robotics are increasingly positioned as the future of healthcare, promising to transform diagnosis, treatment, and patient care. However, there is a lack of evidence showing that AI-powered systems can reliably deliver in high-pressure clinical settings, where accuracy and speed are critical.
A new study examining the performance of large language models (LLMs) in medical contexts, specifically a real-life emergency room, has found that at least one LLM was able to diagnose patients more accurately than human doctors, providing the exact or a very close diagnosis in 67 per cent of cases, compared with an accuracy rate of 50-55 per cent for the human doctors.

The study was published last week in the journal Science by a team of researchers comprising physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center.
The findings of the study, based on trials that tested the responses of hundreds of doctors against LLMs, suggest that AI-powered systems may be inching closer to supporting doctors in real-world decision-making. It comes at a time when AI adoption in healthcare is picking up, with nearly one in five US physicians already using AI to assist diagnosis, as per the American Medical Association (AMA).
Another survey carried out by the Royal College of Physicians found that 16 per cent of doctors in the UK are using AI tools daily for clinical decision-making.
“I don’t think our findings mean that AI replaces doctors. I think it does mean that we’re witnessing a really profound change in technology that will reshape medicine,” Arjun Manrai, one of the lead authors of the Harvard study, was quoted as saying by The Guardian. Dr Adam Rodman, another lead author, expects that AI systems will not replace physicians but join them in a “triadic care model … the doctor, the patient, and an artificial intelligence system.”
Research methodology
As part of their experiment, the researchers selected 76 patients who came into the Beth Israel emergency room. Two internal medicine attending physicians were asked to diagnose the patients, while OpenAI’s o1 reasoning model and GPT-4o were used to independently generate diagnoses for the same group of patients.


The researchers emphasised that the patient data was not pre-processed, meaning that the AI models were presented with the same information that was available in the patients’ electronic medical records at the time of each diagnosis.
All the diagnoses were assessed by two other attending physicians, who did not know which ones came from humans and which ones were AI-generated.
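As an illustration of the kind of query involved (a minimal sketch, not the study’s actual code), a triage note can be passed to a model such as GPT-4o through OpenAI’s standard Python SDK. The prompt wording and the sample note here are invented for demonstration:

# Illustrative sketch only, not the study's protocol. Assumes the
# official OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# A made-up triage note, standing in for the unprocessed record text
# the researchers say was given to the models.
triage_note = (
    "68-year-old man. BP 92/60, HR 118, temp 38.9 C, SpO2 93%. "
    "Two days of productive cough, now newly confused."
)

response = client.chat.completions.create(
    model="gpt-4o",  # the study also tested the o1 reasoning model
    messages=[{
        "role": "user",
        "content": "Given this emergency triage note, list the most "
                   "likely diagnoses in order of probability:\n\n" + triage_note,
    }],
)
print(response.choices[0].message.content)

In the study itself, outputs of this kind were then graded blind alongside the human diagnoses.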
Key findings
At the first diagnostic touchpoint (initial ER triage), where minimal information is available, such as vital signs, demographic details, and a few sentences on why the patient was there, the AI models identified the exact or a very close diagnosis in 67 per cent of cases, outperforming the human doctors, who were right in only 50-55 per cent of cases.
When more information about the patients was made available, the diagnostic accuracy of OpenAI’s o1 reasoning model rose to 82 per cent, compared with the 70-79 per cent achieved by the human doctors. “At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o,” the study said.


When a larger cohort of 46 human doctors and the two AI models were asked to examine five clinical case studies, the models scored 89 per cent, compared with the 34 per cent obtained by human doctors relying on conventional tools such as search engines.
Limitations of the study
The study focused only on the AI models’ responses to records in text form. It did not evaluate the LLMs on other signals, such as a patient’s level of distress or visual appearance, the researchers acknowledged in the study. Present-day AI models such as o1 and GPT-4o are also prone to errors and hallucinations, posing severe liability risks.
Doctors have also highlighted the absence of a formal framework for accountability, and most patients would want human doctors to guide them through life-or-death decisions rather than rely on AI tools. Notably, the rise of AI tools in healthcare could also lead to doctors deferring to AI-generated answers without thinking independently. The Harvard study also did not evaluate how accurately LLMs diagnose elderly patients or non-English speakers.
How doctors are reacting to the study
“These systems are no longer just passing medical exams or solving artificial test cases. They are starting to look like useful second-opinion tools for clinicians, particularly when it is important to consider a wider range of possible diagnoses and avoid missing something important,” Professor Ewen Harrison, co-director of the University of Edinburgh’s centre for medical informatics, said.


“If we’re going to compare AI tools to physicians’ clinical ability, we should start by comparing [them] to physicians who actually practice that specialty. I would not be surprised if a LLM could beat a dermatologist at a neurosurgery board exam, [but] that’s not a particularly helpful thing to know,” Kristen Panthagani, an emergency physician, said in an online post.
“As an ER doctor seeing a patient for the first time, my primary goal is not to guess your ultimate diagnosis. My primary goal is to determine if you have a condition that could kill you,” Panthagani added.

