Top AI models fail spectacularly when faced with slightly altered medical questions

PsyPost: “Artificial intelligence systems often perform impressively on standardized medical exams, but new research suggests these test scores may be misleading. A study published in JAMA Network Open indicates that large language models, or LLMs, might not actually “reason” through clinical questions. Instead, they seem to rely heavily on recognizing familiar answer patterns. When those patterns were slightly altered, the models’ performance dropped significantly, sometimes by more than half. Large language models are a type of artificial intelligence system trained to process and generate human-like language. They are built using vast datasets that include books, scientific papers, web pages, and other text sources. By analyzing patterns in this data, these models learn how to respond to questions, summarize information, and even simulate reasoning. In recent years, several models have achieved high scores on medical exams, sparking interest in using them to support clinical decision-making. But high test scores do not necessarily indicate an understanding of the underlying content. Instead, many of these models may simply be predicting the most likely answer based on statistical patterns. This raises the question: are they truly reasoning about medical scenarios, or just mimicking answers they’ve seen before? That’s what the researchers behind the new study set out to examine. “I am particularly excited about bridging the gap between model building and model deployment and the right evaluation is key to that,” explained study author Suhana Bedi, a PhD student at Stanford University. “We have AI models achieving near perfect accuracy on benchmarks like multiple choice based medical licensing exam questions. But this doesn’t reflect the reality of clinical practice. We found that less than 5% of papers evaluate LLMs on real patient data which can be messy and fragmented.

“So, we released a benchmark suite of 35 benchmarks mapped to a taxonomy of real medical and healthcare tasks that were verified by 30 clinicians. We found that most models (including reasoning models) struggled on Administrative and Clinical Decision Support tasks.”
