Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam
Loading...
Date
2025
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Background Large language models (LLMs) such as GPT-4o have the potential to transform clinical decision-making, patient education, and medical research. Despite impressive performance in generating patient-friendly educational materials and assisting in clinical documentation, concerns remain regarding the reliability, subtle errors, and biases that can undermine their use in high-stakes medical settings. Methods A multi-phase experimental design was employed to assess the performance of GPT-4o on the Chilean anesthesiology exam (CONACEM), which comprised 183 questions covering four cognitive domains—Understanding, Recall, Application, and Analysis—based on Bloom’s taxonomy. Thirty independent simulation runs were conducted with systematic variation of the model’s temperature parameter to gauge the balance between deterministic and creative responses. The generated responses underwent qualitative error analysis using a refined taxonomy that categorized errors such as “Unsupported Medical Claim,” “Hallucination of Information,” “Sticking with Wrong Diagnosis,” “Non-medical Factual Error,” “Incorrect Understanding of Task,” “Reasonable Response,” “Ignore Missing Information,” and “Incorrect or Vague Conclusion.” Two board-certified anesthesiologists performed independent annotations, with disagreements resolved by a third expert. Statistical evaluations—including one-way ANOVA, non-parametric tests, chi-square, and linear mixed-effects modeling—were used to compare performance across domains and analyze error frequency. Results GPT-4o achieved an overall accuracy of 83.69%. Performance varied significantly by cognitive domain, with the highest accuracy observed in the Understanding (90.10%) and Recall (84.38%) domains, and lower accuracy in Application (76.83%) and Analysis (76.54%). Among the 120 incorrect responses, unsupported medical claims were the most common error (40.69%), followed by vague or incorrect conclusions (22.07%). Co-occurrence analyses revealed that unsupported claims often appeared alongside imprecise conclusions, highlighting a trend of compounded errors particularly in tasks requiring complex reasoning. Inter-rater reliability for error annotation was robust, with a mean Cohen’s kappa of 0.73. Conclusions While GPT-4o exhibits strengths in factual recall and comprehension, its limitations in handling higher-order reasoning and diagnostic judgment are evident through frequent unsupported medical claims and vague conclusions. These findings underscore the need for improved domain-specific fine-tuning, enhanced error mitigation strategies, and integrated knowledge verification mechanisms prior to clinical deployment.
Description
Keywords
Large language models, GPT-4o, Anesthesiology, Medical AI evaluation, Clinical decision support, Spanish-language exam, Diagnostic reasoning
Citation
BMC Medical Education. 2025 Oct 27;25(1):1499
