Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam

dc.article.number: 1499
dc.catalogador: gjm
dc.contributor.author: Altermatt Couratier, Fernando René
dc.contributor.author: Neyem, Andrés
dc.contributor.author: Sumonte Fuenzalida, Nicolás Ignacio
dc.contributor.author: Villagrán Gutiérrez, Ignacio Andrés
dc.contributor.author: Mendoza Rocha, Marcelo
dc.contributor.author: Lacassie Quiroga, Héctor
dc.contributor.author: Delfino Yurin, Alejandro
dc.date.accessioned: 2025-12-10T12:54:20Z
dc.date.available: 2025-12-10T12:54:20Z
dc.date.issued: 2025
dc.date.updated: 2025-11-02T01:04:50Z
dc.description.abstract:
Background: Large language models (LLMs) such as GPT-4o have the potential to transform clinical decision-making, patient education, and medical research. Despite impressive performance in generating patient-friendly educational materials and assisting in clinical documentation, concerns remain regarding reliability, subtle errors, and biases that can undermine their use in high-stakes medical settings.
Methods: A multi-phase experimental design was employed to assess the performance of GPT-4o on the Chilean anesthesiology exam (CONACEM), which comprised 183 questions covering four cognitive domains based on Bloom's taxonomy: Understanding, Recall, Application, and Analysis. Thirty independent simulation runs were conducted with systematic variation of the model's temperature parameter to gauge the balance between deterministic and creative responses. The generated responses underwent qualitative error analysis using a refined taxonomy with categories such as "Unsupported Medical Claim," "Hallucination of Information," "Sticking with Wrong Diagnosis," "Non-medical Factual Error," "Incorrect Understanding of Task," "Reasonable Response," "Ignore Missing Information," and "Incorrect or Vague Conclusion." Two board-certified anesthesiologists performed independent annotations, with disagreements resolved by a third expert. Statistical evaluations, including one-way ANOVA, non-parametric tests, chi-square, and linear mixed-effects modeling, were used to compare performance across domains and analyze error frequency.
Results: GPT-4o achieved an overall accuracy of 83.69%. Performance varied significantly by cognitive domain, with the highest accuracy in the Understanding (90.10%) and Recall (84.38%) domains and lower accuracy in Application (76.83%) and Analysis (76.54%). Among the 120 incorrect responses, unsupported medical claims were the most common error (40.69%), followed by vague or incorrect conclusions (22.07%). Co-occurrence analyses revealed that unsupported claims often appeared alongside imprecise conclusions, highlighting a pattern of compounded errors, particularly in tasks requiring complex reasoning. Inter-rater reliability for error annotation was robust, with a mean Cohen's kappa of 0.73.
Conclusions: While GPT-4o exhibits strengths in factual recall and comprehension, its limitations in higher-order reasoning and diagnostic judgment are evident in frequent unsupported medical claims and vague conclusions. These findings underscore the need for improved domain-specific fine-tuning, enhanced error-mitigation strategies, and integrated knowledge-verification mechanisms prior to clinical deployment.
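The Methods and Results above describe a quantitative pipeline: per-domain accuracy over repeated runs, a chi-square comparison of performance across Bloom's-taxonomy domains, and Cohen's kappa between the two annotators. A minimal sketch of the two simplest of those analyses follows, assuming a tidy one-row-per-graded-answer table; the column names, toy values, and example error labels are illustrative assumptions, not the study's data, and only the library calls (pandas, scipy, scikit-learn) are real.

```python
# Illustrative sketch of the abstract's statistical analyses, on toy data.
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# One row per graded answer; the real study had 30 runs x 183 questions.
results = pd.DataFrame({
    "domain": ["Understanding", "Recall", "Application", "Analysis"] * 5,
    "correct": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0],
})

# Accuracy per cognitive domain.
print(results.groupby("domain")["correct"].mean())

# Chi-square test of independence between domain and correctness.
table = pd.crosstab(results["domain"], results["correct"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

# Inter-rater agreement on error labels (the paper reports a mean
# Cohen's kappa of 0.73 across two board-certified annotators).
rater_a = ["Unsupported Medical Claim", "Hallucination of Information",
           "Incorrect or Vague Conclusion", "Unsupported Medical Claim"]
rater_b = ["Unsupported Medical Claim", "Hallucination of Information",
           "Unsupported Medical Claim", "Unsupported Medical Claim"]
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))
```

The paper additionally reports one-way ANOVA, non-parametric tests, and linear mixed-effects modeling; those would layer on top of the same tidy table (e.g., via statsmodels) but are omitted here for brevity.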
dc.fechaingreso.objetodigital: 2025-12-10
dc.format.extent: 14 pages
dc.fuente.origen: Self-archiving
dc.identifier.citation: BMC Medical Education. 2025 Oct 27;25(1):1499
dc.identifier.doi: 10.1186/s12909-025-08084-9
dc.identifier.uri: https://doi.org/10.1186/s12909-025-08084-9
dc.identifier.uri: https://repositorio.uc.cl/handle/11534/107323
dc.information.autoruc: Escuela de Medicina; Altermatt Couratier, Fernando René; 0000-0002-0464-8643; 7381
dc.information.autoruc: Escuela de Ingeniería; Neyem, Andrés; 0000-0002-5734-722X; 1007638
dc.information.autoruc: Escuela de Ingeniería; Sumonte Fuenzalida, Nicolás Ignacio; S/I; 1046132
dc.information.autoruc: Escuela de Ingeniería; Villagrán Gutiérrez, Ignacio Andrés; 0000-0003-3130-8326; 1039444
dc.information.autoruc: Escuela de Ingeniería; Mendoza Rocha, Marcelo; 0000-0002-7969-6041; 1237020
dc.information.autoruc: Escuela de Medicina; Lacassie Quiroga, Héctor; 0000-0001-5758-4113; 68956
dc.information.autoruc: Escuela de Medicina; Delfino Yurin, Alejandro; 0000-0002-0659-7130; 129220
dc.language.iso: en
dc.language.iso: contenido completo
dc.language.rfc3066: en
dc.revista: BMC Medical Education
dc.rights: open access
dc.rights.holder: The Author(s)
dc.rights.license: CC BY-NC-ND 4.0 Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Large language models
dc.subject: GPT-4o
dc.subject: Anesthesiology
dc.subject: Medical AI evaluation
dc.subject: Clinical decision support
dc.subject: Spanish-language exam
dc.subject: Diagnostic reasoning
dc.subject.ddc: 610
dc.subject.ods: 03 Good health and well-being
dc.subject.odspa: 03 Salud y bienestar
dc.title: Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam
dc.type: article
dc.volumen: 25
sipa.codpersvinculados: 7381
sipa.codpersvinculados: 1007638
sipa.codpersvinculados: 1046132
sipa.codpersvinculados: 1039444
sipa.codpersvinculados: 1237020
sipa.codpersvinculados: 68956
sipa.codpersvinculados: 129220
Files
Original bundle
Name: 12909_2025_Article_8084.pdf
Size: 2.15 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.98 KB
Format: Item-specific license agreed upon to submission