Evaluating the Performance of Large Language Models on the CONACEM Anesthesiology Certification Exam: A Comparison with Human Participants

dc.article.number: 6245
dc.catalogador: yvc
dc.contributor.author: Altermatt Couratier, Fernando René
dc.contributor.author: Neyem, Hugo Andrés
dc.contributor.author: Sumonte Fuenzalida, Nicolás Ignacio
dc.contributor.author: Villagrán, Ignacio
dc.contributor.author: Mendoza, Marcelo
dc.contributor.author: Lacassie Quiroga, Héctor Javier
dc.date.accessioned: 2025-06-27T15:17:33Z
dc.date.available: 2025-06-27T15:17:33Z
dc.date.issued: 2025
dc.description.abstract: Large Language Models (LLMs) have demonstrated strong performance on English-language medical exams, but their effectiveness in non-English, high-stakes settings is less well understood. This study benchmarks nine LLMs against human examinees on the Chilean Anesthesiology Certification Exam (CONACEM), a Spanish-language board examination. A curated set of 63 multiple-choice questions was used, categorized by Bloom’s taxonomy into four cognitive levels. Model responses were assessed using Item Response Theory and Classical Test Theory, complemented by an error analysis that classified errors as reasoning-based, knowledge-based, or comprehension-related. Closed-source models surpassed open-source models, with GPT-o1 achieving the highest accuracy (88.7%); Deepseek-R1 was a strong performer among open-source options. Item difficulty significantly predicted model accuracy, whereas item discrimination did not. Most errors occurred in application and understanding tasks and were linked to flawed reasoning or misapplied knowledge. These results underscore LLMs’ potential for factual recall on Spanish-language medical exams but also their limitations in complex reasoning. Incorporating cognitive classification and an error taxonomy provides deeper insight into model behavior and supports the cautious use of LLMs as educational aids in clinical settings.
dc.fechaingreso.objetodigital: 2025-06-18
dc.fuente.origen: ORCID
dc.identifier.doi: 10.3390/app15116245
dc.identifier.uri: https://doi.org/10.3390/app15116245
dc.identifier.uri: https://www.mdpi.com/2076-3417/15/11/6245
dc.identifier.uri: https://repositorio.uc.cl/handle/11534/104784
dc.information.autoruc: Escuela de Medicina; Altermatt Couratier, Fernando René; 0000-0002-0464-8643; 7381
dc.information.autoruc: Escuela de Ingeniería; Neyem, Hugo Andrés; 0000-0002-5734-722X; 1007638
dc.information.autoruc: Escuela de Ingeniería; Sumonte Fuenzalida, Nicolás Ignacio; S/I; 1046132
dc.information.autoruc: Escuela de Medicina; Lacassie Quiroga, Héctor Javier; 0000-0001-5758-4113; 68956
dc.language.iso: en
dc.nota.acceso: full content
dc.revista: Applied Sciences
dc.rights: open access
dc.rights.license: CC BY Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Anesthesiology certification
dc.subject: Clinical reasoning assessment
dc.subject: Language model benchmarking
dc.subject: Medical AI evaluation
dc.subject: Non-English medical exams
dc.subject: Psychometric analysis
dc.subject: Spanish-language healthcare
dc.subject: Zero-shot prompting
dc.subject.ddc: 610
dc.subject.dewey: Medicine and health
dc.subject.ods: 03 Good health and well-being
dc.subject.odspa: 03 Salud y bienestar
dc.title: Evaluating the Performance of Large Language Models on the CONACEM Anesthesiology Certification Exam: A Comparison with Human Participants
dc.type: article
sipa.codpersvinculados: 7381
sipa.codpersvinculados: 1007638
sipa.codpersvinculados: 1046132
sipa.codpersvinculados: 68956
sipa.trazabilidad: ORCID; 2025-06-16