Evaluating the Performance of Large Language Models on the CONACEM Anesthesiology Certification Exam: A Comparison with Human Participants
Date
2025
Abstract
Large Language Models (LLMs) have demonstrated strong performance on English-language medical exams, but their effectiveness in non-English, high-stakes settings is less well understood. This study benchmarks nine LLMs against human examinees on the Chilean Anesthesiology Certification Exam (CONACEM), a Spanish-language board examination. A curated set of 63 multiple-choice questions was used, categorized by Bloom’s taxonomy into four cognitive levels. Model responses were assessed using Item Response Theory and Classical Test Theory, complemented by an error analysis that categorized errors as reasoning-based, knowledge-based, or comprehension-related. Closed-source models surpassed open-source models, with GPT-o1 achieving the highest accuracy (88.7%). Deepseek-R1 was a strong performer among open-source options. Item difficulty significantly predicted model accuracy, while discrimination did not. Most errors occurred in application and understanding tasks and were linked to flawed reasoning or knowledge misapplication. These results underscore LLMs’ potential for factual recall in Spanish-language medical exams but also their limitations in complex reasoning. Incorporating cognitive classification and an error taxonomy provides deeper insight into model behavior and supports the cautious use of LLMs as educational aids in clinical settings.
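For context, the item difficulty and discrimination parameters mentioned above are standard Item Response Theory quantities. Assuming a two-parameter logistic (2PL) model (the abstract does not state which IRT model was fitted), the probability that an examinee with ability \(\theta_j\) answers item \(i\) correctly is:

\[
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}}
\]

where \(b_i\) is the item's difficulty and \(a_i\) its discrimination; the reported finding is that \(b_i\), but not \(a_i\), was a significant predictor of whether the models answered an item correctly.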
Keywords
Anesthesiology certification, Clinical reasoning assessment, Language model benchmarking, Medical AI evaluation, Non-English medical exams, Psychometric analysis, Spanish-language healthcare, Zero-shot prompting