Performance of single-agent and multi-agent language models in Spanish language medical competency exams

dc.article.number: 666
dc.catalogador: pva
dc.contributor.author: Altermatt Couratier, Fernando René
dc.contributor.author: Neyem, Andrés
dc.contributor.author: Sumonte Fuenzalida, Nicolás Ignacio
dc.contributor.author: Mendoza Rocha, Marcelo
dc.contributor.author: Villagrán Gutiérrez, Ignacio Andrés
dc.contributor.author: Lacassie Quiroga, Héctor
dc.date.accessioned: 2025-05-15T20:04:56Z
dc.date.available: 2025-05-15T20:04:56Z
dc.date.issued: 2025
dc.date.updated: 2025-05-11T00:04:10Z
dc.description.abstract:
Background: Large language models (LLMs) like GPT-4o have shown promise in advancing medical decision-making and education. However, their performance in Spanish-language medical contexts remains underexplored. This study evaluates the effectiveness of single-agent and multi-agent strategies in answering questions from the EUNACOM, a standardized medical licensure exam in Chile, across 21 medical specialties.
Methods: GPT-4o was tested on 1,062 multiple-choice questions from publicly available EUNACOM preparation materials. Single-agent strategies included Zero-Shot, Few-Shot, Chain-of-Thought (CoT), Self-Reflection, and MED-PROMPT, while multi-agent strategies involved Voting, Weighted Voting, Borda Count, MEDAGENTS, and MDAGENTS. Each strategy was tested under three temperature settings (0.3, 0.6, 1.2). Performance was assessed by accuracy, and statistical analyses, including Kruskal–Wallis and Mann–Whitney U tests, were performed. Computational resource utilization, such as API calls and execution time, was also analyzed.
Results: MDAGENTS achieved the highest accuracy with a mean score of 89.97% (SD = 0.56%), outperforming all other strategies (p < 0.001). MEDAGENTS followed with a mean score of 87.99% (SD = 0.49%), and the CoT with Few-Shot strategy scored 87.67% (SD = 0.12%). Temperature settings did not significantly affect performance (F(2, 54) = 1.45, p = 0.24). Specialty-level analysis showed the highest accuracies in Psychiatry (95.51%), Neurology (95.49%), and Surgery (95.38%), while lower accuracies were observed in Neonatology (77.54%), Otolaryngology (76.64%), and Urology/Nephrology (76.59%). Notably, several exam questions were correctly answered using simpler single-agent strategies without employing complex reasoning or collaboration frameworks.
Conclusions and relevance: Multi-agent strategies, particularly MDAGENTS, significantly enhance GPT-4o’s performance on Spanish-language medical exams, leveraging collaboration to improve diagnostic accuracy. However, simpler single-agent strategies are sufficient to address many questions, highlighting that only a fraction of standardized medical exams require sophisticated reasoning or multi-agent interaction. These findings suggest potential for LLMs as efficient and scalable tools in Spanish-speaking healthcare, though computational optimization remains a key area for future research.
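The abstract names three vote-aggregation schemes among the multi-agent strategies (Voting, Weighted Voting, Borda Count) but this record does not describe how the study implemented them. As an illustration only, the textbook forms of these schemes can be sketched in Python as follows. The function names, the per-agent weights, and the sample answers are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Plurality voting: return the option picked by the most agents.

    Ties go to the option that appeared first in `answers`.
    """
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, weights):
    """Weighted voting: each agent's answer counts with its own weight.

    `weights` is a hypothetical per-agent confidence or reliability score.
    """
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


def borda_count(rankings):
    """Borda count: each agent ranks all options from best to worst.

    With n options, the top-ranked option earns n - 1 points, the next
    n - 2, and so on down to 0; the option with the highest total wins.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] += n - 1 - position
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical answers from three agents to one multiple-choice question.
    print(majority_vote(["A", "B", "A"]))
    print(weighted_vote(["A", "B", "B"], [0.9, 0.3, 0.3]))
    print(borda_count([["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]))
```

Plurality voting uses only each agent's top choice, whereas the Borda count also rewards options that are consistently ranked near the top, so the two can disagree on the same set of agent outputs.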
dc.description.funder: ANID
dc.description.funder: FONDEF
dc.fechaingreso.objetodigital: 2025-05-11
dc.format.extent: 11 pages
dc.fuente.origen: Biomed Central
dc.identifier.citation: BMC Medical Education. 2025 May 07;25(1):666
dc.identifier.doi: 10.1186/s12909-025-07250-3
dc.identifier.issn: 1472-6920
dc.identifier.uri: https://doi.org/10.1186/s12909-025-07250-3
dc.identifier.uri: https://repositorio.uc.cl/handle/11534/104329
dc.information.autoruc: Escuela de Medicina; Altermatt Couratier, Fernando René; 0000-0002-0464-8643; 7381
dc.information.autoruc: Escuela de Ingeniería; Neyem, Andrés; 0000-0002-5734-722X; 1007638
dc.information.autoruc: Escuela de Ingeniería; Sumonte Fuenzalida, Nicolás Ignacio; S/I; 1046132
dc.information.autoruc: Escuela de Ingeniería; Mendoza Rocha, Marcelo; S/I; 1237020
dc.information.autoruc: Escuela de Ingeniería; Villagrán Gutiérrez, Ignacio Andrés; 0000-0003-3130-8326; 1039444
dc.information.autoruc: Escuela de Medicina; Lacassie Quiroga, Héctor; 0000-0001-5758-4113; 68956
dc.issue.numero: 1
dc.language.iso: en
dc.nota.acceso: full content
dc.publisher: Springer Nature
dc.revista: BMC Medical Education
dc.rights: open access
dc.rights.holder: The Author(s)
dc.rights.license: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Large language models
dc.subject: Medical decision-making
dc.subject: Spanish medical contexts
dc.subject: Medical AI
dc.subject: GPT-4o
dc.subject.ddc: 610
dc.subject.dewey: Medicine and health
dc.subject.ods: 03 Good health and well-being
dc.subject.odspa: 03 Salud y bienestar
dc.title: Performance of single-agent and multi-agent language models in Spanish language medical competency exams
dc.type: article
dc.volumen: 25
sipa.codpersvinculados: 7381
sipa.codpersvinculados: 1007638
sipa.codpersvinculados: 1046132
sipa.codpersvinculados: 1237020
sipa.codpersvinculados: 1039444
sipa.codpersvinculados: 68956
Files
Original bundle
Name: 12909_2025_Article_7250.pdf
Size: 1.32 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.98 KB
Format: Item-specific license agreed upon to submission