A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish
dc.article.number | 204 | |
dc.catalogador | yvc | |
dc.contributor.author | Dunstan Escudero, Jocelyn Mariel | |
dc.contributor.author | Vakili, Thomas | |
dc.contributor.author | Miranda Huerta, Luis Alberto | |
dc.contributor.author | Villena, Fabián | |
dc.contributor.author | Aracena, Claudio | |
dc.contributor.author | Quiroga Curin, Tamara Nancy | |
dc.contributor.author | Vera, Paulina | |
dc.contributor.author | Viteri Valenzuela, Sebastián | |
dc.contributor.author | Rocco, Victor | |
dc.date.accessioned | 2024-08-01T23:33:57Z | |
dc.date.available | 2024-08-01T23:33:57Z | |
dc.date.issued | 2024 | |
dc.date.updated | 2024-07-28T00:04:31Z | |
dc.description.abstract | Despite the high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER) trained on referrals from primary care physicians to identify diseases, body parts, and medications in this work-related text. We analyzed the differences between the models and the gold standard curated by a physician in detail. Moreover, we compared the performance of the NER model on the original narratives, in narratives where personal information has been masked, and in texts where the personal data is replaced by another similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus. | |
dc.description.auspiciador | Stockholm University (Open Access Funding) | |
dc.description.funder | ANID Chile Fondo Basal, Centro de Excelencia FB210005 (CMM) | |
dc.description.funder | PhD Visiting Scholarship No. 2023 (TV) | |
dc.description.funder | Millennium Science Initiative Program ICN17_002 (IMFD) | |
dc.description.funder | ANID Fondecyt No. 1241825 (JD) | |
dc.description.funder | ANID National Doctoral Scholarship 21220200 (FV), 21211659 (CA) and 21220586 (TQ) | |
dc.description.funder | DataLEASH (TV) ACHS 304-2023 | |
dc.format.extent | 10 pages | |
dc.fuente.origen | Biomed Central | |
dc.identifier.citation | BMC Medical Informatics and Decision Making. 2024, 24(1):204 | |
dc.identifier.doi | 10.1186/s12911-024-02609-w | |
dc.identifier.uri | https://doi.org/10.1186/s12911-024-02609-w | |
dc.identifier.uri | https://repositorio.uc.cl/handle/11534/87253 | |
dc.identifier.wosid | WOS:001275573100002 | |
dc.information.autoruc | Escuela de Ingeniería; Dunstan Escudero, Jocelyn Mariel; S/I; 1285723 | |
dc.information.autoruc | Escuela de Ingeniería; Miranda Huerta, Luis Alberto; S/I; 66497 | |
dc.information.autoruc | Escuela de Ingeniería; Quiroga Curin, Tamara Nancy; S/I; 1207385 | |
dc.language.iso | en | |
dc.nota.acceso | full content | |
dc.publisher | Springer Nature | |
dc.revista | BMC Medical Informatics and Decision Making | |
dc.rights | open access | |
dc.rights.holder | The Author(s) | |
dc.rights.license | CC BY Attribution 4.0 International | |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
dc.subject | Natural language processing | |
dc.subject | Privacy | |
dc.subject | Named entity recognition | |
dc.subject | Corpus annotation | |
dc.subject.ddc | 610 | |
dc.subject.dewey | Medicina y salud | es_ES |
dc.subject.ods | 03 Good health and well-being | |
dc.subject.odspa | 03 Salud y bienestar | |
dc.title | A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish | |
dc.type | article | |
dc.volumen | 24 | |
sipa.codpersvinculados | 1285723 | |
sipa.codpersvinculados | 66497 | |
sipa.codpersvinculados | 1207385 |