Browsing by Author "Quiroga Curin, Tamara Nancy"
Now showing 1 - 2 of 2
- Item: A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish (Springer Nature, 2024)
  Authors: Dunstan Escudero, Jocelyn Mariel; Vakili, Thomas; Miranda Huerta, Luis Alberto; Villena, Fabián; Aracena, Claudio; Quiroga Curin, Tamara Nancy; Vera, Paulina; Viteri Valenzuela, Sebastián; Rocco, Victor
  Abstract: Despite their high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models, where unwanted memorization can occur. This paper presents a corpus of 1,787 anamneses of work-related accidents and diseases in Spanish, annotated to anonymize personally identifiable information. Additionally, we applied a previously released Named Entity Recognition (NER) model, trained on referrals from primary care physicians, to identify diseases, body parts, and medications in this work-related text. We analyzed in detail the differences between the model's output and a gold standard curated by a physician. Moreover, we compared the performance of the NER model on the original narratives, on narratives where personal information has been masked, and on texts where personal data has been replaced with similar surrogate values (pseudonymization). With this publication, we share the annotation guidelines and the annotated corpus.
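  As a rough illustration of the masking and pseudonymization strategies compared above, the sketch below rewrites annotated PII spans in a toy anamnesis in two ways: replacing each span with a category placeholder (masking) or with a similar surrogate value (pseudonymization). The example text, spans, and surrogate values are invented for illustration and are not taken from the paper's corpus or annotation guidelines.

  ```python
  # Hypothetical clinical narrative with two PII mentions.
  text = "Paciente Juan Pérez, 45 años, sufrió una caída en Faena Norte."

  # (start, end, label) character spans a PII annotator might produce.
  pii_spans = [(9, 19, "NAME"), (50, 61, "LOCATION")]

  # Hypothetical surrogate values of the same PII category.
  surrogates = {"NAME": "Pedro Soto", "LOCATION": "Planta Sur"}

  def rewrite(text, spans, replacement_for):
      """Replace each span right-to-left so earlier offsets stay valid."""
      out = text
      for start, end, label in sorted(spans, reverse=True):
          out = out[:start] + replacement_for(label) + out[end:]
      return out

  masked = rewrite(text, pii_spans, lambda label: f"[{label}]")
  pseudonymized = rewrite(text, pii_spans, surrogates.__getitem__)

  print(masked)         # Paciente [NAME], 45 años, ... en [LOCATION].
  print(pseudonymized)  # Paciente Pedro Soto, 45 años, ... en Planta Sur.
  ```

  Masking preserves only the PII category, while pseudonymization keeps the text fluent and realistic, which is why the paper can meaningfully compare NER performance across the original, masked, and pseudonymized versions.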
- Item: Clinical analogy resolution performance for foundation language models (2024)
  Authors: Villena, Fabián; Quiroga Curin, Tamara Nancy; Dunstan Escudero, Jocelyn Mariel
  Abstract: Using extensive data sources to create foundation language models has revolutionized the performance of deep learning-based architectures. This remarkable improvement has led to state-of-the-art results on various downstream NLP tasks, including clinical tasks. However, more research is needed to measure model performance intrinsically, especially in the clinical domain. We revisit the use of analogy questions as an effective method to measure the intrinsic performance of language models in the clinical domain in English. We tested multiple Transformer-based language models on analogy questions constructed from the Unified Medical Language System (UMLS), a massive knowledge graph of clinical concepts. Our results show that large language models significantly outperform small language models at analogy resolution. Similarly, domain-specific language models perform better than general-domain language models. We also found a correlation between intrinsic and extrinsic performance, validated through the PubMedQA extrinsic task. Creating clinical-specific and language-specific language models is essential for advancing biomedical and clinical NLP and will ensure valid application in clinical practice. Finally, given that our proposed intrinsic test is based on a term graph available in multiple languages, analogous datasets can be built to measure model performance in languages other than English.
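  As one way to picture how an analogy question can be resolved, the sketch below uses the classic vector-offset heuristic: given a : b :: c : ?, pick the candidate whose embedding lies closest to b - a + c. The toy clinical terms and three-dimensional embeddings are hypothetical; the paper builds its questions from UMLS relations and evaluates real Transformer models, not this toy table or necessarily this scoring method.

  ```python
  import numpy as np

  # Toy embedding table standing in for a language model's term representations.
  emb = {
      "insulin":   np.array([0.9, 0.1, 0.0]),
      "diabetes":  np.array([0.8, 0.9, 0.1]),
      "levodopa":  np.array([0.1, 0.2, 0.9]),
      "parkinson": np.array([0.0, 1.0, 1.0]),
      "asthma":    np.array([0.5, 0.5, 0.5]),
  }

  def cosine(u, v):
      return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

  def solve_analogy(a, b, c, candidates):
      """Return the candidate maximizing cosine similarity to b - a + c."""
      target = emb[b] - emb[a] + emb[c]
      return max(candidates, key=lambda d: cosine(emb[d], target))

  # insulin : diabetes :: levodopa : ?  (expected: parkinson,
  # via a hypothetical UMLS "may_treat" relation shared by both pairs)
  print(solve_analogy("insulin", "diabetes", "levodopa",
                      ["parkinson", "asthma"]))
  ```

  Because such questions need only terms and a relation, any UMLS relation available in another language can yield an equivalent test set, which is the multilingual extension the abstract points to.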