A methodology for developing dermatological datasets: lessons from retrospective data collection for AI-based applications

Abstract
Purpose: The integration of artificial intelligence into dermatological research has underscored the need for robust and well-structured dermatological datasets. However, these datasets vary widely in their development processes, and there is currently no standard methodology for creating them. This work identifies three pressing needs in building dermatological datasets focused on skin tumor classification: the need for multimodal datasets, the definition of minimum metadata requirements, and the inclusion of underrepresented populations to address the scarcity of health data.
Methods: We propose a practical methodology for creating dermatological datasets from clinical records, incorporating both images and patient metadata. The process consists of four key stages: obtaining institutional review board approval and analyzing clinical information sources, data recording and structuring, processing of clinical data and images, and quality assessment. This methodology was derived from hands-on experience in building two datasets, one from a Chilean population and one from a Mexican population.
Results: The methodology allows the creation of well-structured datasets by simplifying data organization and enabling replication. Each step includes practical guidance for dealing with typical challenges, such as image metadata categorization and technical validation by dermatologists and computer scientists.
Conclusion: Our contribution offers a reproducible, scalable, and interdisciplinary framework for creating dermatological datasets that is especially useful for countries initiating dataset creation. In addition to the methodological proposal, we highlight common pitfalls and offer recommendations to mitigate them.
Keywords
Dataset methodology, Skin cancer, Clinical metadata, Dermatology
Citation
BMC Medical Research Methodology. 2025 Nov 05;25(1):251