Dataset of anonymized discharge summaries of sepsis patients from a Brazilian tertiary hospital for NLP applications
Source: DataONE | Updated: 2025-05-02 | Indexed: 2025-11-01
Download link:
https://search.dataone.org/view/sha256:f80a08067046e2fea596f49865ce26aef9cb71bcc04502f53e0b920f31318dfc
Description:
Background: Publicly available clinical text datasets in Brazilian Portuguese for Natural Language Processing (NLP) research and education are scarce, largely due to challenges in ensuring robust anonymization of sensitive patient data, especially within long clinical notes.

Objective: To address this gap, we created and describe a new dataset of anonymized discharge summaries from sepsis patients treated at a Brazilian tertiary teaching hospital.

Methods: Discharge summaries for adult sepsis patients (identified via ICD-10 codes) were extracted from the hospital's Electronic Health Record (EHR) system. Following manual physician review to ensure text quality and relevance (N=387), the summaries underwent processing including cleaning, abbreviation expansion using a custom dictionary, and a two-stage automated anonymization process (unsupervised GLiNER followed by a supervised custom spaCy NER model). A final manual review ensured confidentiality and excluded summaries unsuitable for NLP educational tasks. Key structured clinical variables (length of stay, ICU admission, palliative care status, number of specialties, outcome) were also extracted and linked to each summary.

Results: The resulting dataset comprises 200 anonymized discharge summaries in Brazilian Portuguese, presented in tabular format (.xlsx file) alongside the linked structured clinical variables, relevant ICD-10 codes, and the abbreviation dictionary. An accompanying Jupyter Notebook details the processing steps.

Conclusion: This dataset provides a valuable and accessible resource of real-world, anonymized Brazilian Portuguese clinical text, suitable for educational purposes and research in NLP. It facilitates training and experimentation with tasks such as text preprocessing, named entity recognition, classification, and topic modeling, and enables the exploration of integrating textual data with structured clinical variables.
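
The two text-processing steps named in the Methods (dictionary-based abbreviation expansion, then replacing detected entities with placeholders) can be illustrated with a minimal, standard-library sketch. This is not the authors' notebook code: the `ABBREVIATIONS` entries, the function names, the placeholder format, and the example spans are all hypothetical. In the described pipeline the entity spans would come from GLiNER and a custom spaCy NER model; here they are simply passed in as `(start, end, label)` tuples.

```python
import re

# Hypothetical entries for illustration; the dataset ships its own dictionary.
ABBREVIATIONS = {
    "UTI": "unidade de terapia intensiva",
    "ATB": "antibiótico",
    "PA": "pressão arterial",
}


def expand_abbreviations(text, mapping):
    """Replace whole-token abbreviations with their expansions."""
    if not mapping:
        return text
    # Try longer keys first so overlapping abbreviations resolve predictably.
    keys = sorted(mapping, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    # Lookarounds keep "PA" from matching inside a longer word such as "PACIENTE".
    pattern = re.compile(r"(?<!\w)(" + alternation + r")(?!\w)")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def mask_entities(text, spans, placeholder_fmt="[{label}]"):
    """Replace (start, end, label) character spans with placeholders.

    Spans are assumed not to overlap. They are applied right to left so
    that the offsets of earlier spans stay valid after each replacement.
    """
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + placeholder_fmt.format(label=label) + text[end:]
    return text
```

For example, `expand_abbreviations("Paciente admitido na UTI com ATB.", ABBREVIATIONS)` expands both abbreviations and leaves "Paciente" untouched, and `mask_entities` would be called afterwards with whatever spans the NER stage returns.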
Created:
2025-10-29



