
jgchaicoski/datasus_sim

Source: Hugging Face. Updated 2026-02-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jgchaicoski/datasus_sim
Resource description:
---
language:
- pt
tags:
- medical
- clinical-nlp
- healthcare
- biology
license: cc-by-sa-4.0
size_categories:
- 100M<n<1B
task_categories:
- text-generation
- token-classification
---

# Dataset Card for [DATASUS SIM]

This dataset is a large-scale collection of information around deaths registered by the Brazilian public health care system.

## Dataset Details

Check the Data Dictionary attached to the project.

### Dataset Description

- **Curated by:** [João Guilherme Toscan Chaicoski]
- **Funded by:** [No one]
- **Shared by:** [João Guilherme Toscan Chaicoski]
- **Language(s) (NLP):** Portuguese (pt-BR / pt-PT)
- **License:** CC-BY-SA-4.0

### Dataset Sources

- **Repository:** [Link to HF Repo or GitHub]
- **Paper:** [Optional: Link to ArXiv or Journal]
- **Demo:** [Optional: Link to Space or Web App]

## Uses

### Direct Use

* **Pre-training:** Training Large Language Models (LLMs) specialized in the medical domain.
* **NER:** Extracting clinical entities (diseases, drugs, procedures).
* **Summarization:** Creating concise summaries of clinical cases or medical papers.

### Out-of-Scope Use

* **Direct Medical Diagnosis:** This dataset should NOT be used to provide medical advice to patients.
* **Individual Identification:** Any attempt to re-identify patients is strictly prohibited.

## Dataset Structure

The dataset is provided in Parquet format for optimized loading.

* `source`: PYSUS API (DATASUS data)

## Dataset Creation

### Curation Rationale

There is a significant gap in high-quality, large-scale medical datasets for the Portuguese language compared to English. This dataset aims to bridge that gap to improve healthcare AI in Brazil, Portugal, and PALOP countries.

### Source Data

#### Data Collection and Processing

Data was aggregated from public medical repositories using PYSUS API.

* **Cleaning:** Used the data dictionary
* **Filtering:** Applied quality filters to ensure medical relevance using keyword matching and domain classifiers.
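
The card does not publish its filtering code, so the following is only a minimal sketch of what a keyword-matching relevance filter could look like. The keyword list, the function names (`normalize`, `is_medically_relevant`, `filter_records`) and the accent-stripping step are all assumptions made for illustration; the domain-classifier stage mentioned above is not covered.

```python
import re
import unicodedata

# Hypothetical keyword list, for illustration only.
# The dataset card does not state which keywords were actually used.
MEDICAL_KEYWORDS = ("obito", "causa basica", "infarto", "neoplasia")


def normalize(text: str) -> str:
    """Lowercase and strip accents so that 'Óbito' matches 'obito'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Match keywords only at word boundaries to avoid partial-word hits.
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in MEDICAL_KEYWORDS) + r")\b"
)


def is_medically_relevant(text: str) -> bool:
    """Return True if the normalized text contains at least one keyword."""
    return KEYWORD_PATTERN.search(normalize(text)) is not None


def filter_records(records):
    """Keep only the text records that pass the keyword check."""
    return [r for r in records if is_medically_relevant(r)]
```

Normalizing accents before matching is a design choice that suits Portuguese text, where the same term may appear with or without diacritics depending on the source system.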
#### Who are the source data producers?

The data originates from healthcare professionals, researchers, and medical students across the Portuguese-speaking world.

### Personal and Sensitive Information

**Crucial:** This dataset has undergone extensive de-identification. Names, specific dates, addresses, and ID numbers (CPF/NIF) have been replaced with placeholders or removed in compliance with **LGPD (Lei Geral de Proteção de Dados)**.

## Bias, Risks, and Limitations

* **Regional Bias:** The majority of the data may lean toward Brazilian Portuguese (pt-BR).
* **Model Hallucination:** Models trained on this data may still produce medically incorrect info.

### Recommendations

Users should validate model outputs with licensed medical professionals and perform bias checks if deploying in specific regional healthcare systems.

## Citation

**BibTeX:**

```bibtex
@misc{datasus_sim_2026,
  author = {João Guilherme Toscan Chaicoski},
  title = {DATASUS SIM},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/jgchaicoski/datasus_sim}}
}
```
Provided by:
jgchaicoski