jgchaicoski/datasus_sim
Hugging Face. Updated 2026-02-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jgchaicoski/datasus_sim
Resource description:
---
language:
- pt
tags:
- medical
- clinical-nlp
- healthcare
- biology
license: cc-by-sa-4.0
size_categories:
- 100M<n<1B
task_categories:
- text-generation
- token-classification
---
# Dataset Card for DATASUS SIM
This dataset is a large-scale collection of records of deaths registered by the Brazilian public health care system (DATASUS SIM, the Sistema de Informações sobre Mortalidade).
## Dataset Details
See the data dictionary attached to the project for field definitions.
### Dataset Description
- **Curated by:** João Guilherme Toscan Chaicoski
- **Funded by:** None
- **Shared by:** João Guilherme Toscan Chaicoski
- **Language(s) (NLP):** Portuguese (pt-BR / pt-PT)
- **License:** CC-BY-SA-4.0
### Dataset Sources
- **Repository:** https://huggingface.co/datasets/jgchaicoski/datasus_sim
- **Paper:** [Optional: Link to ArXiv or Journal]
- **Demo:** [Optional: Link to Space or Web App]
## Uses
### Direct Use
* **Pre-training:** Training Large Language Models (LLMs) specialized in the medical domain.
* **NER:** Extracting clinical entities (diseases, drugs, procedures).
* **Summarization:** Creating concise summaries of clinical cases or medical papers.
### Out-of-Scope Use
* **Direct Medical Diagnosis:** This dataset should NOT be used to provide medical advice to patients.
* **Individual Identification:** Any attempt to re-identify patients is strictly prohibited.
## Dataset Structure
The dataset is provided in Parquet format for efficient loading.
* `source`: PySUS API (DATASUS data)
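A minimal sketch of how the Parquet files might be read with the Hugging Face `datasets` library. It assumes the repository ID given in this card's citation and a split named `train`; neither the split name nor the column names are stated in this card, so the helper (a hypothetical name, `preview_records`) only returns raw records for inspection. Streaming avoids downloading the full dataset up front.

```python
from itertools import islice

REPO_ID = "jgchaicoski/datasus_sim"  # repository ID taken from this card's citation


def preview_records(n=3, split="train"):
    """Return the first n records without downloading the whole dataset.

    Assumption: the repository exposes a split named "train"; adjust if not.
    """
    # Imported lazily so this module loads even where `datasets` is not installed.
    from datasets import load_dataset

    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `preview_records(1)` and listing the keys of the returned record is a quick way to compare the actual columns against the data dictionary.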
## Dataset Creation
### Curation Rationale
There is a significant gap in high-quality, large-scale medical datasets for the Portuguese language compared to English. This dataset aims to bridge that gap to improve healthcare AI in Brazil, Portugal, and PALOP countries.
### Source Data
#### Data Collection and Processing
Data were aggregated from public medical repositories using the PySUS API.
* **Cleaning:** Fields were cleaned according to the data dictionary.
* **Filtering:** Applied quality filters to ensure medical relevance using keyword matching and domain classifiers.
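The keyword-matching step is not specified further in this card. As an illustration only, a filter of that kind could look like the sketch below. The keyword list and the names `normalize` and `is_medically_relevant` are hypothetical, not the author's actual configuration.

```python
import re
import unicodedata

# Hypothetical keywords for illustration; the real list is not documented here.
KEYWORDS = ("obito", "causa basica", "infarto", "neoplasia")


def normalize(text):
    """Lowercase and strip accents so 'Óbito' matches 'obito'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Word boundaries keep a keyword from matching inside a longer word.
PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + r")\b")


def is_medically_relevant(text):
    """Return True if any keyword occurs in the normalized text."""
    return PATTERN.search(normalize(text)) is not None
```

Normalizing accents before matching matters for Portuguese text, where the same term may appear with or without diacritics depending on the source system.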
#### Who are the source data producers?
The data originates from healthcare professionals, researchers, and medical students across the Portuguese-speaking world.
### Personal and Sensitive Information
**Crucial:** This dataset has undergone extensive de-identification. Names, specific dates, addresses, and ID numbers (CPF/NIF) have been replaced with placeholders or removed in compliance with **LGPD (Lei Geral de Proteção de Dados)**.
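The de-identification procedure itself is not described in this card. As a hedged illustration of placeholder replacement, the sketch below masks strings shaped like a Brazilian CPF (`000.000.000-00` or 11 consecutive digits). The placeholder token `[CPF]` and the function name `mask_cpf` are assumptions made for this example.

```python
import re

# Matches formatted CPFs (123.456.789-09) and bare 11-digit runs.
# Lookarounds prevent matching inside longer digit sequences.
CPF_PATTERN = re.compile(r"(?<!\d)(?:\d{3}\.\d{3}\.\d{3}-\d{2}|\d{11})(?!\d)")


def mask_cpf(text, placeholder="[CPF]"):
    """Replace every CPF-shaped substring with a placeholder token."""
    return CPF_PATTERN.sub(placeholder, text)
```

A pattern of this kind only catches identifiers with a fixed shape; names and addresses would need a different approach, such as dictionary lookup or a trained entity recognizer.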
## Bias, Risks, and Limitations
* **Regional Bias:** The majority of the data may lean toward Brazilian Portuguese (pt-BR).
* **Model Hallucination:** Models trained on this data may still produce medically incorrect information.
### Recommendations
Users should validate model outputs with licensed medical professionals and perform bias checks if deploying in specific regional healthcare systems.
## Citation
**BibTeX:**
```bibtex
@misc{datasus_sim_2026,
author = {João Guilherme Toscan Chaicoski},
title = {DATASUS SIM},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/jgchaicoski/datasus_sim}}
}
```
Provided by:
jgchaicoski



