symptom_to_diagnosis
Source: ModelScope community (updated 2025-12-05, indexed 2025-05-24)
Download link:
https://modelscope.cn/datasets/gretelai/symptom_to_diagnosis
# Dataset Summary
This dataset contains 1065 English-language symptom descriptions, each labeled with one of 22 diagnoses. `Gretel/symptom_to_diagnosis` focuses on fine-grained, single-domain diagnosis.
## Data Fields
Each row contains the following fields:
* `input_text`: a string field containing the symptom description
* `output_text`: a string field containing the diagnosis
Example:
```
{
"output_text": "drug reaction",
"input_text": "I've been having headaches and migraines, and I can't sleep. My whole body shakes and twitches. Sometimes I feel lightheaded."
}
```
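As a minimal sketch of how a row with these two fields might be read, the snippet below parses one JSONL line and checks that both documented fields are present as strings. The helper names `parse_record` and `read_jsonl` are hypothetical, and the local file path passed to `read_jsonl` is an assumption; only the field names and the sample values come from the example above.

```python
import json

# Field names as documented in the "Data Fields" section.
REQUIRED_FIELDS = ("input_text", "output_text")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and verify both documented fields are strings."""
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return record


def read_jsonl(path: str) -> list[dict]:
    """Read every non-empty line of a JSONL file (path is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_record(line) for line in handle if line.strip()]


# In-memory line built from the first sentence of the example record.
sample_line = json.dumps(
    {
        "output_text": "drug reaction",
        "input_text": "I've been having headaches and migraines, and I can't sleep.",
    }
)
record = parse_record(sample_line)
print(record["output_text"])
```

With the split files downloaded locally, `read_jsonl("train.jsonl")` would return one dictionary per row.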
## Diagnoses
The table below lists the number of examples per diagnosis in the train and test splits.
| | Diagnosis | train.jsonl | test.jsonl |
|---:|:--------------------------------|--------------:|-------------:|
| 0 | drug reaction | 40 | 8 |
| 1 | allergy | 40 | 10 |
| 2 | chicken pox | 40 | 10 |
| 3 | diabetes | 40 | 10 |
| 4 | psoriasis | 40 | 10 |
| 5 | hypertension | 40 | 10 |
| 6 | cervical spondylosis | 40 | 10 |
| 7 | bronchial asthma | 40 | 10 |
| 8 | varicose veins | 40 | 10 |
| 9 | malaria | 40 | 10 |
| 10 | dengue | 40 | 10 |
| 11 | arthritis | 40 | 10 |
| 12 | impetigo | 40 | 10 |
| 13 | fungal infection | 39 | 9 |
| 14 | common cold | 39 | 10 |
| 15 | gastroesophageal reflux disease | 39 | 10 |
| 16 | urinary tract infection | 39 | 9 |
| 17 | typhoid | 38 | 9 |
| 18 | pneumonia | 37 | 10 |
| 19 | peptic ulcer disease | 37 | 10 |
| 20 | jaundice | 33 | 7 |
| 21 | migraine | 32 | 10 |
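Per-diagnosis counts like those in the table can be recomputed by tallying the `output_text` field. The following is a rough sketch on a few made-up records; the function name `diagnosis_counts` and the toy symptom texts are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def diagnosis_counts(records):
    """Tally how many records carry each diagnosis label."""
    return Counter(record["output_text"] for record in records)


# Hypothetical toy records, used only to illustrate the tally.
toy_records = [
    {"input_text": "hypothetical symptom text A", "output_text": "migraine"},
    {"input_text": "hypothetical symptom text B", "output_text": "allergy"},
    {"input_text": "hypothetical symptom text C", "output_text": "migraine"},
]

counts = diagnosis_counts(toy_records)
for diagnosis, n in counts.most_common():
    print(f"{diagnosis}: {n}")
```

Running the same function over the rows of each split file should give one column of the table per file.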
## Data Splits
The data is split into 80% train (853 examples, 167 KB) and 20% test (212 examples, 42 KB).
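The stated totals can be cross-checked against the Diagnoses table. The sketch below copies the per-diagnosis counts from that table, sums each column, and computes the resulting split proportions; the variable names are illustrative only.

```python
# (train, test) counts copied from the Diagnoses table.
COUNTS = {
    "drug reaction": (40, 8),
    "allergy": (40, 10),
    "chicken pox": (40, 10),
    "diabetes": (40, 10),
    "psoriasis": (40, 10),
    "hypertension": (40, 10),
    "cervical spondylosis": (40, 10),
    "bronchial asthma": (40, 10),
    "varicose veins": (40, 10),
    "malaria": (40, 10),
    "dengue": (40, 10),
    "arthritis": (40, 10),
    "impetigo": (40, 10),
    "fungal infection": (39, 9),
    "common cold": (39, 10),
    "gastroesophageal reflux disease": (39, 10),
    "urinary tract infection": (39, 9),
    "typhoid": (38, 9),
    "pneumonia": (37, 10),
    "peptic ulcer disease": (37, 10),
    "jaundice": (33, 7),
    "migraine": (32, 10),
}

train_total = sum(train for train, _ in COUNTS.values())
test_total = sum(test for _, test in COUNTS.values())
total = train_total + test_total
train_fraction = train_total / total

print(train_total, test_total, total)  # 853 212 1065
print(f"train: {train_fraction:.1%}, test: {1 - train_fraction:.1%}")  # train: 80.1%, test: 19.9%
```

The column sums match the 853 and 212 examples quoted for the splits, and the 80/20 figure is a rounding of roughly 80.1% and 19.9%.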
## Dataset Creation
The source data was filtered to remove unwanted categories, then rewritten using an LLM so that the language is more consistent with how a patient would describe symptoms to a doctor.
## Source Data
This dataset was adapted from the [Symptom2Disease](https://www.kaggle.com/datasets/niyarrbarman/symptom2disease) dataset on Kaggle.
## Personal and Sensitive Information
The symptoms in this dataset were modified from their original format using an LLM and do not contain personal data.
## Limitations
This dataset is licensed under Apache 2.0 and is free to use.
Provider: maas
Created: 2025-05-20



