mathewhe/medqa
Source: Hugging Face. Updated 2025-11-11; indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/mathewhe/medqa
Resource description:
---
language:
- en
- zh
- tw
license: mit
tags:
- text
- question-and-answer
pretty_name: MedQA
task_categories:
- question-answering
configs:
- config_name: en
  data_files:
  - split: train
    path: data/questions/en/train.jsonl
  - split: dev
    path: data/questions/en/dev.jsonl
  - split: test
    path: data/questions/en/test.jsonl
  - split: all_splits
    path: data/questions/en/all_splits.jsonl
- config_name: tw
  data_files:
  - split: train
    path: data/questions/tw/train.jsonl
  - split: dev
    path: data/questions/tw/dev.jsonl
  - split: test
    path: data/questions/tw/test.jsonl
  - split: all_splits
    path: data/questions/tw/all_splits.jsonl
- config_name: zh
  data_files:
  - split: train
    path: data/questions/zh/train.jsonl
  - split: dev
    path: data/questions/zh/dev.jsonl
  - split: test
    path: data/questions/zh/test.jsonl
  - split: all_splits
    path: data/questions/zh/all_splits.jsonl
- config_name: xlang
  data_files:
  - split: train
    path: data/questions/xlang/train.jsonl
  - split: dev
    path: data/questions/xlang/dev.jsonl
  - split: test
    path: data/questions/xlang/test.jsonl
  - split: all_splits
    path: data/questions/xlang/all_splits.jsonl
- config_name: en_5
  data_files:
  - split: train
    path: data/questions/en_5/train.jsonl
  - split: dev
    path: data/questions/en_5/dev.jsonl
  - split: test
    path: data/questions/en_5/test.jsonl
  - split: all_splits
    path: data/questions/en_5/all_splits.jsonl
- config_name: zh_5
  data_files:
  - split: train
    path: data/questions/zh_5/train.jsonl
  - split: dev
    path: data/questions/zh_5/dev.jsonl
  - split: test
    path: data/questions/zh_5/test.jsonl
  - split: all_splits
    path: data/questions/zh_5/all_splits.jsonl
---
# Dataset Card for MedQA
- **Homepage:** [https://github.com/jind11/MedQA](https://github.com/jind11/MedQA)
  - This is an unofficial curation of the MedQA dataset, uploaded here with minimal (i.e., no content-modifying) processing.
- **Paper:** [*What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams*](https://www.mdpi.com/2076-3417/11/14/6421) (MDPI)
- **Languages:** English (en), Traditional Chinese (tw), and Simplified Chinese (zh).
## Dataset Subsets
This dataset contains multiple configs:
- QA with four possible answers (as reported in the paper)
  - `en`: English instances
  - `tw`: Traditional Chinese instances
  - `zh`: Simplified Chinese instances
  - `xlang`: instances in any of the three languages
- QA with five possible answers (the original datasets for English and Chinese)
  - `en_5`
  - `zh_5`
Data can be loaded by specifying the config and data split:
```
from datasets import load_dataset
data = load_dataset("mathewhe/medqa", "en", split="train")
```
Possible splits are "train", "dev", and "test". The config metadata also defines an "all_splits" split for each config.
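Since every config follows the same file layout in the metadata (`data/questions/<config>/<split>.jsonl`), the config and split names can be validated before any download is attempted. The following is a minimal sketch, not part of the original card: `data_file_path` and `load_split` are hypothetical helper names, and `load_split` assumes the `datasets` library is installed and the Hub is reachable.

```python
from itertools import product

REPO_ID = "mathewhe/medqa"
# Config and split names as listed in the card's metadata.
CONFIGS = ("en", "tw", "zh", "xlang", "en_5", "zh_5")
SPLITS = ("train", "dev", "test", "all_splits")


def data_file_path(config: str, split: str) -> str:
    """Return the repository-relative JSONL path for a config/split pair.

    Hypothetical helper: it only mirrors the path pattern in the metadata.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"data/questions/{config}/{split}.jsonl"


def load_split(config: str, split: str):
    """Load one split from the Hub (requires `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily; optional dependency

    data_file_path(config, split)  # raises ValueError on an unknown name
    return load_dataset(REPO_ID, config, split=split)


# Every data file described by the metadata: 6 configs x 4 splits.
ALL_FILES = [data_file_path(c, s) for c, s in product(CONFIGS, SPLITS)]
```

For example, `load_split("en_5", "dev")` would request the five-option English dev split, while a misspelled config name fails locally before any network call is made.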
## Dataset Structure
Each data subset will contain the following columns:
```
question (string): The question/prompt.
answer: The text of the correct response.
meta_info: Source metadata, as seen in the example below (e.g., "step2&3").
answer_idx: The multiple-choice identifier (e.g., "D") for the correct response.
A: The "A" answer.
B: The "B" answer.
C: The "C" answer.
D: The "D" answer.
E (in `en_5` or `zh_5` subsets): The "E" answer.
language: "en", "tw", or "zh".
```
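The column list above can be turned into a small schema check. This is a sketch under the assumption that a row is available as a plain `dict` keyed by column name; `expected_columns` and `missing_columns` are hypothetical helpers, and columns not named in the list above are simply ignored.

```python
# Columns named in the card for the four-option subsets.
BASE_COLUMNS = ["question", "answer", "answer_idx", "A", "B", "C", "D", "language"]


def expected_columns(config: str) -> list[str]:
    """Columns the card lists for a config; five-option configs add "E"."""
    columns = list(BASE_COLUMNS)
    if config.endswith("_5"):  # en_5 and zh_5
        columns.insert(columns.index("D") + 1, "E")
    return columns


def missing_columns(row: dict, config: str) -> list[str]:
    """Return the expected columns that are absent from a row."""
    return [name for name in expected_columns(config) if name not in row]
```

An empty list from `missing_columns(row, "en")` means the row carries every column the card describes for that config.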
Example from en-train:
| question | answer | meta_info | answer_idx | A | B | C | D | language |
|-------------------------|----------------|-----------|------------|------------|-------------|-------------|----------------|----------|
| A 23-year-old pregna... | Nitrofurantoin | step2&3 | D | Ampicillin | Ceftriaxone | Doxycycline | Nitrofurantoin | en |
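
A row like the one above can be rendered as a multiple-choice prompt and checked for internal consistency. The sketch below copies the example row as shown (the question text is left truncated, as in the table) and assumes that `answer` repeats the text of the option named by `answer_idx`, which holds for this example but is not stated by the card for every row. `options`, `format_prompt`, and `answer_is_consistent` are hypothetical helper names.

```python
OPTION_LETTERS = ("A", "B", "C", "D", "E")

# The en-train example from the table; the question is truncated in the source.
example = {
    "question": "A 23-year-old pregna...",
    "answer": "Nitrofurantoin",
    "meta_info": "step2&3",
    "answer_idx": "D",
    "A": "Ampicillin",
    "B": "Ceftriaxone",
    "C": "Doxycycline",
    "D": "Nitrofurantoin",
    "language": "en",
}


def options(row: dict) -> dict:
    """Collect the answer options present in a row, in letter order."""
    return {letter: row[letter] for letter in OPTION_LETTERS if letter in row}


def format_prompt(row: dict) -> str:
    """Render the question followed by one 'X. option' line per choice."""
    lines = [row["question"]]
    lines += [f"{letter}. {text}" for letter, text in options(row).items()]
    return "\n".join(lines)


def answer_is_consistent(row: dict) -> bool:
    """Check that `answer` equals the option text selected by `answer_idx`."""
    return row.get(row["answer_idx"]) == row["answer"]
```

Because `options` only picks up letters that exist in the row, the same helpers apply to the five-option `en_5` and `zh_5` subsets without changes.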
## Citation Information
For reproducibility, please include a link to *this* dataset when publishing results based on the included data.
For formal citations, please cite the *original* publication:
```
@article{jin2020disease,
title={What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams},
author={Jin, Di and Pan, Eileen and Oufattole, Nassim and Weng, Wei-Hung and Fang, Hanyi and Szolovits, Peter},
journal={arXiv preprint arXiv:2009.13081},
year={2020}
}
```
Provider: mathewhe