
mathewhe/medqa

Hugging Face · Updated 2025-11-11 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/mathewhe/medqa
Description:
---
language:
- en
- zh
- tw
license: mit
tags:
- text
- question-and-answer
pretty_name: MedQA
task_categories:
- question-answering
configs:
- config_name: en
  data_files:
  - split: train
    path: data/questions/en/train.jsonl
  - split: dev
    path: data/questions/en/dev.jsonl
  - split: test
    path: data/questions/en/test.jsonl
  - split: all_splits
    path: data/questions/en/all_splits.jsonl
- config_name: tw
  data_files:
  - split: train
    path: data/questions/tw/train.jsonl
  - split: dev
    path: data/questions/tw/dev.jsonl
  - split: test
    path: data/questions/tw/test.jsonl
  - split: all_splits
    path: data/questions/tw/all_splits.jsonl
- config_name: zh
  data_files:
  - split: train
    path: data/questions/zh/train.jsonl
  - split: dev
    path: data/questions/zh/dev.jsonl
  - split: test
    path: data/questions/zh/test.jsonl
  - split: all_splits
    path: data/questions/zh/all_splits.jsonl
- config_name: xlang
  data_files:
  - split: train
    path: data/questions/xlang/train.jsonl
  - split: dev
    path: data/questions/xlang/dev.jsonl
  - split: test
    path: data/questions/xlang/test.jsonl
  - split: all_splits
    path: data/questions/xlang/all_splits.jsonl
- config_name: en_5
  data_files:
  - split: train
    path: data/questions/en_5/train.jsonl
  - split: dev
    path: data/questions/en_5/dev.jsonl
  - split: test
    path: data/questions/en_5/test.jsonl
  - split: all_splits
    path: data/questions/en_5/all_splits.jsonl
- config_name: zh_5
  data_files:
  - split: train
    path: data/questions/zh_5/train.jsonl
  - split: dev
    path: data/questions/zh_5/dev.jsonl
  - split: test
    path: data/questions/zh_5/test.jsonl
  - split: all_splits
    path: data/questions/zh_5/all_splits.jsonl
---

# Dataset Card for MedQA

- **Homepage:** [https://github.com/jind11/MedQA](https://github.com/jind11/MedQA)
  - This is an unofficial curation of the MedQA dataset, uploaded here with minimal (i.e., no content-modifying) processing.
- **Paper:** [*What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams*](https://www.mdpi.com/2076-3417/11/14/6421) (MDPI)
- **Languages:** English (en), Traditional Chinese (tw), and Simplified Chinese (zh).

## Dataset Subsets

This dataset contains multiple configs:

- QA with four possible answers (as reported in the paper)
  - `en`: English instances
  - `tw`: Traditional Chinese instances
  - `zh`: Simplified Chinese instances
  - `xlang`: instances in any language
- QA with five possible answers (the original datasets for English and Chinese)
  - `en_5`
  - `zh_5`

Data can be loaded by specifying the config and data split:

```python
from datasets import load_dataset

data = load_dataset("mathewhe/medqa", "en", split="train")
```

Possible splits are "train", "dev", and "test". The config metadata also defines an "all_splits" split for each config.

## Dataset Structure

Each data subset contains the following columns:

```
question (string): The question/prompt.
answer: The correct response.
answer_idx: The multiple-choice identifier for the correct response.
A: The "A" answer.
B: The "B" answer.
C: The "C" answer.
D: The "D" answer.
E (in `en_5` or `zh_5` subsets): The "E" answer.
language: "en", "tw", or "zh".
```

The example also shows a `meta_info` column (here `step2&3`) that the column list above does not describe.

Example from en-train:

| question                | answer         | meta_info | answer_idx | A          | B           | C           | D              | language |
|-------------------------|----------------|-----------|------------|------------|-------------|-------------|----------------|----------|
| A 23-year-old pregna... | Nitrofurantoin | step2&3   | D          | Ampicillin | Ceftriaxone | Doxycycline | Nitrofurantoin | en       |

## Citation Information

For reproducibility, please include a link to *this* dataset when publishing results based on the included data.

For formal citations, please cite the *original* publication:

```bibtex
@article{jin2020disease,
  title={What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams},
  author={Jin, Di and Pan, Eileen and Oufattole, Nassim and Weng, Wei-Hung and Fang, Hanyi and Szolovits, Peter},
  journal={arXiv preprint arXiv:2009.13081},
  year={2020}
}
```
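
As a rough illustration of the column layout listed under Dataset Structure, the sketch below turns one record into a plain-text multiple-choice prompt and checks that `answer` matches the option named by `answer_idx`. The helper names (`option_letters`, `format_prompt`, `answer_is_consistent`) and the prompt layout are hypothetical and not part of the dataset or of the `datasets` library. The sample record is the en-train example row from the card, with the question text left truncated as it appears there. The assumption that `answer` is exactly the text of the option named by `answer_idx` holds for that row but is not stated by the card for the dataset as a whole.

```python
from typing import List, Mapping, Optional

# Option columns listed in the card; "E" is only present in `en_5` / `zh_5`.
OPTION_COLUMNS = ("A", "B", "C", "D", "E")


def option_letters(record: Mapping[str, Optional[str]]) -> List[str]:
    """Return the option letters that actually carry a value in this record."""
    return [letter for letter in OPTION_COLUMNS if record.get(letter) is not None]


def format_prompt(record: Mapping[str, Optional[str]]) -> str:
    """Render a record as a multiple-choice prompt (hypothetical layout)."""
    lines = [str(record["question"])]
    lines += [f"{letter}. {record[letter]}" for letter in option_letters(record)]
    return "\n".join(lines)


def answer_is_consistent(record: Mapping[str, Optional[str]]) -> bool:
    """Check that `answer` equals the option text selected by `answer_idx`.

    Assumption: `answer` repeats the chosen option verbatim, as in the
    example row shown in the card.
    """
    return record.get(str(record["answer_idx"])) == record["answer"]


# Example row from the card (en-train); the question is truncated there too.
example = {
    "question": "A 23-year-old pregna...",
    "answer": "Nitrofurantoin",
    "meta_info": "step2&3",
    "answer_idx": "D",
    "A": "Ampicillin",
    "B": "Ceftriaxone",
    "C": "Doxycycline",
    "D": "Nitrofurantoin",
    "language": "en",
}

print(format_prompt(example))
print(answer_is_consistent(example))
```

Rows returned by `load_dataset("mathewhe/medqa", "en", split="train")` are dictionary-like, so they should be usable with these helpers in the same way, provided the columns match the list in the card.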

Provider:
mathewhe