CesarCEOAI/PubMedQA
Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/CesarCEOAI/PubMedQA
Resource description:
---
annotations_creators:
- expert-generated
- machine-generated
language_creators:
- expert-generated
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 10K<n<100K
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- multiple-choice-qa
paperswithcode_id: pubmedqa
pretty_name: PubMedQA
config_names:
- pqa_artificial
- pqa_labeled
- pqa_unlabeled
dataset_info:
- config_name: pqa_artificial
  features:
  - name: pubid
    dtype: int32
  - name: question
    dtype: string
  - name: context
    sequence:
    - name: contexts
      dtype: string
    - name: labels
      dtype: string
    - name: meshes
      dtype: string
  - name: long_answer
    dtype: string
  - name: final_decision
    dtype: string
  splits:
  - name: train
    num_bytes: 443501057
    num_examples: 211269
  download_size: 233411194
  dataset_size: 443501057
- config_name: pqa_labeled
  features:
  - name: pubid
    dtype: int32
  - name: question
    dtype: string
  - name: context
    sequence:
    - name: contexts
      dtype: string
    - name: labels
      dtype: string
    - name: meshes
      dtype: string
  - name: reasoning_required_pred
    dtype: string
  - name: reasoning_free_pred
    dtype: string
  - name: long_answer
    dtype: string
  - name: final_decision
    dtype: string
  splits:
  - name: train
    num_bytes: 2088898
    num_examples: 1000
  download_size: 1075513
  dataset_size: 2088898
- config_name: pqa_unlabeled
  features:
  - name: pubid
    dtype: int32
  - name: question
    dtype: string
  - name: context
    sequence:
    - name: contexts
      dtype: string
    - name: labels
      dtype: string
    - name: meshes
      dtype: string
  - name: long_answer
    dtype: string
  splits:
  - name: train
    num_bytes: 125922964
    num_examples: 61249
  download_size: 66010017
  dataset_size: 125922964
configs:
- config_name: pqa_artificial
  data_files:
  - split: train
    path: pqa_artificial/train-*
- config_name: pqa_labeled
  data_files:
  - split: train
    path: pqa_labeled/train-*
- config_name: pqa_unlabeled
  data_files:
  - split: train
    path: pqa_unlabeled/train-*
---
# Dataset Card for PubMedQA
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [PubMedQA homepage](https://pubmedqa.github.io/)
- **Repository:** [PubMedQA repository](https://github.com/pubmedqa/pubmedqa)
- **Paper:** [PubMedQA: A Dataset for Biomedical Research Question Answering](https://arxiv.org/abs/1909.06146)
- **Leaderboard:** [PubMedQA: Leaderboard](https://pubmedqa.github.io/)
### Dataset Summary
The task in PubMedQA is to answer biomedical research questions with yes, no, or maybe, using the corresponding abstracts. An example question: "Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?"
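As a rough illustration of the task, the sketch below turns one record into a text prompt and maps a free-text reply back onto the three answer labels. It assumes that `context` arrives as a dictionary of parallel lists (`contexts`, `labels`, `meshes`), which is how the `datasets` library usually exposes a sequence of named sub-features. The helper names `build_prompt` and `parse_decision` and the prompt wording are hypothetical and not part of PubMedQA.

```python
from typing import Optional

# The three answer labels named in the dataset summary.
LABELS = ("yes", "no", "maybe")


def build_prompt(record: dict) -> str:
    """Join the abstract sections and the question into one prompt string.

    Assumes record["context"] holds parallel lists "labels" and "contexts".
    """
    ctx = record["context"]
    sections = [f"{label}: {text}" for label, text in zip(ctx["labels"], ctx["contexts"])]
    abstract = "\n".join(sections)
    return (
        f"Abstract:\n{abstract}\n\n"
        f"Question: {record['question']}\n"
        "Answer (yes/no/maybe):"
    )


def parse_decision(reply: str) -> Optional[str]:
    """Return the label that opens the reply, or None if it opens with anything else."""
    words = reply.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    return first if first in LABELS else None


if __name__ == "__main__":
    # Placeholder record: the section texts are stand-ins, not real abstract content.
    example = {
        "question": "Do preoperative statins reduce atrial fibrillation "
                    "after coronary artery bypass grafting?",
        "context": {
            "contexts": ["<background text>", "<results text>"],
            "labels": ["BACKGROUND", "RESULTS"],
            "meshes": [],
        },
    }
    print(build_prompt(example))
    print(parse_decision("Maybe, the evidence is mixed."))
```

Checking only the first word is a deliberately strict choice: a reply that hedges before committing is returned as `None` rather than guessed at.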
### Supported Tasks and Leaderboards
The official leaderboard is available at https://pubmedqa.github.io/.
Of the 1,000 questions in the `pqa_labeled` configuration, 500 are used as the test set. The test questions can be found at https://github.com/pubmedqa/pubmedqa.
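Because `pqa_labeled` ships here as a single `train` split, the test questions have to be separated out by the reader. The sketch below partitions records by `pubid` and scores predictions with plain accuracy. It assumes you have already obtained the test question IDs from the repository; how they are stored there is not described in this card, so `test_pubids` is a hypothetical input. Accuracy is shown as a generic scoring choice and is not claimed to be the leaderboard's metric.

```python
from typing import Dict, Iterable, List, Tuple


def split_labeled(
    records: List[Dict], test_pubids: Iterable[int]
) -> Tuple[List[Dict], List[Dict]]:
    """Partition pqa_labeled records into (non_test, test) by their pubid."""
    test_ids = set(test_pubids)
    test = [r for r in records if r["pubid"] in test_ids]
    non_test = [r for r in records if r["pubid"] not in test_ids]
    return non_test, test


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


if __name__ == "__main__":
    # Toy records with made-up IDs, only to show the call pattern.
    toy = [{"pubid": i, "final_decision": "yes"} for i in range(4)]
    non_test, test = split_labeled(toy, test_pubids=[1, 3])
    print(len(non_test), len(test))
    print(accuracy(["yes", "no"], ["yes", "maybe"]))
```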
### Languages
English
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
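According to the card metadata, all three configurations share these fields:
- `pubid` (`int32`)
- `question` (`string`)
- `context` (sequence) with string subfields `contexts`, `labels`, and `meshes`
- `long_answer` (`string`)

Some fields appear only in certain configurations:
- `final_decision` (`string`): `pqa_artificial` and `pqa_labeled`
- `reasoning_required_pred` and `reasoning_free_pred` (`string`): `pqa_labeled` only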
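The feature lists in the metadata header of this card can be mirrored in code to sanity-check records before use. The sketch below is one minimal way to do that. The field names come from the metadata, while `EXPECTED_FIELDS`, `CONTEXT_SUBFIELDS`, and `missing_fields` are hypothetical helper names, and records are assumed to be plain dictionaries.

```python
from typing import Dict, List, Set

# Top-level fields per configuration, copied from the card metadata.
_COMMON = {"pubid", "question", "context", "long_answer"}
EXPECTED_FIELDS: Dict[str, Set[str]] = {
    "pqa_artificial": _COMMON | {"final_decision"},
    "pqa_labeled": _COMMON
    | {"final_decision", "reasoning_required_pred", "reasoning_free_pred"},
    "pqa_unlabeled": set(_COMMON),
}

# Subfields of the `context` sequence, also from the card metadata.
CONTEXT_SUBFIELDS = {"contexts", "labels", "meshes"}


def missing_fields(record: dict, config_name: str) -> List[str]:
    """List the expected fields that are absent from a record, sorted by name."""
    missing = sorted(EXPECTED_FIELDS[config_name] - set(record))
    ctx = record.get("context") or {}
    missing += sorted(f"context.{name}" for name in CONTEXT_SUBFIELDS - set(ctx))
    return missing


if __name__ == "__main__":
    # Placeholder record with a made-up ID, missing long_answer and context.meshes.
    partial = {
        "pubid": 0,
        "question": "<question text>",
        "context": {"contexts": [], "labels": []},
    }
    print(missing_fields(partial, "pqa_unlabeled"))
```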
### Data Splits
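According to the card metadata, each configuration has a single `train` split:
- `pqa_artificial`: 211,269 examples
- `pqa_labeled`: 1,000 examples
- `pqa_unlabeled`: 61,249 examples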
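The configuration names and example counts in the metadata header suggest a loading pattern like the sketch below. It assumes the `datasets` library is installed and that the repository is reachable on the Hugging Face Hub under the ID `CesarCEOAI/PubMedQA` (taken from this page's title). `CONFIG_SIZES` and `load_config` are hypothetical helper names.

```python
# Example counts per configuration, copied from the card metadata.
CONFIG_SIZES = {
    "pqa_artificial": 211269,
    "pqa_labeled": 1000,
    "pqa_unlabeled": 61249,
}


def load_config(name: str, repo_id: str = "CesarCEOAI/PubMedQA"):
    """Load the single `train` split of one configuration.

    The repo_id default is an assumption based on this page's title.
    """
    if name not in CONFIG_SIZES:
        raise ValueError(f"unknown config {name!r}; expected one of {sorted(CONFIG_SIZES)}")
    # Imported lazily so the name check above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name, split="train")


if __name__ == "__main__":
    # Needs network access; the size check compares against the card metadata.
    labeled = load_config("pqa_labeled")
    print(len(labeled) == CONFIG_SIZES["pqa_labeled"])
```

Validating the name before importing `datasets` keeps a typo from triggering a network request.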
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The card metadata lists the license as MIT.
### Citation Information
[More Information Needed]
### Contributions
Thanks to [@tuner007](https://github.com/tuner007) for adding this dataset.
Provider: CesarCEOAI



