Jackkoo/hotpot_qa
Hugging Face | Updated 2026-03-30 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Jackkoo/hotpot_qa
Resource description:
---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: HotpotQA
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- question-answering
task_ids: []
paperswithcode_id: hotpotqa
tags:
- multi-hop
dataset_info:
- config_name: distractor
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: level
    dtype: string
  - name: supporting_facts
    sequence:
    - name: title
      dtype: string
    - name: sent_id
      dtype: int32
  - name: context
    sequence:
    - name: title
      dtype: string
    - name: sentences
      sequence: string
  splits:
  - name: train
    num_bytes: 552948795
    num_examples: 90447
  - name: validation
    num_bytes: 45716059
    num_examples: 7405
  download_size: 359239231
  dataset_size: 598664854
- config_name: fullwiki
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: level
    dtype: string
  - name: supporting_facts
    sequence:
    - name: title
      dtype: string
    - name: sent_id
      dtype: int32
  - name: context
    sequence:
    - name: title
      dtype: string
    - name: sentences
      sequence: string
  splits:
  - name: train
    num_bytes: 552948795
    num_examples: 90447
  - name: validation
    num_bytes: 46848549
    num_examples: 7405
  - name: test
    num_bytes: 45999922
    num_examples: 7405
  download_size: 387387120
  dataset_size: 645797266
configs:
- config_name: distractor
  data_files:
  - split: train
    path: distractor/train-*
  - split: validation
    path: distractor/validation-*
- config_name: fullwiki
  data_files:
  - split: train
    path: fullwiki/train-*
  - split: validation
    path: fullwiki/validation-*
  - split: test
    path: fullwiki/test-*
---
# Dataset Card for "hotpot_qa"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://hotpotqa.github.io/](https://hotpotqa.github.io/)
- **Repository:** https://github.com/hotpotqa/hotpot
- **Paper:** [HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering](https://arxiv.org/abs/1809.09600)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 1.27 GB
- **Size of the generated dataset:** 1.24 GB
- **Total amount of disk used:** 2.52 GB
### Dataset Summary
HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### distractor
- **Size of downloaded dataset files:** 612.75 MB
- **Size of the generated dataset:** 598.66 MB
- **Total amount of disk used:** 1.21 GB
An example of 'validation' looks as follows.
```
{
    "answer": "This is the answer",
    "context": {
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
        "title": ["Title1", "Title 2"]
    },
    "id": "000001",
    "level": "medium",
    "question": "What is the answer?",
    "supporting_facts": {
        "sent_id": [0, 1, 3],
        "title": ["Title of para 1", "Title of para 2", "Title of para 3"]
    },
    "type": "comparison"
}
```
#### fullwiki
- **Size of downloaded dataset files:** 660.10 MB
- **Size of the generated dataset:** 645.80 MB
- **Total amount of disk used:** 1.31 GB
An example of 'train' looks as follows.
```
{
    "answer": "This is the answer",
    "context": {
        "sentences": [["Sent 1"], ["Sent 2"]],
        "title": ["Title1", "Title 2"]
    },
    "id": "000001",
    "level": "hard",
    "question": "What is the answer?",
    "supporting_facts": {
        "sent_id": [0, 1, 3],
        "title": ["Title of para 1", "Title of para 2", "Title of para 3"]
    },
    "type": "bridge"
}
```
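In the example records above, `supporting_facts` and `context` each hold parallel lists, so a supporting fact has to be matched to its sentence by paragraph title and sentence index. The following is a minimal sketch of that lookup, not part of the official tooling. The function name `resolve_supporting_facts` and the sample record are hypothetical and invented for illustration. It assumes `sent_id` is a zero-based index into the paragraph's sentence list, and it skips any fact whose paragraph is absent from `context` or whose index is out of range.

```python
from typing import Any


def resolve_supporting_facts(record: dict[str, Any]) -> list[tuple[str, int, str]]:
    """Return (title, sent_id, sentence) for each supporting fact found in context."""
    context = record["context"]
    # Map each context paragraph title to its list of sentences.
    paragraphs = dict(zip(context["title"], context["sentences"]))

    facts = record["supporting_facts"]
    resolved = []
    for title, sent_id in zip(facts["title"], facts["sent_id"]):
        sentences = paragraphs.get(title)
        # Skip facts whose paragraph is absent or whose index is out of range.
        if sentences is None or not 0 <= sent_id < len(sentences):
            continue
        resolved.append((title, sent_id, sentences[sent_id]))
    return resolved


# Hypothetical record, invented for illustration only.
example = {
    "id": "000001",
    "question": "Which paragraph mentions tea?",
    "answer": "Title B",
    "type": "comparison",
    "level": "medium",
    "context": {
        "title": ["Title A", "Title B"],
        "sentences": [["A first."], ["B first.", "B mentions tea."]],
    },
    "supporting_facts": {
        "title": ["Title A", "Title B", "Title C"],
        "sent_id": [0, 1, 3],
    },
}

for title, sent_id, sentence in resolve_supporting_facts(example):
    print(f"{title}[{sent_id}]: {sentence}")
```

Skipping unresolved facts rather than raising is a design choice for this sketch: the third fact in the sample points at a paragraph that is not in `context`, and the function simply leaves it out of the result.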
### Data Fields
The data fields are the same among all splits.
#### distractor
- `id`: a `string` feature.
- `question`: a `string` feature.
- `answer`: a `string` feature.
- `type`: a `string` feature.
- `level`: a `string` feature.
- `supporting_facts`: a dictionary feature containing:
  - `title`: a `string` feature.
  - `sent_id`: an `int32` feature.
- `context`: a dictionary feature containing:
  - `title`: a `string` feature.
  - `sentences`: a `list` of `string` features.
#### fullwiki
- `id`: a `string` feature.
- `question`: a `string` feature.
- `answer`: a `string` feature.
- `type`: a `string` feature.
- `level`: a `string` feature.
- `supporting_facts`: a dictionary feature containing:
  - `title`: a `string` feature.
  - `sent_id`: an `int32` feature.
- `context`: a dictionary feature containing:
  - `title`: a `string` feature.
  - `sentences`: a `list` of `string` features.
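The field list above can be written down as Python types, which makes the nesting easier to see. This is a sketch under assumptions: the class names `SupportingFacts`, `Context`, and `HotpotRecord` and the helper `schema_problems` are hypothetical, and the list-valued layout of the two dictionary fields is inferred from the example records in this card, where each sub-field holds one list entry per item.

```python
from typing import TypedDict


class SupportingFacts(TypedDict):
    title: list[str]
    sent_id: list[int]


class Context(TypedDict):
    title: list[str]
    sentences: list[list[str]]


class HotpotRecord(TypedDict):
    id: str
    question: str
    answer: str
    type: str
    level: str
    supporting_facts: SupportingFacts
    context: Context


def schema_problems(record: dict) -> list[str]:
    """List missing top-level fields and parallel lists of unequal length."""
    problems = []
    for key in HotpotRecord.__annotations__:
        if key not in record:
            problems.append(f"missing field: {key}")
    groups = (
        ("supporting_facts", ("title", "sent_id")),
        ("context", ("title", "sentences")),
    )
    for key, parts in groups:
        group = record.get(key, {})
        lengths = {len(group.get(part, [])) for part in parts}
        if len(lengths) > 1:
            problems.append(f"{key}: parallel lists differ in length")
    return problems


# Hypothetical record shaped like the examples in this card.
sample: HotpotRecord = {
    "id": "000001",
    "question": "What is the answer?",
    "answer": "This is the answer",
    "type": "bridge",
    "level": "hard",
    "supporting_facts": {"title": ["Title1"], "sent_id": [0]},
    "context": {"title": ["Title1", "Title 2"], "sentences": [["Sent 1"], ["Sent 2"]]},
}

print(schema_problems(sample))  # prints []
```

The check only covers field presence and list lengths. It does not verify value types, which a fuller validator would add.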
### Data Splits
#### distractor
| |train|validation|
|----------|----:|---------:|
|distractor|90447| 7405|
#### fullwiki
| |train|validation|test|
|--------|----:|---------:|---:|
|fullwiki|90447| 7405|7405|
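The split counts in the tables above can serve as a quick sanity check after loading. The sketch below is illustrative only: `EXPECTED_SPLITS` copies the numbers from this card, and `check_split_sizes` is a hypothetical helper that accepts any mapping from split name to an object supporting `len()`. With the Hugging Face `datasets` library, the object returned by `load_dataset("Jackkoo/hotpot_qa", "distractor")` should fit that shape, but that call is not executed here and stand-in ranges are used instead.

```python
from collections.abc import Mapping, Sized

# Example counts per configuration and split, copied from the tables in this card.
EXPECTED_SPLITS = {
    "distractor": {"train": 90447, "validation": 7405},
    "fullwiki": {"train": 90447, "validation": 7405, "test": 7405},
}


def check_split_sizes(config_name: str, splits: Mapping[str, Sized]) -> list[str]:
    """Return a description of every split that is missing or has the wrong size."""
    mismatches = []
    for split, expected in EXPECTED_SPLITS[config_name].items():
        if split not in splits:
            mismatches.append(f"{split}: missing")
        elif len(splits[split]) != expected:
            mismatches.append(f"{split}: expected {expected}, got {len(splits[split])}")
    return mismatches


# Stand-in for a loaded dataset: ranges only provide the right lengths.
stand_in = {"train": range(90447), "validation": range(7405)}

print(check_split_sizes("distractor", stand_in))  # prints []
print(check_split_sizes("fullwiki", stand_in))    # prints ['test: missing']
```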
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
HotpotQA is distributed under a [CC BY-SA 4.0 License](http://creativecommons.org/licenses/by-sa/4.0/).
### Citation Information
```
@inproceedings{yang2018hotpotqa,
title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
year={2018}
}
```
### Contributions
Thanks to [@albertvillanova](https://github.com/albertvillanova), [@ghomasHudson](https://github.com/ghomasHudson) for adding this dataset.
Provider:
Jackkoo



