dnaori/hotpot_qa
Source: Hugging Face. Updated 2026-03-03; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/dnaori/hotpot_qa
Resource description:
---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: HotpotQA
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- question-answering
task_ids: []
paperswithcode_id: hotpotqa
tags:
- multi-hop
dataset_info:
- config_name: distractor
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: level
    dtype: string
  - name: supporting_facts
    sequence:
    - name: title
      dtype: string
    - name: sent_id
      dtype: int32
  - name: context
    sequence:
    - name: title
      dtype: string
    - name: sentences
      sequence: string
  splits:
  - name: train
    num_bytes: 552948795
    num_examples: 90447
  - name: validation
    num_bytes: 45716059
    num_examples: 7405
  download_size: 359239231
  dataset_size: 598664854
- config_name: fullwiki
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: level
    dtype: string
  - name: supporting_facts
    sequence:
    - name: title
      dtype: string
    - name: sent_id
      dtype: int32
  - name: context
    sequence:
    - name: title
      dtype: string
    - name: sentences
      sequence: string
  splits:
  - name: train
    num_bytes: 552948795
    num_examples: 90447
  - name: validation
    num_bytes: 46848549
    num_examples: 7405
  - name: test
    num_bytes: 45999922
    num_examples: 7405
  download_size: 387387120
  dataset_size: 645797266
configs:
- config_name: distractor
  data_files:
  - split: train
    path: distractor/train-*
  - split: validation
    path: distractor/validation-*
- config_name: fullwiki
  data_files:
  - split: train
    path: fullwiki/train-*
  - split: validation
    path: fullwiki/validation-*
  - split: test
    path: fullwiki/test-*
---
# Dataset Card for "hotpot_qa"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://hotpotqa.github.io/](https://hotpotqa.github.io/)
- **Repository:** https://github.com/hotpotqa/hotpot
- **Paper:** [HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering](https://arxiv.org/abs/1809.09600)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 1.27 GB
- **Size of the generated dataset:** 1.24 GB
- **Total amount of disk used:** 2.52 GB
### Dataset Summary
HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### distractor
- **Size of downloaded dataset files:** 612.75 MB
- **Size of the generated dataset:** 598.66 MB
- **Total amount of disk used:** 1.21 GB
An example of 'validation' looks as follows.
```
{
    "answer": "This is the answer",
    "context": {
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
        "title": ["Title1", "Title 2"]
    },
    "id": "000001",
    "level": "medium",
    "question": "What is the answer?",
    "supporting_facts": {
        "sent_id": [0, 1, 3],
        "title": ["Title of para 1", "Title of para 2", "Title of para 3"]
    },
    "type": "comparison"
}
```
#### fullwiki
- **Size of downloaded dataset files:** 660.10 MB
- **Size of the generated dataset:** 645.80 MB
- **Total amount of disk used:** 1.31 GB
An example of 'train' looks as follows.
```
{
    "answer": "This is the answer",
    "context": {
        "sentences": [["Sent 1"], ["Sent 2"]],
        "title": ["Title1", "Title 2"]
    },
    "id": "000001",
    "level": "hard",
    "question": "What is the answer?",
    "supporting_facts": {
        "sent_id": [0, 1, 3],
        "title": ["Title of para 1", "Title of para 2", "Title of para 3"]
    },
    "type": "bridge"
}
```
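The `context` field in the examples above stores titles and sentence lists as two parallel lists, so entry `i` of `title` belongs with entry `i` of `sentences`. The following is a minimal sketch of pairing them into paragraphs; the helper name `context_paragraphs` is hypothetical, and joining sentences with a single space is an assumption made for readability rather than something the card specifies.

```python
from typing import Dict, List, Tuple


def context_paragraphs(example: Dict) -> List[Tuple[str, str]]:
    """Pair each context title with its sentences joined into one paragraph.

    Assumes `context["title"]` and `context["sentences"]` are parallel lists,
    as in the example records shown in this card.
    """
    context = example["context"]
    return [
        # Joining with a space is an illustrative choice, not part of the dataset.
        (title, " ".join(sentence.strip() for sentence in sentences))
        for title, sentences in zip(context["title"], context["sentences"])
    ]


# Placeholder record copied from the 'validation' example of the distractor config.
example = {
    "context": {
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
        "title": ["Title1", "Title 2"],
    },
}

for title, paragraph in context_paragraphs(example):
    print(f"{title}: {paragraph}")
```

On the placeholder record this prints one line per context paragraph, each line holding the title followed by the joined sentences.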
### Data Fields
The data fields are the same among all splits.
#### distractor
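- `id`: a `string` feature.
- `question`: a `string` feature, the question text.
- `answer`: a `string` feature, the answer to the question.
- `type`: a `string` feature, the question type.
- `level`: a `string` feature, the difficulty level of the question.
- `supporting_facts`: a dictionary feature containing:
  - `title`: a `string` feature, the title of a supporting document.
  - `sent_id`: an `int32` feature, the index of the supporting sentence within that document.
- `context`: a dictionary feature containing:
  - `title`: a `string` feature, the title of a context document.
  - `sentences`: a `list` of `string` features, the sentences of that context document.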
#### fullwiki
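- `id`: a `string` feature.
- `question`: a `string` feature, the question text.
- `answer`: a `string` feature, the answer to the question.
- `type`: a `string` feature, the question type.
- `level`: a `string` feature, the difficulty level of the question.
- `supporting_facts`: a dictionary feature containing:
  - `title`: a `string` feature, the title of a supporting document.
  - `sent_id`: an `int32` feature, the index of the supporting sentence within that document.
- `context`: a dictionary feature containing:
  - `title`: a `string` feature, the title of a context document.
  - `sentences`: a `list` of `string` features, the sentences of that context document.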
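Because `supporting_facts` refers to sentences by `title` and `sent_id` rather than by text, recovering the supporting sentences means looking them up in `context`. The sketch below shows one way to do that. It assumes `sent_id` is a 0-based index into the sentence list of the context paragraph with the same title, and that titles within one record are unique; neither is stated explicitly in this card. The function name `resolve_supporting_facts` is hypothetical. Facts that cannot be resolved (a title absent from `context`, or an index past the end of the paragraph) are returned with `None` instead of raising.

```python
from typing import Dict, List, Optional, Tuple


def resolve_supporting_facts(example: Dict) -> List[Tuple[str, int, Optional[str]]]:
    """Return (title, sent_id, sentence) triples for each supporting fact.

    The sentence is None when the title is not in `context` or when
    `sent_id` is out of range for that paragraph.
    """
    context = example["context"]
    # Assumption: titles are unique within a record, so a dict lookup is safe.
    sentences_by_title = dict(zip(context["title"], context["sentences"]))

    facts = example["supporting_facts"]
    resolved = []
    for title, sent_id in zip(facts["title"], facts["sent_id"]):
        sentences = sentences_by_title.get(title)
        if sentences is not None and 0 <= sent_id < len(sentences):
            resolved.append((title, sent_id, sentences[sent_id]))
        else:
            resolved.append((title, sent_id, None))
    return resolved


# Illustrative record in the same shape as the examples in this card.
record = {
    "context": {
        "title": ["Title1", "Title 2"],
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
    },
    "supporting_facts": {
        "title": ["Title 2", "Title1", "Missing title"],
        "sent_id": [1, 0, 3],
    },
}

for title, sent_id, sentence in resolve_supporting_facts(record):
    print(title, sent_id, sentence)
```

Returning `None` for unresolved facts keeps the output aligned with the input lists, which may be convenient when a supporting paragraph is not present in `context`.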
### Data Splits
#### distractor
| |train|validation|
|----------|----:|---------:|
|distractor|90447| 7405|
#### fullwiki
| |train|validation|test|
|--------|----:|---------:|---:|
|fullwiki|90447| 7405|7405|
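The two configurations expose different splits: only `fullwiki` has a `test` split. A loading helper can check the requested combination before any download starts. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository id `dnaori/hotpot_qa` (taken from the download link of this listing) resolves on the hub being used; the names `SPLITS`, `check_split`, and `load_hotpot_qa` are hypothetical.

```python
from typing import Dict, Tuple

REPO_ID = "dnaori/hotpot_qa"  # assumption: repository id from this listing's download link

# Splits per configuration, as listed in the tables above.
SPLITS: Dict[str, Tuple[str, ...]] = {
    "distractor": ("train", "validation"),
    "fullwiki": ("train", "validation", "test"),
}


def check_split(config: str, split: str) -> None:
    """Raise ValueError if the config or split is not listed in this card."""
    if config not in SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(SPLITS)}")
    if split not in SPLITS[config]:
        raise ValueError(
            f"config {config!r} has no split {split!r}; available: {SPLITS[config]}"
        )


def load_hotpot_qa(config: str = "distractor", split: str = "validation"):
    """Validate the request, then load one split with `datasets.load_dataset`."""
    check_split(config, split)
    # Imported lazily so the validation helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

A call such as `load_hotpot_qa("fullwiki", "test")` would then download and return the test split, while `load_hotpot_qa("distractor", "test")` fails early with a `ValueError`.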
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
HotpotQA is distributed under a [CC BY-SA 4.0 License](http://creativecommons.org/licenses/by-sa/4.0/).
### Citation Information
```
@inproceedings{yang2018hotpotqa,
title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
year={2018}
}
```
### Contributions
Thanks to [@albertvillanova](https://github.com/albertvillanova), [@ghomasHudson](https://github.com/ghomasHudson) for adding this dataset.
Provider: dnaori



