Jackkoo/hotpot_qa

Name: Jackkoo/hotpot_qa
Creator: Jackkoo
Published: 2026-03-30 07:32:22
License: No description available

Hugging Face · Updated 2026-03-30 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Jackkoo/hotpot_qa

Resource description:

---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: HotpotQA
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- question-answering
task_ids: []
paperswithcode_id: hotpotqa
tags:
- multi-hop
dataset_info:
- config_name: distractor
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: level
    dtype: string
  - name: supporting_facts
    sequence:
    - name: title
      dtype: string
    - name: sent_id
      dtype: int32
  - name: context
    sequence:
    - name: title
      dtype: string
    - name: sentences
      sequence: string
  splits:
  - name: train
    num_bytes: 552948795
    num_examples: 90447
  - name: validation
    num_bytes: 45716059
    num_examples: 7405
  download_size: 359239231
  dataset_size: 598664854
- config_name: fullwiki
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: level
    dtype: string
  - name: supporting_facts
    sequence:
    - name: title
      dtype: string
    - name: sent_id
      dtype: int32
  - name: context
    sequence:
    - name: title
      dtype: string
    - name: sentences
      sequence: string
  splits:
  - name: train
    num_bytes: 552948795
    num_examples: 90447
  - name: validation
    num_bytes: 46848549
    num_examples: 7405
  - name: test
    num_bytes: 45999922
    num_examples: 7405
  download_size: 387387120
  dataset_size: 645797266
configs:
- config_name: distractor
  data_files:
  - split: train
    path: distractor/train-*
  - split: validation
    path: distractor/validation-*
- config_name: fullwiki
  data_files:
  - split: train
    path: fullwiki/train-*
  - split: validation
    path: fullwiki/validation-*
  - split: test
    path: fullwiki/test-*
---

# Dataset Card for "hotpot_qa"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://hotpotqa.github.io/](https://hotpotqa.github.io/)
- **Repository:** https://github.com/hotpotqa/hotpot
- **Paper:** [HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering](https://arxiv.org/abs/1809.09600)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 1.27 GB
- **Size of the generated dataset:** 1.24 GB
- **Total amount of disk used:** 2.52 GB

### Dataset Summary

HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison.
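
The two configurations declared in the card metadata (`distractor` and `fullwiki`) expose different splits, so it can help to validate a request before downloading anything. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` library is installed and that the repository id `Jackkoo/hotpot_qa` from the download link is reachable. The names `CONFIG_SPLITS` and `load_hotpot` are hypothetical helpers introduced only for illustration.

```python
# Splits per configuration, copied from the `configs` section of the card metadata.
CONFIG_SPLITS = {
    "distractor": ("train", "validation"),
    "fullwiki": ("train", "validation", "test"),
}


def load_hotpot(config="distractor", split="validation"):
    """Validate config/split against the card, then load that split.

    Hypothetical helper: the repository id is taken from the download link
    on this page and is assumed to be reachable from your environment.
    """
    if config not in CONFIG_SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    return load_dataset("Jackkoo/hotpot_qa", config, split=split)
```

Calling `load_hotpot("distractor", "validation")` would then request the validation split of the `distractor` configuration, while `load_hotpot("distractor", "test")` fails early with a `ValueError` because the card lists a `test` split only for `fullwiki`.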

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### distractor

- **Size of downloaded dataset files:** 612.75 MB
- **Size of the generated dataset:** 598.66 MB
- **Total amount of disk used:** 1.21 GB

An example of 'validation' looks as follows.

```
{
    "answer": "This is the answer",
    "context": {
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
        "title": ["Title1", "Title 2"]
    },
    "id": "000001",
    "level": "medium",
    "question": "What is the answer?",
    "supporting_facts": {
        "sent_id": [0, 1, 3],
        "title": ["Title of para 1", "Title of para 2", "Title of para 3"]
    },
    "type": "comparison"
}
```

#### fullwiki

- **Size of downloaded dataset files:** 660.10 MB
- **Size of the generated dataset:** 645.80 MB
- **Total amount of disk used:** 1.31 GB

An example of 'train' looks as follows.

```
{
    "answer": "This is the answer",
    "context": {
        "sentences": [["Sent 1"], ["Sent 2"]],
        "title": ["Title1", "Title 2"]
    },
    "id": "000001",
    "level": "hard",
    "question": "What is the answer?",
    "supporting_facts": {
        "sent_id": [0, 1, 3],
        "title": ["Title of para 1", "Title of para 2", "Title of para 3"]
    },
    "type": "bridge"
}
```

### Data Fields

The data fields are the same among all splits.

#### distractor

- `id`: a `string` feature.
- `question`: a `string` feature.
- `answer`: a `string` feature.
- `type`: a `string` feature.
- `level`: a `string` feature.
- `supporting_facts`: a dictionary feature containing:
  - `title`: a `string` feature.
  - `sent_id`: an `int32` feature.
- `context`: a dictionary feature containing:
  - `title`: a `string` feature.
  - `sentences`: a `list` of `string` features.

#### fullwiki

- `id`: a `string` feature.
- `question`: a `string` feature.
- `answer`: a `string` feature.
- `type`: a `string` feature.
- `level`: a `string` feature.
- `supporting_facts`: a dictionary feature containing:
  - `title`: a `string` feature.
  - `sent_id`: an `int32` feature.
- `context`: a dictionary feature containing:
  - `title`: a `string` feature.
  - `sentences`: a `list` of `string` features.

### Data Splits

#### distractor

|          |train|validation|
|----------|----:|---------:|
|distractor|90447|      7405|

#### fullwiki

|        |train|validation|test|
|--------|----:|---------:|---:|
|fullwiki|90447|      7405|7405|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

HotpotQA is distributed under a [CC BY-SA 4.0 License](http://creativecommons.org/licenses/by-sa/4.0/).

### Citation Information

```
@inproceedings{yang2018hotpotqa,
  title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2018}
}
```

### Contributions

Thanks to [@albertvillanova](https://github.com/albertvillanova), [@ghomasHudson](https://github.com/ghomasHudson) for adding this dataset.
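
### Example: resolving supporting facts

The `supporting_facts` and `context` fields listed under Data Fields are stored as parallel lists, so reading a supporting sentence means pairing each `(title, sent_id)` with the matching paragraph in `context`. The sketch below illustrates one way to do that. It assumes that supporting-fact titles match `context` titles exactly and that `sent_id` is a zero-based index into that paragraph's `sentences` list; the card does not state this explicitly. The function name `supporting_sentences` and the sample record are hypothetical and only mirror the shape of the card's placeholder example.

```python
def supporting_sentences(example):
    """Return (title, sent_id, sentence) for every supporting fact that resolves.

    Facts whose title is absent from `context`, or whose `sent_id` falls
    outside the paragraph, are skipped rather than raising an error.
    """
    context = example["context"]
    # Map each paragraph title to its list of sentences.
    by_title = dict(zip(context["title"], context["sentences"]))

    facts = example["supporting_facts"]
    resolved = []
    for title, sent_id in zip(facts["title"], facts["sent_id"]):
        sentences = by_title.get(title)
        if sentences is not None and 0 <= sent_id < len(sentences):
            resolved.append((title, sent_id, sentences[sent_id]))
    return resolved


# Hypothetical record shaped like the card's placeholder example.
sample = {
    "context": {
        "title": ["Title 1", "Title 2"],
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
    },
    "supporting_facts": {
        "title": ["Title 1", "Title 2", "Title 3"],
        "sent_id": [0, 1, 3],
    },
}

for title, sent_id, sentence in supporting_sentences(sample):
    print(f"{title}[{sent_id}]: {sentence}")
```

Skipping unresolved facts instead of raising is a design choice for exploratory use; a stricter pipeline might prefer to log or count them.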


Provider:

Jackkoo
