
dnaori/hotpot_qa

Hugging Face | Updated 2026-03-03 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/dnaori/hotpot_qa
Resource description:
---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: HotpotQA
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- question-answering
task_ids: []
paperswithcode_id: hotpotqa
tags:
- multi-hop
dataset_info:
- config_name: distractor
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: level
    dtype: string
  - name: supporting_facts
    sequence:
    - name: title
      dtype: string
    - name: sent_id
      dtype: int32
  - name: context
    sequence:
    - name: title
      dtype: string
    - name: sentences
      sequence: string
  splits:
  - name: train
    num_bytes: 552948795
    num_examples: 90447
  - name: validation
    num_bytes: 45716059
    num_examples: 7405
  download_size: 359239231
  dataset_size: 598664854
- config_name: fullwiki
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: level
    dtype: string
  - name: supporting_facts
    sequence:
    - name: title
      dtype: string
    - name: sent_id
      dtype: int32
  - name: context
    sequence:
    - name: title
      dtype: string
    - name: sentences
      sequence: string
  splits:
  - name: train
    num_bytes: 552948795
    num_examples: 90447
  - name: validation
    num_bytes: 46848549
    num_examples: 7405
  - name: test
    num_bytes: 45999922
    num_examples: 7405
  download_size: 387387120
  dataset_size: 645797266
configs:
- config_name: distractor
  data_files:
  - split: train
    path: distractor/train-*
  - split: validation
    path: distractor/validation-*
- config_name: fullwiki
  data_files:
  - split: train
    path: fullwiki/train-*
  - split: validation
    path: fullwiki/validation-*
  - split: test
    path: fullwiki/test-*
---

# Dataset Card for "hotpot_qa"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://hotpotqa.github.io/](https://hotpotqa.github.io/)
- **Repository:** https://github.com/hotpotqa/hotpot
- **Paper:** [HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering](https://arxiv.org/abs/1809.09600)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 1.27 GB
- **Size of the generated dataset:** 1.24 GB
- **Total amount of disk used:** 2.52 GB

### Dataset Summary

HotpotQA is a dataset of 113k Wikipedia-based question-answer pairs with four key features:

1. The questions require finding and reasoning over multiple supporting documents to answer.
2. The questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas.
3. Sentence-level supporting facts required for reasoning are provided, allowing QA systems to reason with strong supervision and explain their predictions.
4. A new type of factoid comparison question tests a QA system's ability to extract relevant facts and perform the necessary comparison.
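The size figures in the Dataset Description can be cross-checked against the byte counts declared in the front matter. The following is a minimal sketch, assuming that "GB" on this card means decimal gigabytes (10^9 bytes); the names `SPLIT_BYTES`, `DECLARED_DATASET_SIZE`, `config_size`, and `total_size_gb` are hypothetical helpers introduced only for this illustration.

```python
# Byte counts copied from the `dataset_info` front matter of this card.
SPLIT_BYTES = {
    "distractor": {"train": 552948795, "validation": 45716059},
    "fullwiki": {"train": 552948795, "validation": 46848549, "test": 45999922},
}
# `dataset_size` values declared for each configuration in the front matter.
DECLARED_DATASET_SIZE = {"distractor": 598664854, "fullwiki": 645797266}


def config_size(config: str) -> int:
    """Sum the per-split byte counts of one configuration."""
    return sum(SPLIT_BYTES[config].values())


def total_size_gb() -> float:
    """Generated size over both configurations, in decimal gigabytes (assumed unit)."""
    return sum(config_size(name) for name in SPLIT_BYTES) / 1e9


for name, declared in DECLARED_DATASET_SIZE.items():
    # Each configuration's dataset_size should equal the sum of its splits.
    print(name, config_size(name) == declared)
print(f"{total_size_gb():.2f} GB")
```

Under that unit assumption, the per-split byte counts add up to each declared `dataset_size`, and the two configurations together come to about 1.24 GB, which matches the generated-dataset size listed in the Dataset Description.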
### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### distractor

- **Size of downloaded dataset files:** 612.75 MB
- **Size of the generated dataset:** 598.66 MB
- **Total amount of disk used:** 1.21 GB

An example from the 'validation' split looks as follows.

```
{
    "answer": "This is the answer",
    "context": {
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
        "title": ["Title1", "Title 2"]
    },
    "id": "000001",
    "level": "medium",
    "question": "What is the answer?",
    "supporting_facts": {
        "sent_id": [0, 1, 3],
        "title": ["Title of para 1", "Title of para 2", "Title of para 3"]
    },
    "type": "comparison"
}
```

#### fullwiki

- **Size of downloaded dataset files:** 660.10 MB
- **Size of the generated dataset:** 645.80 MB
- **Total amount of disk used:** 1.31 GB

An example from the 'train' split looks as follows.

```
{
    "answer": "This is the answer",
    "context": {
        "sentences": [["Sent 1"], ["Sent 2"]],
        "title": ["Title1", "Title 2"]
    },
    "id": "000001",
    "level": "hard",
    "question": "What is the answer?",
    "supporting_facts": {
        "sent_id": [0, 1, 3],
        "title": ["Title of para 1", "Title of para 2", "Title of para 3"]
    },
    "type": "bridge"
}
```

### Data Fields

The data fields are the same among all splits.

#### distractor

- `id`: a `string` feature.
- `question`: a `string` feature; the question text.
- `answer`: a `string` feature; the answer to the question.
- `type`: a `string` feature; the question type.
- `level`: a `string` feature; the difficulty level of the question.
- `supporting_facts`: a dictionary feature containing:
  - `title`: a `string` feature; the title of a supporting document.
  - `sent_id`: an `int32` feature; the index of the supporting sentence within that document.
- `context`: a dictionary feature containing:
  - `title`: a `string` feature; the title of a context document.
  - `sentences`: a `list` of `string` features; the sentences of that context document.

#### fullwiki

- `id`: a `string` feature.
- `question`: a `string` feature; the question text.
- `answer`: a `string` feature; the answer to the question.
- `type`: a `string` feature; the question type.
- `level`: a `string` feature; the difficulty level of the question.
- `supporting_facts`: a dictionary feature containing:
  - `title`: a `string` feature; the title of a supporting document.
  - `sent_id`: an `int32` feature; the index of the supporting sentence within that document.
- `context`: a dictionary feature containing:
  - `title`: a `string` feature; the title of a context document.
  - `sentences`: a `list` of `string` features; the sentences of that context document.

### Data Splits

#### distractor

|            | train | validation |
|------------|------:|-----------:|
| distractor | 90447 |       7405 |

#### fullwiki

|          | train | validation | test |
|----------|------:|-----------:|-----:|
| fullwiki | 90447 |       7405 | 7405 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

HotpotQA is distributed under a [CC BY-SA 4.0 License](http://creativecommons.org/licenses/by-sa/4.0/).

### Citation Information

```
@inproceedings{yang2018hotpotqa,
  title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2018}
}
```

### Contributions

Thanks to [@albertvillanova](https://github.com/albertvillanova), [@ghomasHudson](https://github.com/ghomasHudson) for adding this dataset.
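
The `supporting_facts` and `context` fields described under Data Fields can be joined to recover the text of each supporting sentence. The sketch below is illustrative only. It assumes that `supporting_facts["title"]` and `supporting_facts["sent_id"]` are parallel lists, that each supporting title matches a title in `context["title"]`, and that `sent_id` is a zero-based index into that paragraph's sentences. The toy `record` and the helper name `supporting_sentences` are made up for this example and are not part of the dataset or of any library.

```python
from typing import Dict, List, Tuple

# Toy record in the card's schema; the titles and sentences are made up.
record = {
    "question": "Which paragraph is longer?",
    "context": {
        "title": ["Title 1", "Title 2"],
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
    },
    "supporting_facts": {
        "title": ["Title 1", "Title 2"],
        "sent_id": [0, 1],
    },
}


def supporting_sentences(example: Dict) -> List[Tuple[str, int, str]]:
    """Return (title, sent_id, sentence) for each resolvable supporting fact."""
    context = example["context"]
    # Map each context paragraph title to its list of sentences.
    paragraphs = dict(zip(context["title"], context["sentences"]))
    facts = example["supporting_facts"]
    resolved = []
    for title, sent_id in zip(facts["title"], facts["sent_id"]):
        sentences = paragraphs.get(title)
        # Skip facts whose paragraph or sentence index is absent from the context
        # (a choice made for this sketch, not behavior defined by the dataset).
        if sentences is None or not 0 <= sent_id < len(sentences):
            continue
        resolved.append((title, sent_id, sentences[sent_id]))
    return resolved


for title, sent_id, sentence in supporting_sentences(record):
    print(f"{title}[{sent_id}]: {sentence}")
```

Silently skipping unresolvable facts keeps the sketch short; a stricter variant could raise an error instead, which may be preferable when validating data.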

Provider:
dnaori