hotpotqa

Name: hotpotqa
Creator: maas
Published: 2026-05-12 09:24:40
License: No description available

ModelScope community · updated 2026-05-12 · indexed 2025-01-11

Download link:

https://modelscope.cn/datasets/MTEB/hotpotqa

Resource description:

<div align="center" style="padding: 40px 20px; background-color: white; border-radius: 12px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.05); max-width: 600px; margin: 0 auto;">
  <h1 style="font-size: 3.5rem; color: #1a1a1a; margin: 0 0 20px 0; letter-spacing: 2px; font-weight: 700;">HotpotQA</h1>
  <div style="font-size: 1.5rem; color: #4a4a4a; margin-bottom: 5px; font-weight: 300;">An <a href="https://github.com/embeddings-benchmark/mteb" style="color: #2c5282; font-weight: 600; text-decoration: none;" onmouseover="this.style.textDecoration='underline'" onmouseout="this.style.textDecoration='none'">MTEB</a> dataset</div>
  <div style="font-size: 0.9rem; color: #2c5282; margin-top: 10px;">Massive Text Embedding Benchmark</div>
</div>

HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.

|               |                             |
|---------------|-----------------------------|
| Task category | t2t                         |
| Domains       | Web, Written                |
| Reference     | https://hotpotqa.github.io/ |

## How to evaluate on this task

You can evaluate an embedding model on this dataset using the following code:

```python
import mteb

# Select the HotpotQA task and wrap it in an evaluator.
task = mteb.get_tasks(["HotpotQA"])
evaluator = mteb.MTEB(task)

# YOUR_MODEL is a placeholder: replace it with the name of the model to evaluate.
model = mteb.get_model(YOUR_MODEL)
evaluator.run(model)
```

To learn more about how to run models on `mteb` tasks, check out the [GitHub repository](https://github.com/embeddings-benchmark/mteb).

## Citation

If you use this dataset, please cite the dataset as well as [mteb](https://github.com/embeddings-benchmark/mteb), as this dataset likely includes additional processing as a part of the [MMTEB Contribution](https://github.com/embeddings-benchmark/mteb/tree/main/docs/mmteb).

```bibtex
@inproceedings{yang-etal-2018-hotpotqa,
  abstract = {Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems{'} ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.},
  address = {Brussels, Belgium},
  author = {Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  doi = {10.18653/v1/D18-1259},
  editor = {Riloff, Ellen and Chiang, David and Hockenmaier, Julia and Tsujii, Jun{'}ichi},
  month = oct # {-} # nov,
  pages = {2369--2380},
  publisher = {Association for Computational Linguistics},
  title = {{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  url = {https://aclanthology.org/D18-1259},
  year = {2018},
}

@article{enevoldsen2025mmtebmassivemultilingualtext,
  title={MMTEB: Massive Multilingual Text Embedding Benchmark},
  author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and Márton Kardos and Ashwin Mathur and David Stap and Jay Gala and Wissam Siblini and Dominik Krzemiński and Genta Indra Winata and Saba Sturua and Saiteja Utpala and Mathieu Ciancone and Marion Schaeffer and Gabriel Sequeira and Diganta Misra and Shreeya Dhakal and Jonathan Rystrøm and Roman Solomatin and Ömer Çağatan and Akash Kundu and Martin Bernstorff and Shitao Xiao and Akshita Sukhlecha and Bhavish Pahwa and Rafał Poświata and Kranthi Kiran GV and Shawon Ashraf and Daniel Auras and Björn Plüster and Jan Philipp Harries and Loïc Magne and Isabelle Mohr and Mariya Hendriksen and Dawei Zhu and Hippolyte Gisserot-Boukhlef and Tom Aarsen and Jan Kostkan and Konrad Wojtasik and Taemin Lee and Marek Šuppa and Crystina Zhang and Roberta Rocca and Mohammed Hamdy and Andrianos Michail and John Yang and Manuel Faysse and Aleksei Vatolin and Nandan Thakur and Manan Dey and Dipam Vasani and Pranjal Chitale and Simone Tedeschi and Nguyen Tai and Artem Snegirev and Michael Günther and Mengzhou Xia and Weijia Shi and Xing Han Lù and Jordan Clive and Gayatri Krishnakumar and Anna Maksimova and Silvan Wehrli and Maria Tikhonova and Henil Panchal and Aleksandr Abramov and Malte Ostendorff and Zheng Liu and Simon Clematide and Lester James Miranda and Alena Fenogenova and Guangyu Song and Ruqiya Bin Safi and Wen-Ding Li and Alessia Borghini and Federico Cassano and Hongjin Su and Jimmy Lin and Howard Yen and Lasse Hansen and Sara Hooker and Chenghao Xiao and Vaibhav Adlakha and Orion Weller and Siva Reddy and Niklas Muennighoff},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2502.13595},
  year={2025},
  url={https://arxiv.org/abs/2502.13595},
  doi = {10.48550/arXiv.2502.13595},
}

@article{muennighoff2022mteb,
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2210.07316},
  year = {2022},
  url = {https://arxiv.org/abs/2210.07316},
  doi = {10.48550/ARXIV.2210.07316},
}
```

# Dataset Statistics

<details>
<summary> Dataset Statistics</summary>

The following code contains the descriptive statistics from the task. These can also be obtained using:

```python
import mteb

task = mteb.get_task("HotpotQA")

desc_stats = task.metadata.descriptive_stats
```

```json
{
  "train": {
    "num_samples": 5318329,
    "number_of_characters": 1520922083,
    "num_documents": 5233329,
    "min_document_length": 9,
    "average_document_length": 288.9079517072212,
    "max_document_length": 8276,
    "unique_documents": 5233329,
    "num_queries": 85000,
    "min_query_length": 13,
    "average_query_length": 105.54965882352941,
    "max_query_length": 654,
    "unique_queries": 85000,
    "none_queries": 0,
    "num_relevant_docs": 170000,
    "min_relevant_docs_per_query": 2,
    "average_relevant_docs_per_query": 2.0,
    "max_relevant_docs_per_query": 2,
    "unique_relevant_docs": 101307,
    "num_instructions": null,
    "min_instruction_length": null,
    "average_instruction_length": null,
    "max_instruction_length": null,
    "unique_instructions": null,
    "num_top_ranked": null,
    "min_top_ranked_per_query": null,
    "average_top_ranked_per_query": null,
    "max_top_ranked_per_query": null
  },
  "dev": {
    "num_samples": 5238776,
    "number_of_characters": 1512524238,
    "num_documents": 5233329,
    "min_document_length": 9,
    "average_document_length": 288.9079517072212,
    "max_document_length": 8276,
    "unique_documents": 5233329,
    "num_queries": 5447,
    "min_query_length": 18,
    "average_query_length": 105.35634294106848,
    "max_query_length": 630,
    "unique_queries": 5447,
    "none_queries": 0,
    "num_relevant_docs": 10894,
    "min_relevant_docs_per_query": 2,
    "average_relevant_docs_per_query": 2.0,
    "max_relevant_docs_per_query": 2,
    "unique_relevant_docs": 10335,
    "num_instructions": null,
    "min_instruction_length": null,
    "average_instruction_length": null,
    "max_instruction_length": null,
    "unique_instructions": null,
    "num_top_ranked": null,
    "min_top_ranked_per_query": null,
    "average_top_ranked_per_query": null,
    "max_top_ranked_per_query": null
  },
  "test": {
    "num_samples": 5240734,
    "number_of_characters": 1512632888,
    "num_documents": 5233329,
    "min_document_length": 9,
    "average_document_length": 288.9079517072212,
    "max_document_length": 8276,
    "unique_documents": 5233329,
    "num_queries": 7405,
    "min_query_length": 32,
    "average_query_length": 92.17096556380824,
    "max_query_length": 288,
    "unique_queries": 7405,
    "none_queries": 0,
    "num_relevant_docs": 14810,
    "min_relevant_docs_per_query": 2,
    "average_relevant_docs_per_query": 2.0,
    "max_relevant_docs_per_query": 2,
    "unique_relevant_docs": 13783,
    "num_instructions": null,
    "min_instruction_length": null,
    "average_instruction_length": null,
    "max_instruction_length": null,
    "unique_instructions": null,
    "num_top_ranked": null,
    "min_top_ranked_per_query": null,
    "average_top_ranked_per_query": null,
    "max_top_ranked_per_query": null
  }
}
```

</details>

---

*This dataset card was automatically generated using [MTEB](https://github.com/embeddings-benchmark/mteb)*
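The split-level counts in the descriptive statistics above appear to follow two simple relationships: `num_samples` equals `num_documents` plus `num_queries`, and `num_relevant_docs` equals twice `num_queries` (every query has exactly two relevant documents). The sketch below checks both on figures copied by hand from the JSON. It does not call `mteb`, and the helper name `check_split` is hypothetical, introduced only for illustration.

```python
# Figures copied from the descriptive statistics JSON in the dataset card.
stats = {
    "train": {"num_samples": 5318329, "num_documents": 5233329,
              "num_queries": 85000, "num_relevant_docs": 170000},
    "dev": {"num_samples": 5238776, "num_documents": 5233329,
            "num_queries": 5447, "num_relevant_docs": 10894},
    "test": {"num_samples": 5240734, "num_documents": 5233329,
             "num_queries": 7405, "num_relevant_docs": 14810},
}


def check_split(split):
    """Return True when a split's counts are internally consistent.

    Hypothetical helper: the two relationships are inferred from the
    published numbers, not taken from mteb documentation.
    """
    # Samples are the shared corpus plus that split's queries.
    samples_ok = split["num_samples"] == split["num_documents"] + split["num_queries"]
    # Each query is labeled with exactly two relevant documents.
    qrels_ok = split["num_relevant_docs"] == 2 * split["num_queries"]
    return samples_ok and qrels_ok


results = {name: check_split(split) for name, split in stats.items()}
print(results)
```

All three splits share the same 5,233,329-document corpus, so the differences in `num_samples` come entirely from the number of queries per split.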

Provider:

maas

Created:

2024-09-06

Collected summary

Dataset introduction

Background and challenges

Background overview

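HotpotQA is a question answering dataset in MTEB. It contains natural, multi-hop questions based on Wikipedia and provides strong supervision for supporting facts, with the aim of making question answering systems more explainable. The dataset is large, spans train, dev, and test splits, and is used to evaluate the performance of embedding models.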
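Because each multi-hop question is paired with two relevant documents, a retrieval run only fully succeeds when both appear among the top results. The sketch below illustrates this with a plain recall@k computation on toy data. The document ids and the function `recall_at_k` are hypothetical, and recall@k is used here only as an illustration, not as the metric MTEB necessarily reports for this task.

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the relevant documents found in the top-k ranked results."""
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & set(relevant_doc_ids)) / len(relevant_doc_ids)


# Toy ranking for one query; the ids are made up for illustration.
ranked = ["doc_a", "doc_x", "doc_y", "doc_b", "doc_z"]
# A multi-hop question has two relevant documents.
relevant = {"doc_a", "doc_b"}

r2 = recall_at_k(ranked, relevant, 2)  # only doc_a is in the top 2
r5 = recall_at_k(ranked, relevant, 5)  # both documents are in the top 5
print(r2, r5)
```

With a cutoff of 2 the model retrieves only one of the two hops; the second supporting document shows up only once the cutoff is widened.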

The content above was collected and summarized by 遇见数据集.

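HotpotQA

HotpotQA is a question answering dataset collected on English Wikipedia, containing about 113K crowdsourced questions that are constructed to require the introductory paragraphs of two Wikipedia articles to answer. Each question in the dataset comes with the two gold paragraphs, as well as a list of sentences in those paragraphs that crowdworkers identified as supporting facts necessary to answer the question. HotpotQA features a variety of reasoning strategies, including questions involving entities missing from the question, intersection questions (what satisfies property A and property B?), and comparison questions, where two entities are compared through a shared…

OpenCSG · updated 2024-03-21 · 180

HotpotQA

OpenDataLab · updated 2026-07-12 · 14840

HotpotQA

This dataset is used as a benchmark for evaluating the performance of ReAct agents, which interleave multiple large language model (LLM) calls with actions such as web search. It is also used to mirror the setup described in the original ReAct agent framework paper. The task involved is question answering.

arXiv · 240

HOTPOTQA

HOTPOTQA is a large dataset of 113,000 Wikipedia-based question-answer pairs, created by Carnegie Mellon University and other institutions. Its questions require reasoning over multiple documents, are diverse in type, and do not depend on pre-existing knowledge bases or knowledge schemas. HOTPOTQA also provides sentence-level supporting facts to help QA systems reason under strong supervision and explain their predictions. The dataset additionally introduces a new type of factoid comparison question to test a QA system's ability to extract relevant facts and perform the necessary comparisons. HOTPOTQA…

arXiv · updated 2018-09-26 · 2640

HotpotQA

OpenXLab · 00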