
wiki_qa

ModelScope community · updated 2025-12-05 · indexed 2025-07-26
Download link:
https://modelscope.cn/datasets/microsoft/wiki_qa
Resource description:
# Dataset Card for "wiki_qa"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://www.microsoft.com/en-us/download/details.aspx?id=52419](https://www.microsoft.com/en-us/download/details.aspx?id=52419)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [WikiQA: A Challenge Dataset for Open-Domain Question Answering](https://aclanthology.org/D15-1237/)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 7.10 MB
- **Size of the generated dataset:** 6.40 MB
- **Total amount of disk used:** 13.50 MB

### Dataset Summary

Wiki Question Answering corpus from Microsoft.

The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.
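
As a rough illustration of what "question and sentence pairs" means in practice, the sketch below groups candidate sentences by question and picks out the ones labeled as answers. It uses the field names listed under Data Fields in this card. The second record is hypothetical, the helper names `group_by_question` and `positive_sentences` are made up for this example, and reading `label == 1` as "this sentence answers the question" is an assumption based on the WikiQA paper rather than something this card states.

```python
from collections import defaultdict

# Sample records in the card's schema. Only the first row comes from the
# card; the second is a hypothetical row added for illustration.
records = [
    {
        "question_id": "Q1",
        "question": "how are glacier caves formed?",
        "document_title": "Glacier cave",
        "answer": "Glacier caves are often called ice caves , but this term "
                  "is properly used to describe bedrock caves that contain "
                  "year-round ice.",
        "label": 0,
    },
    {
        "question_id": "Q1",
        "question": "how are glacier caves formed?",
        "document_title": "Glacier cave",
        "answer": "A hypothetical candidate sentence marked as an answer.",
        "label": 1,
    },
]


def group_by_question(rows):
    """Map each question_id to its list of (candidate sentence, label) pairs."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["question_id"]].append((row["answer"], row["label"]))
    return dict(grouped)


def positive_sentences(grouped, question_id):
    """Return candidates with label 1 (assumed to mean 'answers the question')."""
    return [sent for sent, label in grouped.get(question_id, []) if label == 1]


grouped = group_by_question(records)
print(len(grouped["Q1"]))                 # number of candidates for Q1
print(positive_sentences(grouped, "Q1"))  # candidates labeled 1
```

The same grouping should apply unchanged to rows loaded from the real splits, since the card states that all splits share these fields.
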
### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 7.10 MB
- **Size of the generated dataset:** 6.40 MB
- **Total amount of disk used:** 13.50 MB

An example of 'train' looks as follows.

```
{
    "answer": "Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice.",
    "document_title": "Glacier cave",
    "label": 0,
    "question": "how are glacier caves formed?",
    "question_id": "Q1"
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `question_id`: a `string` feature.
- `question`: a `string` feature.
- `document_title`: a `string` feature.
- `answer`: a `string` feature.
- `label`: a classification label, with possible values including `0` (0), `1` (1).

### Data Splits

| name    | train | validation | test |
|---------|------:|-----------:|-----:|
| default | 20360 |       2733 | 6165 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

MICROSOFT RESEARCH DATA LICENSE AGREEMENT FOR MICROSOFT RESEARCH WIKIQA CORPUS

These license terms are an agreement between Microsoft Corporation (or based on where you live, one of its affiliates) and you. Please read them. They apply to the data associated with this license above, which includes the media on which you received it, if any. The terms also apply to any Microsoft:

- updates,
- supplements,
- Internet-based services, and
- support services

for this data, unless other terms accompany those items. If so, those terms apply.

BY USING THE DATA, YOU ACCEPT THESE TERMS. IF YOU DO NOT ACCEPT THEM, DO NOT USE THE DATA.

If you comply with these license terms, you have the rights below.

1. SCOPE OF LICENSE.

a. You may use, copy, modify, create derivative works, and distribute the Dataset:

i. for research and technology development purposes only.
Examples of research and technology development uses are teaching, academic research, public demonstrations and experimentation; and

ii. to publish (or present papers or articles) on your results from using such Dataset.

b. The data is licensed, not sold. This agreement only gives you some rights to use the data. Microsoft reserves all other rights. Unless applicable law gives you more rights despite this limitation, you may use the data only as expressly permitted in this agreement. In doing so, you must comply with any technical limitations in the data that only allow you to use it in certain ways. You may not

- work around any technical limitations in the data;
- reverse engineer, decompile or disassemble the data, except and only to the extent that applicable law expressly permits, despite this limitation;
- rent, lease or lend the data;
- transfer the data or this agreement to any third party; or
- use the data directly in a commercial product without Microsoft’s permission.

2. DISTRIBUTION REQUIREMENTS:

a. If you distribute the Dataset or any derivative works of the Dataset, you will distribute them under the same terms and conditions as in this Agreement, and you will not grant other rights to the Dataset or derivative works that are different from those provided by this Agreement.

b. If you have created derivative works of the Dataset, and distribute such derivative works, you will cause the modified files to carry prominent notices so that recipients know that they are not receiving the original Dataset. Such notices must state: (i) that you have changed the Dataset; and (ii) the date of any changes.

3. DISTRIBUTION RESTRICTIONS. You may not: (a) alter any copyright, trademark or patent notice in the Dataset; (b) use Microsoft’s trademarks in a way that suggests your derivative works or modifications come from or are endorsed by Microsoft; (c) include the Dataset in malicious, deceptive or unlawful programs.

4. OWNERSHIP.
Microsoft retains all right, title, and interest in and to any Dataset provided to you under this Agreement. You acquire no interest in the Dataset you may receive under the terms of this Agreement.

5. LICENSE TO MICROSOFT. Microsoft is granted back, without any restrictions or limitations, a non-exclusive, perpetual, irrevocable, royalty-free, assignable and sub-licensable license, to reproduce, publicly perform or display, use, modify, post, distribute, make and have made, sell and transfer your modifications to and/or derivative works of the Dataset, for any purpose.

6. FEEDBACK. If you give feedback about the Dataset to Microsoft, you give to Microsoft, without charge, the right to use, share and commercialize your feedback in any way and for any purpose. You also give to third parties, without charge, any patent rights needed for their products, technologies and services to use or interface with any specific parts of a Microsoft dataset or service that includes the feedback. You will not give feedback that is subject to a license that requires Microsoft to license its Dataset or documentation to third parties because we include your feedback in them. These rights survive this Agreement.

7. EXPORT RESTRICTIONS. The Dataset is subject to United States export laws and regulations. You must comply with all domestic and international export laws and regulations that apply to the Dataset. These laws include restrictions on destinations, end users and end use. For additional information, see www.microsoft.com/exporting.

8. ENTIRE AGREEMENT. This Agreement, and the terms for supplements, updates, Internet-based services and support services that you use, are the entire agreement for the Dataset.

9. SUPPORT SERVICES. Because this data is “as is,” we may not provide support services for it.

10. APPLICABLE LAW.

a. United States.
If you acquired the software in the United States, Washington state law governs the interpretation of this agreement and applies to claims for breach of it, regardless of conflict of laws principles. The laws of the state where you live govern all other claims, including claims under state consumer protection laws, unfair competition laws, and in tort.

b. Outside the United States. If you acquired the software in any other country, the laws of that country apply.

11. LEGAL EFFECT. This Agreement describes certain legal rights. You may have other rights under the laws of your country. You may also have rights with respect to the party from whom you acquired the Dataset. This Agreement does not change your rights under the laws of your country if the laws of your country do not permit it to do so.

12. DISCLAIMER OF WARRANTY. The Dataset is licensed “as-is.” You bear the risk of using it. Microsoft gives no express warranties, guarantees or conditions. You may have additional consumer rights or statutory guarantees under your local laws which this agreement cannot change. To the extent permitted under your local laws, Microsoft excludes the implied warranties of merchantability, fitness for a particular purpose and non-infringement.

13. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. YOU CAN RECOVER FROM MICROSOFT AND ITS SUPPLIERS ONLY DIRECT DAMAGES UP TO U.S. $5.00. YOU CANNOT RECOVER ANY OTHER DAMAGES, INCLUDING CONSEQUENTIAL, LOST PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES.

This limitation applies to

- anything related to the software, services, content (including code) on third party Internet sites, or third party programs; and
- claims for breach of contract, breach of warranty, guarantee or condition, strict liability, negligence, or other tort to the extent permitted by applicable law.

It also applies even if Microsoft knew or should have known about the possibility of the damages.
The above limitation or exclusion may not apply to you because your country may not allow the exclusion or limitation of incidental, consequential or other damages.

### Citation Information

```
@inproceedings{yang-etal-2015-wikiqa,
    title = "{W}iki{QA}: A Challenge Dataset for Open-Domain Question Answering",
    author = "Yang, Yi and Yih, Wen-tau and Meek, Christopher",
    booktitle = "Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing",
    month = sep,
    year = "2015",
    address = "Lisbon, Portugal",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D15-1237",
    doi = "10.18653/v1/D15-1237",
    pages = "2013--2018",
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun), [@thomwolf](https://github.com/thomwolf) for adding this dataset.

Provider: maas
Created: 2025-07-22