
nq_open

ModelScope community. Updated 2026-05-11; indexed 2024-05-15.
Download link: https://modelscope.cn/datasets/google-research-datasets/nq_open
# Dataset Card for nq_open

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://efficientqa.github.io/
- **Repository:** https://github.com/google-research-datasets/natural-questions/tree/master/nq_open
- **Paper:** https://www.aclweb.org/anthology/P19-1612.pdf
- **Leaderboard:** https://ai.google.com/research/NaturalQuestions/efficientqa
- **Point of Contact:** [Mailing List](mailto:efficientqa@googlegroups.com)

### Dataset Summary

The NQ-Open task, introduced by Lee et al. (2019), is an open domain question answering benchmark derived from Natural Questions. The goal is to predict an English answer string for an input English question. All questions can be answered using the contents of English Wikipedia.

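
Because the task is to predict an answer string, a prediction is usually scored by whether it matches a reference answer exactly (the Lee et al. abstract cited in this card reports results in exact match). The following is a minimal sketch of such a check, not the official evaluation: it assumes SQuAD-style normalization (lowercasing, removing punctuation and the articles a/an/the), and the names `normalize_answer` and `exact_match` are illustrative only. The official scorer may normalize differently.

```python
import re
import string
from typing import Iterable


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the card does not specify one.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Iterable[str]) -> bool:
    """Return True if the normalized prediction equals any normalized reference."""
    target = normalize_answer(prediction)
    return any(target == normalize_answer(ref) for ref in references)


# Reference answers taken from the data instance shown in this card.
references = [
    "Mangaung Metropolitan Municipality",
    "Buffalo City Metropolitan Municipality",
]

print(exact_match("the mangaung metropolitan municipality.", references))
print(exact_match("Cape Town", references))
```

A prediction counts as correct if it matches any one of the listed answers, which is why the reference side is a collection rather than a single string.
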
### Supported Tasks and Leaderboards

Open Domain Question-Answering, EfficientQA Leaderboard: https://ai.google.com/research/NaturalQuestions/efficientqa

### Languages

English (`en`)

## Dataset Structure

### Data Instances

```
{
  "question": "names of the metropolitan municipalities in south africa",
  "answer": [
    "Mangaung Metropolitan Municipality",
    "Nelson Mandela Bay Metropolitan Municipality",
    "eThekwini Metropolitan Municipality",
    "City of Tshwane Metropolitan Municipality",
    "City of Johannesburg Metropolitan Municipality",
    "Buffalo City Metropolitan Municipality",
    "City of Ekurhuleni Metropolitan Municipality"
  ]
}
```

### Data Fields

- `question` - Input open domain question.
- `answer` - List of possible answers to the question.

### Data Splits

- Train: 87925
- Validation: 3610

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

Natural Questions contains questions from aggregated queries to Google Search (Kwiatkowski et al., 2019). To gather an open version of this dataset, we only keep questions with short answers and discard the given evidence document. Answers with many tokens often resemble extractive snippets rather than canonical answers, so we discard answers with more than 5 tokens.

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

Evaluating on this diverse set of question-answer pairs is crucial, because all existing datasets have inherent biases that are problematic for open domain QA systems with learned retrieval. In the Natural Questions dataset the question askers do not already know the answer.
This accurately reflects a distribution of genuine information-seeking questions. However, annotators must separately find correct answers, which requires assistance from automatic tools and can introduce a moderate bias towards results from the tool.

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

All of the Natural Questions data is released under the [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

### Citation Information

```
@article{doi:10.1162/tacl_a_00276,
  author = {Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
  title = {Natural Questions: A Benchmark for Question Answering Research},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {7},
  number = {},
  pages = {453-466},
  year = {2019},
  doi = {10.1162/tacl_a_00276},
  URL = { https://doi.org/10.1162/tacl_a_00276 },
  eprint = { https://doi.org/10.1162/tacl_a_00276 },
  abstract = { We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature. }
}

@inproceedings{lee-etal-2019-latent,
  title = "Latent Retrieval for Weakly Supervised Open Domain Question Answering",
  author = "Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina",
  booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
  month = jul,
  year = "2019",
  address = "Florence, Italy",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/P19-1612",
  doi = "10.18653/v1/P19-1612",
  pages = "6086--6096",
  abstract = "Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.",
}
```

### Contributions

Thanks to [@Nilanshrajput](https://github.com/Nilanshrajput) for adding this dataset.
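
The length filter described under Initial Data Collection and Normalization (discard answers with more than 5 tokens) can be sketched roughly as below. This is an illustration under stated assumptions, not the original pipeline: the card does not say how tokens are counted, so whitespace splitting is assumed here, and dropping a question once all of its answers are filtered out is also an assumption. `MAX_ANSWER_TOKENS`, `keep_short_answers`, and `filter_record` are hypothetical names.

```python
from typing import Dict, List, Optional

MAX_ANSWER_TOKENS = 5  # threshold stated in the card


def keep_short_answers(answers: List[str], max_tokens: int = MAX_ANSWER_TOKENS) -> List[str]:
    """Keep answers with at most `max_tokens` tokens.

    Assumption: tokens are whitespace-separated words.
    """
    return [a for a in answers if len(a.split()) <= max_tokens]


def filter_record(record: Dict) -> Optional[Dict]:
    """Apply the length filter to one {"question", "answer"} record.

    Assumption: a record with no surviving answers is dropped (returns None).
    """
    kept = keep_short_answers(record["answer"])
    if not kept:
        return None
    return {"question": record["question"], "answer": kept}


# Two answers from the card's data instance plus one made-up long snippet.
record = {
    "question": "names of the metropolitan municipalities in south africa",
    "answer": [
        "Mangaung Metropolitan Municipality",
        "City of Johannesburg Metropolitan Municipality",
        "there are eight metropolitan municipalities in the country",  # hypothetical
    ],
}

print(filter_record(record))
```

Under whitespace tokenization, every answer in the card's own data instance has five tokens or fewer, which is consistent with the stated threshold.
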

Provider: maas
Created: 2025-07-07