
google-research-datasets/nq_open | Open-Domain QA Dataset | NLP Dataset

Source: Hugging Face · Updated 2024-03-22 · Indexed 2024-06-15
Tags: Open-Domain Question Answering, Natural Language Processing
Download link: https://hf-mirror.com/datasets/google-research-datasets/nq_open
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- other
language:
- en
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|natural_questions
task_categories:
- question-answering
task_ids:
- open-domain-qa
pretty_name: NQ-Open
dataset_info:
  config_name: nq_open
  features:
  - name: question
    dtype: string
  - name: answer
    sequence: string
  splits:
  - name: train
    num_bytes: 6651236
    num_examples: 87925
  - name: validation
    num_bytes: 313829
    num_examples: 3610
  download_size: 4678245
  dataset_size: 6965065
configs:
- config_name: nq_open
  data_files:
  - split: train
    path: nq_open/train-*
  - split: validation
    path: nq_open/validation-*
  default: true
---

# Dataset Card for nq_open

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://efficientqa.github.io/
- **Repository:** https://github.com/google-research-datasets/natural-questions/tree/master/nq_open
- **Paper:** https://www.aclweb.org/anthology/P19-1612.pdf
- **Leaderboard:** https://ai.google.com/research/NaturalQuestions/efficientqa
- **Point of Contact:** [Mailing List](efficientqa@googlegroups.com)

### Dataset Summary

The NQ-Open task, introduced by Lee et al. (2019), is an open-domain question answering benchmark derived from Natural Questions. The goal is to predict an English answer string for an input English question. All questions can be answered using the contents of English Wikipedia.

### Supported Tasks and Leaderboards

Open-Domain Question Answering, EfficientQA Leaderboard: https://ai.google.com/research/NaturalQuestions/efficientqa

### Languages

English (`en`)

## Dataset Structure

### Data Instances

```
{
  "question": "names of the metropolitan municipalities in south africa",
  "answer": [
    "Mangaung Metropolitan Municipality",
    "Nelson Mandela Bay Metropolitan Municipality",
    "eThekwini Metropolitan Municipality",
    "City of Tshwane Metropolitan Municipality",
    "City of Johannesburg Metropolitan Municipality",
    "Buffalo City Metropolitan Municipality",
    "City of Ekurhuleni Metropolitan Municipality"
  ]
}
```

### Data Fields

- `question` - Input open-domain question.
- `answer` - List of possible answers to the question.

### Data Splits

- Train: 87925
- Validation: 3610

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

Natural Questions contains questions from aggregated queries to Google Search (Kwiatkowski et al., 2019). To gather an open version of this dataset, we only keep questions with short answers and discard the given evidence document. Answers with many tokens often resemble extractive snippets rather than canonical answers, so we discard answers with more than 5 tokens.

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

Evaluating on this diverse set of question-answer pairs is crucial, because all existing datasets have inherent biases that are problematic for open-domain QA systems with learned retrieval. In the Natural Questions dataset the question askers do not already know the answer. This accurately reflects a distribution of genuine information-seeking questions. However, annotators must separately find correct answers, which requires assistance from automatic tools and can introduce a moderate bias towards results from the tool.

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

All of the Natural Questions data is released under the [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

### Citation Information

```
@article{doi:10.1162/tacl\_a\_00276,
    author = {Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
    title = {Natural Questions: A Benchmark for Question Answering Research},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {7},
    pages = {453-466},
    year = {2019},
    doi = {10.1162/tacl\_a\_00276},
    URL = {https://doi.org/10.1162/tacl_a_00276},
    abstract = {We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.}
}

@inproceedings{lee-etal-2019-latent,
    title = "Latent Retrieval for Weakly Supervised Open Domain Question Answering",
    author = "Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1612",
    doi = "10.18653/v1/P19-1612",
    pages = "6086--6096",
    abstract = "Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.",
}
```

### Contributions

Thanks to [@Nilanshrajput](https://github.com/Nilanshrajput) for adding this dataset.
Provided by:
google-research-datasets
Original Information Summary

Dataset Overview

Basic Information

  • Dataset name: NQ-Open
  • Language: English (en)
  • License: CC BY-SA 3.0
  • Multilinguality: monolingual
  • Dataset size: 10K<n<100K
  • Source dataset: extended from Natural Questions

Task Type

  • Task category: Question answering
  • Task ID: Open-domain QA

Dataset Structure

  • Config name: nq_open
  • Features:
    • question: the question, of type string
    • answer: the answers, a sequence of strings
  • Splits:
    • Train: 87925 examples, 6651236 bytes
    • Validation: 3610 examples, 313829 bytes
  • Download size: 4678245 bytes
  • Dataset size: 6965065 bytes

Data Files

  • Config name: nq_open
  • Data files:
    • Train: nq_open/train-*
    • Validation: nq_open/validation-*

Data Instance

```json
{
  "question": "names of the metropolitan municipalities in south africa",
  "answer": [
    "Mangaung Metropolitan Municipality",
    "Nelson Mandela Bay Metropolitan Municipality",
    "eThekwini Metropolitan Municipality",
    "City of Tshwane Metropolitan Municipality",
    "City of Johannesburg Metropolitan Municipality",
    "Buffalo City Metropolitan Municipality",
    "City of Ekurhuleni Metropolitan Municipality"
  ]
}
```

Data Fields

  • question: the input open-domain question
  • answer: a list of possible answers to the question

Data Splits

  • Train: 87925 examples
  • Validation: 3610 examples
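Since each record pairs one question with a list of acceptable answers, NQ-Open predictions are conventionally scored by exact match against any listed answer after light normalization. The SQuAD-style `normalize` below is a common choice, shown as a sketch rather than the official scorer:

```python
import re
import string

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, drop English articles, collapse whitespace.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, answers: list) -> bool:
    # A prediction is correct if it matches any reference answer
    # after normalization.
    pred = normalize(prediction)
    return any(pred == normalize(ans) for ans in answers)
```

Because each question carries several acceptable surface forms, matching against the whole answer list (rather than a single gold string) is what makes the metric usable for open-domain evaluation.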
AI-Compiled Summary
Dataset Introduction
Construction
The NQ-Open dataset is derived from Natural Questions, filtered and processed to focus on open-domain question answering. During construction, only questions with short answers were kept, and the evidence documents in the original data were discarded. To keep answers concise and canonical, entries whose answers exceed five tokens were dropped. The result is intended as a high-quality open-domain QA benchmark for evaluating question answering systems.
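The short-answer filter described above can be sketched as follows (an illustrative reconstruction; whitespace tokenization and the `keep_answer` helper are assumptions, as the original pipeline used its own tokenizer):

```python
MAX_ANSWER_TOKENS = 5  # longer answers resemble extractive snippets

def keep_answer(answer: str) -> bool:
    # Keep only short, canonical answers; whitespace tokenization is an
    # illustrative simplification of the original tokenizer.
    return len(answer.split()) <= MAX_ANSWER_TOKENS

def filter_examples(examples):
    # Drop any example whose answer list has no sufficiently short answer.
    return [ex for ex in examples
            if any(keep_answer(a) for a in ex["answer"])]
```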
Characteristics
The defining characteristic of NQ-Open is its open-domain setting: every question can be answered from the contents of English Wikipedia. The dataset contains 87925 training examples and 3610 validation examples, with a simple structure of two main fields, question and answer. It is released under the CC BY-SA 3.0 license, allowing broad use and sharing.
Usage
NQ-Open is suited to research and development on open-domain question answering, and in particular to training and evaluating QA models. Users can load the question and answer fields to build and optimize QA systems. Its simple structure and large scale make it a solid choice for both research and practice, especially in settings that require handling open-domain questions.
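As a minimal sketch of the loading step, assuming the Hugging Face `datasets` library is installed and network access to the Hub is available (`load_nq_open` and `format_example` are illustrative helpers, not an official API):

```python
def load_nq_open():
    # Requires the `datasets` library and network access to the
    # Hugging Face Hub; returns a DatasetDict with "train" and
    # "validation" splits.
    from datasets import load_dataset
    return load_dataset("google-research-datasets/nq_open")

def format_example(example):
    # Each record is a dict with a string question and a list of
    # acceptable answer strings.
    return f"Q: {example['question']} -> A: {' | '.join(example['answer'])}"

# Example usage (commented out to avoid a network call here):
# ds = load_nq_open()
# print(format_example(ds["validation"][0]))
```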
Background and Challenges
Background
The NQ-Open dataset was introduced by Lee et al. in 2019 as a benchmark for open-domain question answering (QA), derived from the Natural Questions dataset. Its core task is to predict an English answer string for an input English question, where every question can be answered from the contents of English Wikipedia. NQ-Open was created to advance open-domain QA systems, providing a high-quality benchmark for genuinely information-seeking questions in particular. Its release has had a lasting influence on QA research, especially for evaluating how systems handle a diverse range of questions.
Current Challenges
The construction of NQ-Open faced several challenges. First, the data originates from aggregated Google Search queries, which requires filtering and normalizing the raw data to ensure the quality of questions and answers. Second, given the complexity of open-domain QA systems, effectively handling and mitigating bias in the data is an important concern. Finally, although the dataset provides high-quality training and validation samples, deployed systems must still cope with the diversity and complexity of real user questions, which places higher demands on a model's generalization ability.
Common Scenarios
Classic Use Cases
NQ-Open has classic applications in open-domain question answering (ODQA). By pairing a large number of English questions with short answers, it gives researchers a rich resource for building and evaluating open-domain QA systems. Researchers can train models on the dataset to automatically retrieve evidence from English Wikipedia and produce accurate answers, addressing the practical problems users face during information retrieval.
Research Problems Addressed
NQ-Open addresses common research problems in open-domain QA, particularly answer generation and retrieval for genuinely information-seeking questions. By providing high-quality question-answer pairs, the dataset helps researchers move past the limitations of traditional information retrieval systems (such as BM25) on complex questions, advancing learned retrieval methods. Its significance lies in offering a more realistic and diverse evaluation benchmark for open-domain QA, driving technical progress in the field.
Derived Work
The release of NQ-Open has spawned a line of related work, especially on retrieval and reading comprehension for open-domain QA. For example, "Latent Retrieval for Weakly Supervised Open Domain Question Answering" by Lee et al. jointly learns the retriever and the reader, substantially improving open-domain QA performance. The dataset also serves as a benchmark for other researchers, encouraging the development of further learned retrieval methods and multi-task models.
The above content was compiled and summarized by AI.