allenai/gooaq

Name: allenai/gooaq
Creator: allenai
Published: 2024-01-18 11:04:22
License: No description available

Source: Hugging Face. Updated 2024-01-18. Indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/allenai/gooaq

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- machine-generated
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- open-domain-qa
paperswithcode_id: gooaq
pretty_name: 'GooAQ: Open Question Answering with Diverse Answer Types'
dataset_info:
  features:
  - name: id
    dtype: int32
  - name: question
    dtype: string
  - name: short_answer
    dtype: string
  - name: answer
    dtype: string
  - name: answer_type
    dtype:
      class_label:
        names:
          '0': feat_snip
          '1': collection
          '2': knowledge
          '3': unit_conv
          '4': time_conv
          '5': curr_conv
  splits:
  - name: train
    num_bytes: 974320061
    num_examples: 3112679
  - name: validation
    num_bytes: 444553
    num_examples: 2500
  - name: test
    num_bytes: 445810
    num_examples: 2500
  download_size: 2111358901
  dataset_size: 975210424
---

# Dataset Card for GooAQ

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [GooAQ 🥑: Google Answers to Google Questions!](https://github.com/allenai/gooaq)
- **Repository:** [GooAQ 🥑: Google Answers to Google Questions!](https://github.com/allenai/gooaq)
- **Paper:** [GOOAQ: Open Question Answering with Diverse Answer Types](https://arxiv.org/abs/2104.08727)
- **Point of Contact:** [Daniel Khashabi](danielk@allenai.org)

### Dataset Summary

GooAQ is a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset contains samples in English only.

## Dataset Structure

### Data Instances

Each row of the data file should look like this:

```
{
    "id": 3339543,
    "question": "what is the difference between collagen and whey protein?",
    "short_answer": None,
    "answer": "The main differences between the amino acid profiles of whey and collagen are that whey contains all 9 essential amino acids, while collagen only has 8. ... Collagen is a fibrous protein found in the skin, cartilage, and bones of animals whereas whey comes from milk.",
    "answer_type": "feat_snip"
}
```

where the questions (`question`) are collected via Google auto-complete. The answer responses (`short_answer` and `answer`) were collected from Google's answer boxes. The answer types (`answer_type`) are inferred based on the HTML content of Google's response.
Here are the dominant types in the current dataset:

- `feat_snip`: explanatory responses; the majority of the questions/responses are of this type.
- `collection`: list responses (e.g., steps to accomplish something).
- `knowledge`: typically short responses for knowledge-seeking questions.
- `unit_conv`: questions about converting units.
- `time_conv`: questions about converting times.
- `curr_conv`: questions about converting currencies.

Dataset instances which are not part of the dominant types are marked with a -1 label.

### Data Fields

- `id`: an `int` feature.
- `question`: a `string` feature.
- `short_answer`: a `string` feature (could be None as well in some cases).
- `answer`: a `string` feature (could be None as well in some cases).
- `answer_type`: a `string` feature.

### Data Splits

The number of samples in the train/validation/test sets is given below:

| Split      | Number of samples |
|------------|-------------------|
| Train      | 3112679           |
| Validation | 2500              |
| Test       | 2500              |

## Dataset Creation

### Curation Rationale

While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. Many of the everyday questions that humans deal with and pose to search engines have a more diverse set of responses. Their answer can be a multi-sentence description (a snippet) (e.g., ‘what is’ or ‘can you’ questions), a collection of items such as ingredients (‘what are’, ‘things to’) or of steps towards a goal such as unlocking a phone (‘how to’), etc. Even when the answer is short, it can have richer types, e.g., unit conversion, time zone conversion, or various kinds of knowledge look-up (‘how much’, ‘when is’, etc.). Such answer type diversity is not represented in any existing dataset.

### Source Data

#### Initial Data Collection and Normalization

Constructing this dataset involved two main steps: extracting questions from search auto-complete and extracting answers from answer boxes.

1) Query Extraction: To extract a rich yet natural set of questions, the authors use Google auto-completion. They start with a seed set of question terms (e.g., “who”, “where”, etc.). They bootstrap based on this set, by repeatedly querying prefixes of previously extracted questions, in order to discover longer and richer sets of questions. Such questions extracted from the autocomplete algorithm are highly reflective of popular questions posed by users of Google. They filter out any questions shorter than 5 tokens as they are often incomplete questions. This process yields over 5M questions, which were collected over a span of 6 months. The average length of the questions is about 8 tokens.

2) Answer Extraction: They rely on the Google answer boxes shown on top of the search results when the questions are issued to Google. There are a variety of answer boxes. The most common kind involves highlighted sentences (extracted from various websites) that contain the answer to a given question. These form the snippet and collection answers in GOOAQ. In some cases, the answer box shows the answer directly, possibly in addition to the textual snippet. These form the short answers in GOOAQ. They first scrape the search results for all questions. This is the main extraction bottleneck, which was done over a span of 2 months. Subsequently, they extract answer strings from the HTML content of the search results. Answer types are also inferred at this stage, based on the HTML tags around the answer.

#### Who are the source language producers?

Answered above.

### Annotations

#### Annotation process

Answered in the section above.

#### Who are the annotators?

Since their task is focused on English, they required workers to be based in a country with a population predominantly of native English speakers (e.g., USA, Canada, UK, and Australia) and to have completed at least 5000 HITs with a ≥ 99% assignment approval rate. Additionally, they have a qualification test with half-a-dozen questions, all of which need to be answered correctly by the annotators.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

To prevent biased judgements, they also ask the annotators to avoid using Google search (which is what they used when mining GOOAQ) when annotating the quality of shown instances.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

List the people involved in collecting the dataset and their affiliation(s). If funding information is known, include it here.

### Licensing Information

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.

### Citation Information

```
@article{gooaq2021,
  title={GooAQ: Open Question Answering with Diverse Answer Types},
  author={Khashabi, Daniel and Ng, Amos and Khot, Tushar and Sabharwal, Ashish and Hajishirzi, Hannaneh and Callison-Burch, Chris},
  journal={arXiv preprint},
  year={2021}
}
```

### Contributions

Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) for adding this dataset.

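Provider:

allenai

Summary of the original information

Dataset overview

Dataset summary

GooAQ is a large-scale dataset with a variety of answer types. It contains over 5 million questions and 3 million answers collected from the Google search engine. GooAQ questions are collected semi-automatically using the search engine's autocomplete feature; they are of practical interest and expressed in simple language. GooAQ answers are mined from Google's responses to the collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, including textual answers (short and long) as well as more structured ones such as collections.

Supported tasks and leaderboards

[More information needed]

Languages

The dataset contains English samples only.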

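Dataset structure

Data instances

Each data row should look like this:
{ "id": 3339543, "question": "what is the difference between collagen and whey protein?", "short_answer": None, "answer": "The main differences between the amino acid profiles of whey and collagen are that whey contains all 9 essential amino acids, while collagen only has 8. ... Collagen is a fibrous protein found in the skin, cartilage, and bones of animals whereas whey comes from milk.", "answer_type": "feat_snip" }

The questions are collected via Google autocomplete. The answer responses (short_answer and answer) are collected from Google's answer boxes. The answer types (answer_type) are inferred from the HTML content of Google's response. The dominant types in the current dataset are:

feat_snip: explanatory responses; the majority of the questions/responses are of this type.
collection: list responses (e.g., steps to accomplish something).
knowledge: typically short responses to knowledge-seeking questions.
unit_conv: questions about converting units.
time_conv: questions about converting times.
curr_conv: questions about converting currencies.

Data instances that are not part of the dominant types are marked with a -1 label.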
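
The answer types above appear in the card's metadata header as class labels '0' through '5', with -1 reserved for everything else. A minimal sketch of converting between label ids and names follows. The helper names and the "other" placeholder are assumptions made for illustration and are not part of the dataset.

```python
# Label names in the order given by the card's metadata ('0' .. '5').
ANSWER_TYPE_NAMES = [
    "feat_snip",   # 0: explanatory responses
    "collection",  # 1: list responses
    "knowledge",   # 2: short knowledge look-ups
    "unit_conv",   # 3: unit conversions
    "time_conv",   # 4: time conversions
    "curr_conv",   # 5: currency conversions
]

# The card marks instances outside the dominant types with -1.
OTHER_LABEL = -1
OTHER_NAME = "other"  # hypothetical display name, not defined by the dataset


def answer_type_name(label: int) -> str:
    """Map a label id to its name; -1 maps to the placeholder name."""
    if label == OTHER_LABEL:
        return OTHER_NAME
    if 0 <= label < len(ANSWER_TYPE_NAMES):
        return ANSWER_TYPE_NAMES[label]
    raise ValueError(f"unknown answer_type label: {label}")


def answer_type_id(name: str) -> int:
    """Map a type name to its label id; unknown names map to -1."""
    try:
        return ANSWER_TYPE_NAMES.index(name)
    except ValueError:
        return OTHER_LABEL
```

Keeping the names in a list indexed by label id mirrors how the metadata enumerates them, so the two directions of the mapping cannot drift apart.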

Data fields

id: an int feature.
question: a string feature.
short_answer: a string feature (may be None in some cases).
answer: a string feature (may be None in some cases).
answer_type: a string feature.
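
The field list can be mirrored by a small typed record. This is a sketch only: the class name `GooAQRecord` and the helpers `parse_record` and `best_answer` are hypothetical, and the sample row is abbreviated from the instance shown earlier.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class GooAQRecord:
    """One row, following the fields described in the card."""
    id: int
    question: str
    short_answer: Optional[str]  # may be None
    answer: Optional[str]        # may be None
    answer_type: str


def parse_record(row: Mapping[str, Any]) -> GooAQRecord:
    """Build a record from a dict-like row; missing answers become None."""
    return GooAQRecord(
        id=int(row["id"]),
        question=row["question"],
        short_answer=row.get("short_answer"),
        answer=row.get("answer"),
        answer_type=row["answer_type"],
    )


def best_answer(record: GooAQRecord) -> Optional[str]:
    """Prefer the short answer and fall back to the long one."""
    if record.short_answer is not None:
        return record.short_answer
    return record.answer


# Sample row abbreviated from the card's example instance.
example = parse_record({
    "id": 3339543,
    "question": "what is the difference between collagen and whey protein?",
    "short_answer": None,
    "answer": "The main differences between the amino acid profiles of whey "
              "and collagen are that whey contains all 9 essential amino "
              "acids, while collagen only has 8.",
    "answer_type": "feat_snip",
})
```

Because both answer fields may be None, downstream code has to handle a row with no usable answer; `best_answer` returns None in that case rather than raising.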

Data splits

The number of samples in the train/validation/test sets is as follows:

Split	Number of samples
Train	3112679
Validation	2500
Test	2500
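
As a quick sanity check, the split sizes can be added up and compared with the totals in the card's metadata header. The sketch below uses only the numbers from this card (the byte counts come from the `splits` and `dataset_size` entries); the variable names are illustrative.

```python
# Per-split figures copied from the card's metadata.
SPLITS = {
    "train": {"num_examples": 3112679, "num_bytes": 974320061},
    "validation": {"num_examples": 2500, "num_bytes": 444553},
    "test": {"num_examples": 2500, "num_bytes": 445810},
}
DATASET_SIZE_BYTES = 975210424  # dataset_size from the metadata

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# Share of all examples that falls into each split.
fractions = {
    name: s["num_examples"] / total_examples for name, s in SPLITS.items()
}

if __name__ == "__main__":
    print(f"total examples: {total_examples}")
    print(f"bytes match metadata: {total_bytes == DATASET_SIZE_BYTES}")
    for name, frac in fractions.items():
        print(f"{name}: {frac:.4%}")
```

The validation and test sets together hold 5,000 of the 3,117,679 examples, roughly 0.16% of the data, so nearly everything is available for training.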

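Dataset creation

Curation rationale

Day-to-day questions come with a variety of answer types, but the current question-answering (QA) literature has not adequately addressed this answer diversity. Many everyday questions have more diverse kinds of responses, such as a multi-sentence description (a snippet), a collection of items, or a sequence of steps. This diversity of answer types is not adequately represented in existing datasets.

Source data

Initial data collection and normalization

Constructing this dataset involved two main steps: extracting questions from search autocomplete and extracting answers from answer boxes.

Query extraction: Google autocomplete is used to extract a rich yet natural set of questions. Starting from a seed set of question terms (e.g., "who", "where"), prefixes of previously extracted questions are queried repeatedly to discover longer and richer sets of questions. Questions shorter than 5 tokens are filtered out because they are often incomplete. This process yields over 5 million questions with an average length of about 8 tokens.
Answer extraction: This step relies on the answer boxes that Google shows at the top of the search results. The most common kind involves highlighted sentences (extracted from various websites) that contain the answer; these form the snippet and collection answers in GooAQ. In some cases the answer box shows the answer directly, possibly alongside a textual snippet; these form the short answers in GooAQ.

The search results for all questions are scraped first; this is the main extraction bottleneck and took 2 months. Answer strings are then extracted from the HTML content of the search results, and answer types are inferred at this stage from the HTML tags around the answer.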
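
The query-extraction step described above can be sketched as a breadth-first bootstrap over prefixes. This is an illustration under stated assumptions and not the authors' code: `autocomplete` stands in for a real suggestion service, the stub `fake_autocomplete` and its tiny corpus are invented for the demo, whitespace splitting is assumed as the tokenizer, and `max_queries` is a hypothetical safety cap.

```python
from collections import deque
from typing import Callable, Iterable, List, Set

MIN_TOKENS = 5  # the card says questions shorter than 5 tokens are dropped


def bootstrap_questions(
    seeds: Iterable[str],
    autocomplete: Callable[[str], List[str]],
    max_queries: int = 1000,
) -> Set[str]:
    """Grow a question set by repeatedly querying prefixes of suggestions."""
    queue = deque(seeds)
    seen_prefixes: Set[str] = set(queue)
    questions: Set[str] = set()
    queries = 0

    while queue and queries < max_queries:
        prefix = queue.popleft()
        queries += 1
        for suggestion in autocomplete(prefix):
            tokens = suggestion.split()  # assumed whitespace tokenization
            if len(tokens) >= MIN_TOKENS:
                questions.add(suggestion)
            # Enqueue every word-level prefix not queried before.
            for n in range(1, len(tokens) + 1):
                candidate = " ".join(tokens[:n])
                if candidate not in seen_prefixes:
                    seen_prefixes.add(candidate)
                    queue.append(candidate)
    return questions


# Invented corpus standing in for a real autocomplete backend.
FAKE_SUGGESTIONS = [
    "who invented the telephone first",
    "who invented the light bulb",
    "where is the eiffel tower located",
    "where is paris",
]


def fake_autocomplete(prefix: str) -> List[str]:
    """Return the corpus entries that start with the given prefix."""
    return [q for q in FAKE_SUGGESTIONS if q.startswith(prefix)]


found = bootstrap_questions(["who", "where"], fake_autocomplete)
```

With the stub corpus, "where is paris" is discarded by the length filter while the three longer questions are kept. Tracking already-queried prefixes keeps the loop from issuing the same query twice, which matters when each query is a network request.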

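Annotations

Annotation process

The annotation process is described in the source data section above.

Annotators

Because the task focuses on English, annotators were required to be based in a country where English is the native language (e.g., USA, Canada, UK, and Australia) and to have completed at least 5000 HITs with an approval rate of ≥ 99%. In addition, annotators had to pass a qualification test of six questions, all of which had to be answered correctly.

Personal and sensitive information

[More information needed]

Considerations for using the data

Social impact of the dataset

[More information needed]

Discussion of biases

To prevent biased judgements, annotators are asked to avoid using Google search (the tool used to mine GooAQ) when annotating the quality of the shown instances.

Other known limitations

[More information needed]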

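Additional information

Dataset curators

[Curator information needed]

Licensing information

Licensed under the Apache License 2.0 (the "License"); you may not use this file except in compliance with the License.

Citation information

@article{gooaq2021, title={GooAQ: Open Question Answering with Diverse Answer Types}, author={Khashabi, Daniel and Ng, Amos and Khot, Tushar and Sabharwal, Ashish and Hajishirzi, Hannaneh and Callison-Burch, Chris}, journal={arXiv preprint}, year={2021} }

Contributions

Thanks to @bhavitvyamalik for adding this dataset.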
