din0s/asqa

Name: din0s/asqa
Creator: din0s
Published: 2022-09-20 16:14:54
License: Apache-2.0

Source: Hugging Face. Updated: 2022-09-20. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/din0s/asqa


Resource description:

---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- expert-generated
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: ASQA
size_categories:
- 1K<n<10K
source_datasets:
- extended|ambig_qa
tags:
- factoid questions
- long-form answers
task_categories:
- question-answering
task_ids:
- open-domain-qa
---

# Dataset Card for ASQA

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Additional Information](#additional-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** https://github.com/google-research/language/tree/master/language/asqa
- **Paper:** https://arxiv.org/abs/2204.06092
- **Leaderboard:** https://ambigqa.github.io/asqa_leaderboard.html

### Dataset Summary

ASQA is the first long-form question answering dataset that focuses on ambiguous factoid questions. Different from previous long-form answers datasets, each question is annotated with both long-form answers and extractive question-answer pairs, which should be answerable by the generated passage. A generated long-form answer will be evaluated using both ROUGE and QA accuracy. In the paper, we show that these evaluation metrics are well-correlated with human judgments.

### Supported Tasks and Leaderboards

Long-form Question Answering. [Leaderboard](https://ambigqa.github.io/asqa_leaderboard.html)

### Languages

- English

## Dataset Structure

### Data Instances

```py
{
    "ambiguous_question": "Where does the civil liberties act place the blame for the internment of u.s. citizens?",
    "qa_pairs": [
        {
            "context": "No context provided",
            "question": "Where does the civil liberties act place the blame for the internment of u.s. citizens by apologizing on behalf of them?",
            "short_answers": [
                "the people of the United States"
            ],
            "wikipage": None
        },
        {
            "context": "No context provided",
            "question": "Where does the civil liberties act place the blame for the internment of u.s. citizens by making them pay reparations?",
            "short_answers": [
                "United States government"
            ],
            "wikipage": None
        }
    ],
    "wikipages": [
        {
            "title": "Civil Liberties Act of 1988",
            "url": "https://en.wikipedia.org/wiki/Civil%20Liberties%20Act%20of%201988"
        }
    ],
    "annotations": [
        {
            "knowledge": [
                {
                    "content": "The Civil Liberties Act of 1988 (Pub.L. 100–383, title I, August 10, 1988, 102 Stat. 904, 50a U.S.C. § 1989b et seq.) is a United States federal law that granted reparations to Japanese Americans who had been interned by the United States government during World War II.",
                    "wikipage": "Civil Liberties Act of 1988"
                }
            ],
            "long_answer": "The Civil Liberties Act of 1988 is a United States federal law that granted reparations to Japanese Americans who had been interned by the United States government during World War II. In the act, the blame for the internment of U.S. citizens was placed on the people of the United States, by apologizing on behalf of them. Furthermore, the blame for the internment was placed on the United States government, by making them pay reparations."
        }
    ],
    "sample_id": -4557617869928758000
}
```

### Data Fields

- `ambiguous_question`: ambiguous question from AmbigQA.
- `annotations`: long-form answers to the ambiguous question constructed by ASQA annotators.
- `annotations/knowledge`: list of additional knowledge pieces.
- `annotations/knowledge/content`: a passage from Wikipedia.
- `annotations/knowledge/wikipage`: title of the Wikipedia page the passage was taken from.
- `annotations/long_answer`: annotation.
- `qa_pairs`: Q&A pairs from AmbigQA which are used for disambiguation.
- `qa_pairs/context`: additional context provided.
- `qa_pairs/question`: disambiguated question from AmbigQA.
- `qa_pairs/short_answers`: list of short answers from AmbigQA.
- `qa_pairs/wikipage`: title of the Wikipedia page the additional context was taken from.
- `sample_id`: the unique id of the sample
- `wikipages`: list of Wikipedia pages visited by AmbigQA annotators.
- `wikipages/title`: title of the Wikipedia page.
- `wikipages/url`: link to the Wikipedia page.

### Data Splits

| **Split** | **Instances** |
|-----------|---------------|
| Train     | 4353          |
| Dev       | 948           |

## Additional Information

### Contributions

Thanks to [@din0s](https://github.com/din0s) for adding this dataset.


Provider:

din0s

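Summary of Original Information

Dataset Overview

Dataset Name

Name: ASQA
Alias: none

Basic Information

Language: English
License: Apache-2.0
Multilinguality: monolingual
Size: 1K<n<10K
Source dataset: extended from AmbigQA (ambig_qa)
Tags: factoid questions, long-form answers
Task category: question answering
Task ID: open-domain QA (open-domain-qa)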

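Dataset Description

Summary: ASQA is the first long-form question answering dataset focused on ambiguous factoid questions. Unlike earlier long-form answer datasets, each question is annotated with both long-form answers and extractive question-answer pairs that a generated passage should be able to answer. Long-form answers are evaluated with both ROUGE and QA accuracy, and the paper reports that these metrics correlate well with human judgments.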
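The ROUGE side of the evaluation can be illustrated with a minimal sketch. It computes a simplified ROUGE-L F1 (longest common subsequence over lowercased, whitespace-split tokens); this is an assumption-laden illustration, not the official ASQA evaluation script, which may tokenize, stem, and aggregate over references differently. The function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1; tokenization here is plain whitespace splitting (an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A generated answer would be scored against each `long_answer` in a record's `annotations`; how scores are combined across several references is left open here.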

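Supported Tasks and Leaderboards

Task: long-form question answering
Leaderboard: ASQA Leaderboard (https://ambigqa.github.io/asqa_leaderboard.html)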

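Dataset Structure

Data instances: each contains an ambiguous question, Q&A pairs, long-form answers, and related fields.
Data fields: include ambiguous_question, annotations, qa_pairs, and others.
Data splits: 4353 training instances and 948 development instances.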
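The record layout summarized above can be walked with a few lines of Python. The sketch below uses a trimmed copy of the data instance from the dataset card and assumes records arrive as plain nested dicts and lists in that layout; the helper names `disambiguations` and `long_answers` are hypothetical.

```python
from typing import Any

# Trimmed copy of the data instance shown in the dataset card.
record: dict[str, Any] = {
    "ambiguous_question": "Where does the civil liberties act place the blame "
                          "for the internment of u.s. citizens?",
    "qa_pairs": [
        {
            "context": "No context provided",
            "question": "Where does the civil liberties act place the blame for the "
                        "internment of u.s. citizens by apologizing on behalf of them?",
            "short_answers": ["the people of the United States"],
            "wikipage": None,
        },
        {
            "context": "No context provided",
            "question": "Where does the civil liberties act place the blame for the "
                        "internment of u.s. citizens by making them pay reparations?",
            "short_answers": ["United States government"],
            "wikipage": None,
        },
    ],
    "annotations": [
        {
            "knowledge": [],  # knowledge passages omitted in this trimmed copy
            "long_answer": "The Civil Liberties Act of 1988 is a United States federal "
                           "law that granted reparations to Japanese Americans who had "
                           "been interned by the United States government during World "
                           "War II.",
        }
    ],
}


def disambiguations(rec: dict[str, Any]) -> list[tuple[str, list[str]]]:
    """One (disambiguated question, short answers) pair per reading of the question."""
    return [(qa["question"], list(qa["short_answers"])) for qa in rec["qa_pairs"]]


def long_answers(rec: dict[str, Any]) -> list[str]:
    """All reference long-form answers written for the ambiguous question."""
    return [ann["long_answer"] for ann in rec["annotations"]]


for question, answers in disambiguations(record):
    print(question, "->", answers)
```

If a loader returns nested fields in a different shape (for example column-wise), the two helpers would need to be adapted accordingly.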

Additional Information

Contributor: @din0s

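Collected Summary

Dataset Introduction

Construction

ASQA extends AmbigQA. Each ambiguous question keeps the disambiguated question-answer pairs from AmbigQA and is additionally annotated with long-form answers. The dataset targets ambiguous factoid questions and uses the long-form answers together with the short-answer pairs to assess answer quality. According to the dataset card, the annotations were crowdsourced: ASQA annotators wrote the long-form answers and supplied additional knowledge passages from Wikipedia to support them.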

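Features

ASQA focuses on ambiguous factoid questions and evaluates generated answers with a combination of ROUGE and QA accuracy. Its question-answer pairs are meant to test whether a model can produce a coherent long-form answer that still covers each disambiguated reading accurately. The dataset is monolingual and available only in English.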
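The idea that a long-form answer should cover every disambiguated reading can be sketched with a simple string-match check. This is only a rough proxy and an assumption of this example: it is not the QA-accuracy metric used in the paper, and the function name `answer_coverage` is hypothetical.

```python
def answer_coverage(long_answer: str, qa_pairs: list[dict]) -> float:
    """Fraction of QA pairs for which at least one short answer appears in the text.

    Matching is case-insensitive substring search, a deliberate simplification.
    """
    if not qa_pairs:
        return 0.0
    text = long_answer.lower()
    hits = sum(
        any(answer.lower() in text for answer in qa["short_answers"])
        for qa in qa_pairs
    )
    return hits / len(qa_pairs)


# Short answers and long answer taken from the dataset card's example instance.
qa_pairs = [
    {"short_answers": ["the people of the United States"]},
    {"short_answers": ["United States government"]},
]
long_answer = (
    "In the act, the blame for the internment of U.S. citizens was placed on "
    "the people of the United States, by apologizing on behalf of them. "
    "Furthermore, the blame for the internment was placed on the United States "
    "government, by making them pay reparations."
)
print(answer_coverage(long_answer, qa_pairs))
```

Substring matching misses paraphrased answers and can over-credit incidental matches, so it is best read as a quick sanity check on generated answers rather than a replacement for the reported metrics.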

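Usage

Each data instance gives access to the ambiguous question, the long-form answers, the extractive question-answer pairs, and the related Wikipedia pages. The dataset is split into a training set and a development set for model training and evaluation. It can be loaded directly with the Hugging Face datasets library, and its fields can then be used to train and test models on long-form question answering.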
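A minimal loading sketch follows, assuming the `datasets` package is installed and the Hub is reachable. The repository id `din0s/asqa` comes from this page; the split keys `train` and `dev` are assumed from the card's split table and may be named differently in practice. The helper names are hypothetical.

```python
from collections.abc import Mapping, Sized

# Split sizes listed in the dataset card; the lowercase keys are assumed split names.
CARD_SPLIT_SIZES = {"train": 4353, "dev": 948}


def load_asqa(repo_id: str = "din0s/asqa"):
    """Download the dataset from the Hugging Face Hub (network access required)."""
    from datasets import load_dataset  # lazy import so the helpers below work offline

    return load_dataset(repo_id)


def split_sizes(splits: Mapping[str, Sized]) -> dict[str, int]:
    """Number of rows per split for any mapping of split name to a sized collection."""
    return {name: len(rows) for name, rows in splits.items()}


def mismatches(actual: dict[str, int], expected: dict[str, int] = CARD_SPLIT_SIZES):
    """Splits whose size differs from the card, as name -> (actual, expected)."""
    return {
        name: (actual.get(name), size)
        for name, size in expected.items()
        if actual.get(name) != size
    }
```

Calling `mismatches(split_sizes(load_asqa()))` after a download would return an empty dict if the loaded splits match the card, which makes for a cheap integrity check before training.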

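Background and Challenges

Background

ASQA (Answer Summaries for Questions which are Ambiguous) was introduced by a Google Research team in 2022. It focuses on ambiguous factoid questions and pairs each one with long-form answers and with extractive question-answer pairs that a generated passage should be able to answer, giving a more complete treatment of open-domain question answering. ASQA fills a gap among long-form answer datasets, which had not addressed ambiguous questions, and the paper reports that its evaluation metrics correlate well with human judgments.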

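Current Challenges

The main challenges in building ASQA were identifying and annotating ambiguous factoid questions accurately and keeping the long-form answers accurate and relevant. Construction also required extracting useful information from large amounts of text efficiently and designing annotations that support evaluation with both ROUGE and QA accuracy. For the research community, handling ambiguous factoid questions raises new challenges for the accuracy and robustness of open-domain question answering systems.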

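Common Scenarios

Typical Use Case

Within long-form question answering, ASQA stands out for its focus on ambiguous factoid questions. Because it provides both long-form answers and extractive question-answer pairs that a generated passage should be able to answer, it serves as a resource for evaluating long-form answer quality. Researchers and developers typically use it to train and test models, with the aim of improving how well they understand ambiguous questions and generate long-form answers.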

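Derived Work

Work building on ASQA includes, among other topics, research on understanding ambiguous questions, long-form text generation, and the design of complex question answering systems.

Recent Research on the Dataset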