SQuAD2.0
Payititi (帕依提提), listed 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-177.html
Resource overview:
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
We employed crowd workers on the Daemo crowd-sourcing platform to write unanswerable questions. Each task consisted of an entire article from SQuAD 1.1. For each paragraph in the article, workers were asked to pose up to five questions that were impossible to answer based on the paragraph alone, while referencing entities in the paragraph and ensuring that a plausible answer is present. As inspiration, we also showed questions from SQuAD 1.1 for each paragraph; this further encouraged unanswerable questions to look similar to answerable ones.
We removed questions from workers who wrote 25 or fewer questions on that article; this filter helped remove noise from workers who had trouble understanding the task, and therefore quit before completing the whole article. We applied this filter to both our new data and the existing answerable questions from SQuAD 1.1.
To generate train, development, and test splits, we used the same partition of articles as SQuAD 1.1, and combined the existing data with our new data for each split. For the SQuAD 2.0 development and test sets, we removed articles for which we did not collect unanswerable questions. This resulted in a roughly one-to-one ratio of answerable to unanswerable questions in these splits, whereas the train data has roughly twice as many answerable questions as unanswerable ones.
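The worker filter described above (dropping questions from workers who wrote 25 or fewer questions on an article) can be sketched roughly as follows. This is only an illustration, not the authors' code: the record fields `worker_id` and `article_id` and the function name `filter_by_worker_output` are hypothetical, chosen for this example.

```python
from collections import Counter

# Threshold taken from the description: workers who wrote 25 or fewer
# questions on an article have their questions for that article removed.
MIN_QUESTIONS = 25


def filter_by_worker_output(questions, min_questions=MIN_QUESTIONS):
    """Keep questions whose author wrote more than `min_questions`
    questions on the same article.

    `questions` is a list of dicts with the hypothetical keys
    "worker_id" and "article_id" (assumed for this sketch).
    """
    # Count how many questions each worker wrote on each article.
    counts = Counter((q["worker_id"], q["article_id"]) for q in questions)
    return [
        q for q in questions
        if counts[(q["worker_id"], q["article_id"])] > min_questions
    ]
```

The count is keyed on the (worker, article) pair because the description applies the threshold per article, not per worker overall.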
To confirm that our dataset is clean, we hired additional crowd workers to answer all questions in the SQuAD 2.0 development and test sets. In each task, we showed workers an entire article from the dataset. For each paragraph, we showed all associated questions; unanswerable and answerable questions were shuffled together. For each question, workers were told to either highlight the answer in the paragraph, or mark it as unanswerable. Workers were told to expect every paragraph to have some answerable and some unanswerable questions. They were asked to spend one minute per question, and were paid $10.50 per hour.
To reduce crowd worker noise, we collected multiple human answers for each question and selected the final answer by majority vote, breaking ties in favor of answering questions and preferring shorter answers to longer ones. On average, we collected 4.8 answers per question.
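The majority-vote rule above leaves some details open, so the sketch below shows just one possible reading of it, not the authors' procedure. It assumes a two-step vote (first answerable versus unanswerable, then which answer text), represents "unanswerable" as `None`, and uses a hypothetical function name `select_final_answer`.

```python
from collections import Counter


def select_final_answer(responses):
    """Pick a final answer from several worker responses.

    `responses` is a list of answer strings, where None means the worker
    marked the question as unanswerable (a convention assumed here).
    """
    if not responses:
        raise ValueError("need at least one response")

    answered = [r for r in responses if r is not None]
    n_unanswerable = len(responses) - len(answered)

    # Step 1: answerable vs. unanswerable. A tie goes to answering.
    if n_unanswerable > len(answered):
        return None

    # Step 2: most frequent answer text. Ties go to the shorter text;
    # the text itself is the last key so the result is deterministic.
    counts = Counter(answered)
    return min(counts, key=lambda a: (-counts[a], len(a), a))
```

For example, with the responses `["in Paris", "Paris", None, None]` the answerable/unanswerable vote is tied, so the question is treated as answerable, and the shorter of the two equally frequent answers is chosen.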
Provided by:
Payititi (帕依提提)
Collected summary
Dataset introduction

Background and challenges
Background overview
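SQuAD2.0 is a reading comprehension dataset of questions and answers on Wikipedia articles. It combines answerable and unanswerable questions in order to test whether a system can determine if a question is answerable at all.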
The above content was collected and summarized by 遇见数据集.
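To check the mix of answerable and unanswerable questions in a split, a short counting routine like the one below may help. It is a sketch that assumes the nested JSON layout of the public SQuAD 2.0 release (`data` → `paragraphs` → `qas`, with a boolean `is_impossible` flag on each question); this page does not describe the file layout, so verify it against the file you download. The function name and the toy record are made up for illustration.

```python
def count_answerability(dataset):
    """Return (answerable, unanswerable) question counts.

    Assumes the layout data -> paragraphs -> qas, where each question
    carries an `is_impossible` flag (an assumption about the file format).
    """
    answerable = unanswerable = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Questions without the flag are treated as answerable.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


# Toy record invented for this example; it is not real SQuAD data.
toy = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "The river is 40 km long.",
            "qas": [
                {"id": "1", "question": "How long is the river?",
                 "is_impossible": False,
                 "answers": [{"text": "40 km", "answer_start": 13}]},
                {"id": "2", "question": "How deep is the river?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }]
}

answerable, unanswerable = count_answerability(toy)
print(answerable, unanswerable)
```

On a real file, the dictionary would come from `json.load` applied to the downloaded split rather than from an inline literal.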



