google-research-datasets/disfl_qa

Name: google-research-datasets/disfl_qa
Creator: google-research-datasets
Published: 2024-08-08 06:10:54
License: CC BY 4.0

Source: Hugging Face. Updated 2024-08-08; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/google-research-datasets/disfl_qa

Overview:

DISFL-QA is a targeted dataset for contextual disfluencies in an information-seeking setting, namely question answering over Wikipedia passages. It builds on the SQuAD-v2 dataset: each question in the dev set is annotated to add a contextual disfluency, using the paragraph as a source of distractors. The final dataset consists of ~12k (disfluent question, answer) pairs. Over 90% of the disfluencies are corrections or restarts, which makes it a much harder test set for disfluency correction. DISFL-QA aims to fill a major gap between the speech and NLP research communities, and the authors hope it can serve as a benchmark for testing the robustness of models against disfluent inputs.
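One way to use the dataset as a robustness benchmark is to score the same model on the fluent and the disfluent version of each question and compare the two. The following is a minimal sketch under several assumptions: the source does not name an evaluation metric, so SQuAD-style exact match is used here as a stand-in; an empty prediction is treated as the correct output for a question with no gold answer; and `toy_predict` and the sample record are invented purely for illustration.

```python
import re
import string
from typing import Callable, Iterable, List, Mapping


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    if not gold_answers:
        # Assumption: unanswerable questions expect an empty prediction.
        return normalize(prediction) == ""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)


def robustness_gap(
    predict: Callable[[str, str], str],
    examples: Iterable[Mapping],
) -> float:
    """Exact-match accuracy on fluent questions minus accuracy on disfluent ones."""
    examples = list(examples)
    fluent_hits = 0
    disfluent_hits = 0
    for ex in examples:
        gold = ex["answers"]["text"]
        fluent_hits += exact_match(predict(ex["original question"], ex["context"]), gold)
        disfluent_hits += exact_match(predict(ex["disfluent question"], ex["context"]), gold)
    return (fluent_hits - disfluent_hits) / len(examples)


def toy_predict(question: str, context: str) -> str:
    # Hypothetical stand-in for a real QA model: it is thrown off by the word "wait".
    return "Berlin" if "wait" in question else "Paris"


# Invented record for illustration only; not taken from the dataset.
example = {
    "original question": "What is the capital of France?",
    "disfluent question": "What is the capital of Germany, no wait, France?",
    "context": "Paris is the capital of France. Berlin is the capital of Germany.",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}

gap = robustness_gap(toy_predict, [example])
print(f"exact-match gap (fluent minus disfluent): {gap:.2f}")
```

A gap near zero would suggest the model handles disfluent questions about as well as fluent ones; a large positive gap suggests it is being misled by the distractor introduced in the disfluency.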

Provider:

google-research-datasets

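Summary of original information

Dataset overview

Basic dataset information

Name: DISFL-QA
Description: A benchmark dataset for understanding disfluencies in question answering.
Language: English
License: CC BY 4.0
Multilinguality: monolingual
Size: 10K<n<100K
Source data: original
Task category: question answering
Task IDs: extractive question answering, open-domain question answering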

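Dataset structure

Data fields

squad_v2_id: string
original question: the original question, string
disfluent question: the disfluent question, string
title: the title, string
context: the context passage, string
answers: contains the following fields
- text: the answer text, string
- answer_start: the answer start position, integer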
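The field layout above can be mirrored as a small typed record. The following is a minimal sketch, assuming that `answers` holds parallel lists of `text` and `answer_start` values (as in SQuAD-style data) and that `answer_start` is a character offset into `context`; neither detail is stated in the listing. The names `Answers`, `DisflQARecord`, and `answer_spans_match`, and the sample record itself, are invented for illustration.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    text: List[str]
    answer_start: List[int]


# The functional TypedDict syntax is used because two keys contain spaces.
DisflQARecord = TypedDict(
    "DisflQARecord",
    {
        "squad_v2_id": str,
        "original question": str,
        "disfluent question": str,
        "title": str,
        "context": str,
        "answers": Answers,
    },
)


def answer_spans_match(record: DisflQARecord) -> bool:
    """Return True if every answer text appears at its stated offset in the context.

    A record with no answers passes trivially.
    """
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Invented example record (hypothetical, not taken from the dataset).
example: DisflQARecord = {
    "squad_v2_id": "hypothetical-id-0001",
    "original question": "What is the capital of France?",
    "disfluent question": "What is the capital of Germany, no wait, France?",
    "title": "France",
    "context": "Paris is the capital of France. Berlin is the capital of Germany.",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}

print(answer_spans_match(example))
```

Note that the question fields have to be accessed with their spaces intact, for example `record["disfluent question"]`, which is why attribute-style access is not an option here.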

Data splits

Training set: 7,182 examples, 7,712,523 bytes
Test set: 3,643 examples, 3,865,097 bytes
Validation set: 1,000 examples, 1,072,731 bytes
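As a quick sanity check, the split sizes can be added up and compared with the "~12k pairs" figure in the description. The sketch below uses only the numbers listed above; the variable names are illustrative.

```python
# Split sizes as listed above: (number of examples, size in bytes).
splits = {
    "train": (7182, 7712523),
    "test": (3643, 3865097),
    "validation": (1000, 1072731),
}

total_examples = sum(n for n, _ in splits.values())
total_bytes = sum(size for _, size in splits.values())

for name, (n, size) in splits.items():
    share = 100 * n / total_examples
    print(f"{name:<10} {n:>5} examples  {share:5.1f}%  {size / n:7.1f} bytes/example")

print(f"total      {total_examples} examples, {total_bytes} bytes")
```

The three splits sum to 11,825 examples, which is consistent with the roughly 12k (disfluent question, answer) pairs quoted in the description.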

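Dataset creation

Source data

Initial data collection and normalization: DISFL-QA was built by asking human raters to insert disfluencies into questions from the SQuAD-v2 dataset.

Annotation process

Annotation task: Each passage-question pair was sent to a human annotation task, in which raters added a contextual disfluency using the passage as a source of distractors.
Quality assurance: A follow-up human evaluation was conducted, with the option to re-annotate.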

License information

License: CC BY 4.0

Citation information

@inproceedings{gupta-etal-2021-disflqa,
  title = "{Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering}",
  author = "Gupta, Aditya and Xu, Jiacheng and Upadhyay, Shyam and Yang, Diyi and Faruqui, Manaal",
  booktitle = "Findings of ACL",
  year = "2021"
}
