lmqg/qa_harvesting_from_wikipedia

Name: lmqg/qa_harvesting_from_wikipedia
Creator: lmqg
Published: 2024-08-24 05:02:17
License: cc-by-4.0

Source: Hugging Face. Updated 2024-08-24; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/lmqg/qa_harvesting_from_wikipedia

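Overview:

This dataset consists of question-answer pairs harvested from Wikipedia articles and is intended mainly for question answering. It contains more than one million question-answer pairs, divided into training, validation, and test sets. The dataset is monolingual, and its language is English.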

Provider:

lmqg

Summary of original information

Dataset overview

Basic information

License: cc-by-4.0
Name: Harvesting QA pairs from Wikipedia
Language: English (en)
Multilinguality: monolingual
Size: more than 1M (the three splits total 1,259,691 question-answer pairs)
Source dataset: extended from Wikipedia
Task category: question answering
Task ID: extractive question answering (extractive-qa)

Dataset description

Summary: This is the question answering dataset collected in "Harvesting Paragraph-level Question-Answer Pairs from Wikipedia" (Du & Cardie, ACL 2018).
Supported tasks: question answering

Dataset structure

Data fields

id: identifier (string)
title: title of the paragraph (string)
context: text of the paragraph (string)
question: the question (string)
answers: the answers (JSON)
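
As a rough illustration of the fields listed above, the sketch below builds one hypothetical record and checks that every answer is a literal span of the context, which is what an extractive QA task expects. The SQuAD-style layout of `answers` (parallel `text` and `answer_start` lists) is an assumption, because the listing only says the field is JSON. The sample values and the helper `answers_are_extractive` are invented for illustration and do not come from the dataset.

```python
import json

# Hypothetical record: the values are invented for illustration only.
context = "Melbourne is the capital of the Australian state of Victoria."
answer_text = "Victoria"

record = {
    "id": "example-0",
    "title": "Melbourne",
    "context": context,
    "question": "Which state is Melbourne the capital of?",
    # Assumed SQuAD-style layout: parallel lists of answer strings
    # and their character offsets into the context.
    "answers": json.dumps(
        {"text": [answer_text], "answer_start": [context.index(answer_text)]}
    ),
}


def answers_are_extractive(rec):
    """Return True if every answer is found in the context at its stated offset."""
    answers = rec["answers"]
    if isinstance(answers, str):
        # Accept either a JSON string or an already decoded dict.
        answers = json.loads(answers)
    ctx = rec["context"]
    return all(
        ctx[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answers_are_extractive(record))
```

If the real records store `answers` differently, only the decoding step inside the helper would need to change.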

Data splits

Split	Count
Training	1,204,925
Validation	30,293
Test	24,473
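
The split counts in the table above can be summed to get the overall size and the share of each split. The following is a minimal sketch of that arithmetic. The key names `train`, `validation`, and `test` are assumed labels for the three rows, not confirmed split identifiers.

```python
# Counts copied from the split table; the key names are assumed labels.
split_sizes = {
    "train": 1_204_925,
    "validation": 30_293,
    "test": 24_473,
}

total = sum(split_sizes.values())

# Fraction of all question-answer pairs that falls into each split.
shares = {name: count / total for name, count in split_sizes.items()}

for name, count in split_sizes.items():
    print(f"{name:<10} {count:>9,} {shares[name]:7.2%}")
print(f"{'total':<10} {total:>9,}")
```

The total comes to 1,259,691 pairs, consistent with the description of the corpus as containing over one million question-answer pairs.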

Citation

@inproceedings{du-cardie-2018-harvesting,
    title = "Harvesting Paragraph-level Question-Answer Pairs from {W}ikipedia",
    author = "Du, Xinya and Cardie, Claire",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P18-1177",
    doi = "10.18653/v1/P18-1177",
    pages = "1907--1917",
    abstract = "We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. As compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We provide qualitative analysis for the this large-scale generated corpus from Wikipedia.",
}
