
lmqg/qa_harvesting_from_wikipedia

Source: Hugging Face | Updated: 2024-08-24 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/lmqg/qa_harvesting_from_wikipedia
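Resource summary:
This dataset contains question-answer pairs harvested from Wikipedia articles and is intended mainly for question answering. It holds more than one million question-answer pairs, split into training, validation, and test sets. The dataset is monolingual, in English.
Provider:
lmqg
Original information summary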

Dataset overview

Basic dataset information

  • License: cc-by-4.0
  • Name: Harvesting QA pairs from Wikipedia
  • Language: English (en)
  • Multilinguality: monolingual
  • Size: more than 1M
  • Source dataset: extended from Wikipedia
  • Task category: question answering
  • Task ID: extractive question answering (extractive-qa)

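Dataset description

  • Summary: This question-answering dataset was collected with the approach described in "Harvesting Paragraph-level Question-Answer Pairs from Wikipedia" (Du & Cardie, ACL 2018).
  • Supported tasks: question answering

Dataset structure

Data fields

  • id: identifier (string)
  • title: title of the paragraph (string)
  • context: text of the paragraph (string)
  • question: the question (string)
  • answers: the answers, in JSON format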
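
The field list above can be illustrated with a minimal sketch of reading one record and recovering the answer span from `context`. The layout of `answers` is an assumption: the source only says it is JSON, and the SQuAD-style form used here (parallel `text` and `answer_start` lists) is a guess based on the extractive-qa task ID. The record values and the helper name `extract_answer_spans` are hypothetical.

```python
import json

# Hypothetical record that follows the field list above. The values are
# invented for illustration and are not taken from the dataset.
record = {
    "id": "example-0",
    "title": "Example",
    "context": "Melbourne is the capital of the Australian state of Victoria.",
    "question": "What is the capital of Victoria?",
    # Assumption: SQuAD-style answers with parallel lists.
    "answers": json.dumps({"text": ["Melbourne"], "answer_start": [0]}),
}


def extract_answer_spans(rec):
    """Return (start, end, text) tuples, checking each span against context."""
    answers = rec["answers"]
    # Accept either a JSON string or an already decoded dict.
    if isinstance(answers, str):
        answers = json.loads(answers)
    spans = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        # An extractive answer must appear verbatim at the stated offset.
        if rec["context"][start:end] != text:
            raise ValueError(f"answer {text!r} not found at offset {start}")
        spans.append((start, end, text))
    return spans


print(extract_answer_spans(record))  # [(0, 9, 'Melbourne')]
```

If the real records use a different `answers` layout, only the two key names inside `extract_answer_spans` would need to change.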

Data splits

Split        Count
Train        1,204,925
Validation   30,293
Test         24,473
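
As a quick arithmetic check on the split table, the sketch below sums the three counts and computes each split's share of the total. The counts are copied from the table; the variable names are illustrative only.

```python
# Split sizes copied from the table above.
splits = {"train": 1_204_925, "validation": 30_293, "test": 24_473}

# Total number of question-answer pairs across all splits.
total = sum(splits.values())

# Fraction of the total held by each split.
shares = {name: count / total for name, count in splits.items()}

print(f"total pairs: {total:,}")  # total pairs: 1,259,691
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The total of 1,259,691 pairs is consistent with the "over one million question-answer pairs" stated in the summary, and the training split accounts for roughly 96% of the data.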

Citation information

@inproceedings{du-cardie-2018-harvesting,
    title = "Harvesting Paragraph-level Question-Answer Pairs from {W}ikipedia",
    author = "Du, Xinya and Cardie, Claire",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P18-1177",
    doi = "10.18653/v1/P18-1177",
    pages = "1907--1917",
    abstract = "We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. As compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We provide qualitative analysis for the this large-scale generated corpus from Wikipedia.",
}
