lmqg/qa_harvesting_from_wikipedia
Dataset Overview
Basic Information
- License: cc-by-4.0
- Name: Harvesting QA pairs from Wikipedia
- Language: English (en)
- Multilinguality: monolingual
- Size: more than 1M examples
- Source dataset: extended from Wikipedia
- Task category: question answering
- Task ID: extractive question answering (extractive-qa)
Dataset Description
- Summary: This is the question-answer dataset collected in "Harvesting Paragraph-level Question-Answer Pairs from Wikipedia" (Du & Cardie, ACL 2018).
- Supported tasks: question answering
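Dataset Structure
Data Fields
- id: string, the example identifier
- title: string, the title of the paragraph
- context: string, the paragraph text
- question: string, the question
- answers: the answers, stored as JSON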
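The field list can be made concrete with a small typed sketch. The card only says that `answers` is JSON, so the SQuAD-style layout used here (parallel `text` and `answer_start` lists) is an assumption, as are the names `Answers`, `QARecord`, and `spans_match_context`. The sample record is invented for illustration and is not taken from the dataset.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    # Assumed SQuAD-style layout; the card itself only says "JSON".
    text: List[str]
    answer_start: List[int]


class QARecord(TypedDict):
    id: str
    title: str
    context: str
    question: str
    answers: Answers


def spans_match_context(record: QARecord) -> bool:
    """Return True if every answer text occurs in the context at its stated offset."""
    answers = record["answers"]
    if len(answers["text"]) != len(answers["answer_start"]):
        return False
    context = record["context"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Hypothetical record, made up for illustration.
_context = "Melbourne is the capital of the Australian state of Victoria."
example: QARecord = {
    "id": "example-0",
    "title": "Melbourne",
    "context": _context,
    "question": "Which state is Melbourne the capital of?",
    "answers": {
        "text": ["Victoria"],
        # Character offset of the answer inside the context.
        "answer_start": [_context.index("Victoria")],
    },
}

print(spans_match_context(example))
```

A check like this is a cheap sanity test for extractive QA data, where each answer is expected to be a literal span of its context.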
Data Splits
| Split | Examples |
|---|---|
| Train | 1,204,925 |
| Validation | 30,293 |
| Test | 24,473 |
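As a quick arithmetic check on the table, the sketch below sums the split sizes and reports each split's share of the total. The counts are copied from the table; the names `SPLIT_SIZES` and `split_shares` are illustrative only.

```python
# Split sizes as listed in the table above.
SPLIT_SIZES = {"train": 1_204_925, "validation": 30_293, "test": 24_473}


def split_shares(sizes):
    """Return each split's fraction of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_examples = sum(SPLIT_SIZES.values())
shares = split_shares(SPLIT_SIZES)

print(f"total: {total_examples:,}")
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The total of 1,259,691 examples agrees with the paper's description of a corpus of over one million question-answer pairs, with the large majority in the training split.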
Citation Information
@inproceedings{du-cardie-2018-harvesting,
    title = "Harvesting Paragraph-level Question-Answer Pairs from {W}ikipedia",
    author = "Du, Xinya and Cardie, Claire",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P18-1177",
    doi = "10.18653/v1/P18-1177",
    pages = "1907--1917",
    abstract = "We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. As compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We provide qualitative analysis for the this large-scale generated corpus from Wikipedia.",
}



