ccdv/pubmed-summarization
收藏数据集概述
基本信息
- 语言: 英语
- 多语言性: 单语种
- 大小: 100K<n<1M
- 任务类别:
- 摘要生成
- 文本生成
- 标签: 条件文本生成
数据集描述
- 名称: PubMed数据集用于摘要生成
- 用途: 用于长文档的摘要生成
- 原始数据源: GitHub仓库
- 数据处理: 原始数据预先分词,数据集返回" ".join(text)并添加" "以表示段落
- 兼容性: 与
run_summarization.py脚本兼容,需在summarization_name_mapping变量中添加配置
数据字段
id: 论文IDarticle: 包含论文主体的字符串abstract: 包含论文摘要的字符串
数据分割
- 分割: 训练集、验证集、测试集
- 统计信息:
分割 实例数 平均词数 训练 119,924 3043 / 215 验证 6,633 3111 / 216 测试 6,658 3092 / 219
引用信息
@inproceedings{cohan-etal-2018-discourse, title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents", author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/N18-2097", doi = "10.18653/v1/N18-2097", pages = "615--621", abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.", }




