abisee/cnn_dailymail
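Dataset Overview
Name: CNN / Daily Mail
Language: English (en)
License: Apache-2.0
Multilinguality: monolingual
Size: 100K<n<1M
Source data: original
Task category: summarization
Task ID: news-articles-summarization
Papers with Code ID: cnn-daily-mail-1
Configuration versions: 1.0.0, 2.0.0, 3.0.0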
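The configuration versions listed above can be selected when loading the dataset. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed; the helper name `load_cnn_dailymail` and its default version are illustrative choices, not part of the dataset card.

```python
# Configuration names taken from the overview above.
CONFIG_VERSIONS = ("1.0.0", "2.0.0", "3.0.0")


def load_cnn_dailymail(version: str = "3.0.0"):
    """Load one configuration of abisee/cnn_dailymail (hypothetical helper)."""
    if version not in CONFIG_VERSIONS:
        raise ValueError(
            f"unknown configuration {version!r}; expected one of {CONFIG_VERSIONS}"
        )
    # Imported lazily so the version check works even without the library.
    from datasets import load_dataset

    # Downloads the data on first use; returns the available splits.
    return load_dataset("abisee/cnn_dailymail", version)
```

Calling `load_cnn_dailymail("3.0.0")` requires network access on first use, so it is left out of the snippet itself.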
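Dataset Structure
Data Instances
- id: string, the SHA1 hash of the URL the article was retrieved from
- article: string, the body of the news article
- highlights: string, the summary of the article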
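To make the three fields concrete, here is a small sketch of how a record with this shape could be built. It assumes the `id` is the hexadecimal SHA1 digest of the UTF-8 encoded URL; the exact URL normalization used by the dataset is not stated here, and the URL, text values, and helper names below are hypothetical.

```python
import hashlib


def url_to_id(url: str) -> str:
    """Return the hexadecimal SHA1 digest of a URL (assumed UTF-8 encoded)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def make_instance(url: str, article: str, highlights: str) -> dict:
    """Build a record with the three string fields described above."""
    return {
        "id": url_to_id(url),
        "article": article,
        "highlights": highlights,
    }


# Hypothetical values, for illustration only.
example = make_instance(
    "http://example.com/hypothetical-story.html",
    "Full article text ...",
    "One-line summary ...",
)
```

A SHA1 digest is 160 bits, so the resulting `id` is always 40 hexadecimal characters.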
Data Splits
- Train: 287,113 instances
- Validation: 13,368 instances
- Test: 11,490 instances
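As a quick arithmetic check on the split sizes above, the sketch below sums them and computes each split's share of the whole. The variable names are illustrative.

```python
# Split sizes as listed above.
SPLIT_SIZES = {"train": 287_113, "validation": 13_368, "test": 11_490}

total = sum(SPLIT_SIZES.values())
fractions = {name: count / total for name, count in SPLIT_SIZES.items()}

for name, count in SPLIT_SIZES.items():
    # Print the split name, its size, and its share of all instances.
    print(f"{name:<10} {count:>8,} {fractions[name]:.1%}")

print(f"{'total':<10} {total:>8,}")
```

The total is 311,971 instances, which is consistent with the 100K<n<1M size category.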
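Dataset Creation
Source Data
- Source: news articles from CNN and the Daily Mail
- Time span: CNN (April 2007 to April 2015), Daily Mail (June 2010 to April 2015)
Data Processing
- Initial collection: articles were downloaded using the Wayback Machine
- Length limit: articles are no longer than 2,000 words
- Data format: articles and summaries are plain text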
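The length limit above could be applied with a filter like the one sketched here. This is not the original collection code: counting words by splitting on whitespace is an assumption, and the original pipeline may have tokenized differently. The function names are hypothetical.

```python
MAX_WORDS = 2000  # length limit stated above


def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumed tokenization)."""
    return len(text.split())


def within_limit(article: str, max_words: int = MAX_WORDS) -> bool:
    """Return True if the article has at most max_words words."""
    return word_count(article) <= max_words


def filter_articles(records: list, max_words: int = MAX_WORDS) -> list:
    """Keep only records whose 'article' field respects the limit."""
    return [r for r in records if within_limit(r["article"], max_words)]
```

For example, `filter_articles(records)` would drop any record whose `article` field has 2,001 or more whitespace-separated words.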
License
- Version 1.0.0: Apache-2.0 license
Citation Information
@inproceedings{see-etal-2017-get,
    title = "Get To The Point: Summarization with Pointer-Generator Networks",
    author = "See, Abigail and Liu, Peter J. and Manning, Christopher D.",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1099",
    doi = "10.18653/v1/P17-1099",
    pages = "1073--1083",
    abstract = "Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.",
}
@inproceedings{DBLP:conf/nips/HermannKGEKSB15,
    author = {Karl Moritz Hermann and Tomás Kociský and Edward Grefenstette and Lasse Espeholt and Will Kay and Mustafa Suleyman and Phil Blunsom},
    title = {Teaching Machines to Read and Comprehend},
    year = {2015},
    cdate = {1420070400000},
    pages = {1693-1701},
    url = {http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend},
    booktitle = {NIPS},
    crossref = {conf/nips/2015}
}




