ccdv/cnn_dailymail

Source: Hugging Face. Updated 2022-10-24. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ccdv/cnn_dailymail
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- summarization
- text-generation
task_ids: []
paperswithcode_id: cnn-daily-mail-1
pretty_name: CNN / Daily Mail
tags:
- conditional-text-generation
---

**Copy of the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset fixing the "NotADirectoryError: [Errno 20]".**

# Dataset Card for CNN Dailymail Dataset

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** [CNN / DailyMail Dataset repository](https://github.com/abisee/cnn-dailymail)
- **Paper:** [Teaching Machines to Read and Comprehend](https://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.pdf), [Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond](https://www.aclweb.org/anthology/K16-1028.pdf), [Get To The Point: Summarization with Pointer-Generator Networks](https://www.aclweb.org/anthology/P17-1099)
- **Leaderboard:** [Papers with Code leaderboard for CNN / Dailymail Dataset](https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail)
- **Point of Contact:** [Abigail See](mailto:abisee@stanford.edu)

### Dataset Summary

The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.

### Supported Tasks and Leaderboards

- `summarization`: [Versions 2.0.0 and 3.0.0 of the CNN / DailyMail Dataset](https://www.aclweb.org/anthology/K16-1028.pdf) can be used to train a model for abstractive and extractive summarization ([Version 1.0.0](https://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.pdf) was developed for machine reading and comprehension and abstractive question answering). Model performance is measured by the [ROUGE](https://huggingface.co/metrics/rouge) score of the output summary for a given article, computed against the highlights written by the original article author. [Zhong et al (2020)](https://www.aclweb.org/anthology/2020.acl-main.552.pdf) report a ROUGE-1 score of 44.41 when testing a model trained for extractive summarization. See the [Papers With Code leaderboard](https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail) for more models.

### Languages

The BCP-47 code for English as generally spoken in the United States is en-US and the BCP-47 code for English as generally spoken in the United Kingdom is en-GB. It is unknown if other varieties of English are represented in the data.

## Dataset Structure

### Data Instances

For each instance, there is a string for the article, a string for the highlights, and a string for the id.
See the [CNN / Daily Mail dataset viewer](https://huggingface.co/datasets/viewer/?dataset=cnn_dailymail&config=3.0.0) to explore more examples.

```
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
 'article': "(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.",
 'highlights': "The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says ."}
```

The average token counts for the articles and the highlights are provided below:

| Feature    | Mean Token Count |
| ---------- | ---------------- |
| Article    | 781              |
| Highlights | 56               |

### Data Fields

- `id`: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
- `article`: a string containing the body of the news article
- `highlights`: a string containing the highlight of the article as written by the article author

### Data Splits

The CNN/DailyMail dataset has 3 splits: _train_, _validation_, and _test_. Below are the statistics for Version 3.0.0 of the dataset.

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| Train         | 287,113                      |
| Validation    | 13,368                       |
| Test          | 11,490                       |

## Dataset Creation

### Curation Rationale

Version 1.0.0 aimed to support supervised neural methodologies for machine reading and question answering with a large amount of real natural language training data and released about 313k unique articles and nearly 1M Cloze style questions to go with the articles. Versions 2.0.0 and 3.0.0 changed the structure of the dataset to support summarization rather than question answering. Version 3.0.0 provided a non-anonymized version of the data, whereas both the previous versions were preprocessed to replace named entities with unique identifier labels.

### Source Data

#### Initial Data Collection and Normalization

The data consists of news articles and highlight sentences. In the question answering setting of the data, the articles are used as the context and entities are hidden one at a time in the highlight sentences, producing Cloze style questions where the goal of the model is to correctly guess which entity in the context has been hidden in the highlight. In the summarization setting, the highlight sentences are concatenated to form a summary of the article. The CNN articles were written between April 2007 and April 2015. The Daily Mail articles were written between June 2010 and April 2015.

The code for the original data collection is available at <https://github.com/deepmind/rc-data>. The articles were downloaded using archives of <www.cnn.com> and <www.dailymail.co.uk> on the Wayback Machine. Articles were not included in the Version 1.0.0 collection if they exceeded 2000 tokens. Due to accessibility issues with the Wayback Machine, Kyunghyun Cho has made the datasets available at <https://cs.nyu.edu/~kcho/DMQA/>. An updated version of the code that does not anonymize the data is available at <https://github.com/abisee/cnn-dailymail>.
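
The question answering setting described above, in which entities are hidden one at a time in a highlight, can be sketched as follows. This is a minimal illustration and not the DeepMind `rc-data` pipeline: the `@entityN` and `@placeholder` marker formats, the helper name `make_cloze_questions`, and the toy sentence are assumptions made for the example.

```python
import re


def make_cloze_questions(highlight: str) -> list[tuple[str, str]]:
    """Hide each entity marker in a highlight, one at a time.

    Returns (question, answer) pairs. Entities are assumed to be already
    anonymized as tokens of the form "@entity<N>" (an assumption of this
    sketch, not something the dataset card specifies).
    """
    questions = []
    for match in re.finditer(r"@entity\d+", highlight):
        # Replace only this occurrence; every other entity stays visible.
        question = (
            highlight[: match.start()] + "@placeholder" + highlight[match.end():]
        )
        questions.append((question, match.group()))
    return questions


# Toy, hand-anonymized highlight (hypothetical data).
for question, answer in make_cloze_questions("@entity0 left @entity1 36 days ago ."):
    print(question, "->", answer)
```

Each highlight therefore yields as many questions as it has entity mentions, which is consistent with the card reporting far more questions than articles.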

Hermann et al provided their own tokenization script. The script provided by See uses the PTBTokenizer. It also lowercases the text and adds periods to lines missing them.

#### Who are the source language producers?

The text was written by journalists at CNN and the Daily Mail.

### Annotations

The dataset does not contain any additional annotations.

#### Annotation process

[N/A]

#### Who are the annotators?

[N/A]

### Personal and Sensitive Information

Version 3.0 is not anonymized, so individuals' names can be found in the dataset. Information about the original author is not included in the dataset.

## Considerations for Using the Data

### Social Impact of Dataset

The purpose of this dataset is to help develop models that can summarize long paragraphs of text in one or two sentences.

This task is useful for efficiently presenting information given a large quantity of text. It should be made clear that any summarizations produced by models trained on this dataset are reflective of the language used in the articles, but are in fact automatically generated.

### Discussion of Biases

[Bordia and Bowman (2019)](https://www.aclweb.org/anthology/N19-3002.pdf) explore measuring gender bias and debiasing techniques in the CNN / Dailymail dataset, the Penn Treebank, and WikiText-2. They find the CNN / Dailymail dataset to have a slightly lower gender bias based on their metric compared to the other datasets, but still show evidence of gender bias when looking at words such as 'fragile'.

Because the articles were written by and for people in the US and the UK, they will likely present specifically US and UK perspectives and feature events that are considered relevant to those populations during the time that the articles were published.

### Other Known Limitations

News articles have been shown to conform to writing conventions in which important information is primarily presented in the first third of the article [(Kryściński et al, 2019)](https://www.aclweb.org/anthology/D19-1051.pdf). [Chen et al (2016)](https://www.aclweb.org/anthology/P16-1223.pdf) conducted a manual study of 100 random instances of the first version of the dataset and found 25% of the samples to be difficult even for humans to answer correctly due to ambiguity and coreference errors.

It should also be noted that machine-generated summarizations, even when extractive, may differ in truth values when compared to the original articles.

## Additional Information

### Dataset Curators

The data was originally collected by Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom of Google DeepMind. Tomáš Kočiský and Phil Blunsom are also affiliated with the University of Oxford. They released scripts to collect and process the data into the question answering format.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, and Bing Xiang of IBM Watson and Çağlar Gülçehre of Université de Montréal modified Hermann et al's collection scripts to restore the data to a summary format. They also produced both anonymized and non-anonymized versions.

The code for the non-anonymized version is made publicly available by Abigail See of Stanford University, Peter J. Liu of Google Brain and Christopher D. Manning of Stanford University at <https://github.com/abisee/cnn-dailymail>. The work at Stanford University was supported by the DARPA DEFT Program, AFRL contract no. FA8750-13-2-0040.

### Licensing Information

The CNN / Daily Mail dataset version 1.0.0 is released under the [Apache-2.0 License](http://www.apache.org/licenses/LICENSE-2.0).

### Citation Information

```
@inproceedings{see-etal-2017-get,
    title = "Get To The Point: Summarization with Pointer-Generator Networks",
    author = "See, Abigail and Liu, Peter J. and Manning, Christopher D.",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1099",
    doi = "10.18653/v1/P17-1099",
    pages = "1073--1083",
    abstract = "Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.",
}
```

```
@inproceedings{DBLP:conf/nips/HermannKGEKSB15,
  author={Karl Moritz Hermann and Tomás Kociský and Edward Grefenstette and Lasse Espeholt and Will Kay and Mustafa Suleyman and Phil Blunsom},
  title={Teaching Machines to Read and Comprehend},
  year={2015},
  cdate={1420070400000},
  pages={1693-1701},
  url={http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend},
  booktitle={NIPS},
  crossref={conf/nips/2015}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@jplu](https://github.com/jplu), [@jbragg](https://github.com/jbragg), [@patrickvonplaten](https://github.com/patrickvonplaten) and [@mcmillanmajora](https://github.com/mcmillanmajora) for adding this dataset.

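Provider:
ccdv
Summary of original information

Dataset overview

Basic information

  • Name: CNN / Daily Mail
  • Language: English (en)
  • License: Apache-2.0
  • Multilinguality: monolingual
  • Size: 100K<n<1M
  • Source data: original
  • Task categories: summarization, text generation
  • Tags: conditional-text-generation

Dataset description

  • Overview: Contains just over 300k unique news articles written by journalists at CNN and the Daily Mail; supports extractive and abstractive summarization.
  • Task support: Used to train models for abstractive and extractive summarization; model performance is measured with ROUGE scores.
  • Languages: The dataset mainly uses US English (en-US) and UK English (en-GB); whether other varieties of English are represented is unknown.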

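Dataset structure

  • Data instances: each instance contains an article, its highlights, and an ID.
  • Data fields:
    • id: hexadecimal SHA1 hash of the URL the article was retrieved from
    • article: body of the news article
    • highlights: highlights of the article as written by the article author
  • Data splits: train (287,113 instances), validation (13,368 instances), test (11,490 instances)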
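
The fields and splits listed above can be written down as a small record type. The sketch below is only an illustration: the `Example` class, the `url_to_id` helper, the UTF-8 encoding of the URL before hashing, and the example URL are assumptions, not details confirmed by the dataset card.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Example:
    id: str          # hexadecimal SHA1 digest of the source URL
    article: str     # body of the news article
    highlights: str  # author-written highlights, one per line


def url_to_id(url: str) -> str:
    """Hex SHA1 digest of a URL (UTF-8 encoding is an assumption here)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Split sizes for Version 3.0.0, as listed above.
SPLIT_SIZES = {"train": 287_113, "validation": 13_368, "test": 11_490}
total = sum(SPLIT_SIZES.values())

for name, size in SPLIT_SIZES.items():
    print(f"{name:<10} {size:>7,} {size / total:6.1%}")

# Hypothetical URL, used only to show the shape of an id.
example = Example(url_to_id("https://example.com/story"), "article text", "a highlight .")
print(example.id, len(example.id))
```

A SHA1 digest is always 40 hexadecimal characters, which matches the length of the `id` in the sample instance shown in the card.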

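Dataset creation

  • Curation rationale: originally built to support supervised neural methods for machine reading and question answering, later restructured to support summarization.
  • Source data: news articles and highlight sentences; the articles come from CNN and the Daily Mail and span 2007 to 2015.
  • Annotations: the dataset contains no additional annotations.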

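Considerations for using the data

  • Social impact: intended for developing models that can efficiently summarize large quantities of text.
  • Discussion of biases: the dataset shows slight gender bias and is likely to reflect US and UK perspectives.
  • Other limitations: the important information in news articles is usually concentrated in the first third, and machine-generated summaries may differ from the original article in truth value.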

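Additional information

  • Dataset curators: originally collected by Karl Moritz Hermann and other members of Google DeepMind, later converted to a summary format by Ramesh Nallapati and colleagues.
  • License: Apache-2.0
  • Citation information: see the related papers and code repositories.
  • Contributors: thanks to the GitHub users who added the dataset.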
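Collected summary
Dataset introduction
Construction
The CNN/DailyMail dataset is built from news reports published by CNN and the Daily Mail and covers just over 300k unique news articles. The initial version (1.0.0) was designed to support machine reading and question answering; later versions (2.0.0 and 3.0.0) were restructured to support summarization. The articles were retrieved from archives of CNN and the Daily Mail on the Wayback Machine. Versions 1.0.0 and 2.0.0 were preprocessed to replace named entities with unique identifier labels, whereas Version 3.0.0 is not anonymized. Construction involves downloading the articles, hiding entities in the highlights to produce Cloze-style questions (question answering setting), and concatenating the highlight sentences into a summary (summarization setting).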
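
The anonymization step applied to the earlier versions, replacing named entities with unique identifier labels, can be sketched as below. This is a simplified stand-in rather than the original preprocessing: the `@entityN` label format, the `anonymize` helper, and passing the entity list by hand (instead of detecting entities automatically) are all assumptions of the example.

```python
import re


def anonymize(text: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed entity with a unique "@entity<N>" label.

    Returns the rewritten text and the entity-to-label mapping.
    """
    mapping: dict[str, str] = {}
    for name in entities:
        if name not in mapping:  # repeated names share one label
            mapping[name] = f"@entity{len(mapping)}"
    if not mapping:
        return text, mapping
    # Try longer names first so that "New York" is matched before "York".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda m: mapping[m.group()], text), mapping


anonymized, labels = anonymize(
    "The Veendam left New York for Rio.", ["Veendam", "New York", "Rio"]
)
print(anonymized)
print(labels)
```

Keeping the mapping makes it possible to check a model's answer against the original entity, and a non-anonymized release simply skips this step.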
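Features
The main feature of the CNN/DailyMail dataset is its large body of news articles, each paired with highlights written by the article's author; the dataset contains no additional annotations. The highlights are short and aim to capture the core information of the article. The dataset covers a wide range of news topics, which gives models varied training material. The non-anonymized Version 3.0.0 retains the named entities of the original articles.
Usage
The CNN/DailyMail dataset is widely used for text summarization, for training both abstractive and extractive models. Typically the article is the model input and the highlights are the target output, and performance is evaluated with metrics such as ROUGE. The dataset can also be used to study the linguistic characteristics of news text and bias in summarization. It can be loaded and explored through the Hugging Face platform.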
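
A minimal sketch of this usage with the Hugging Face `datasets` library is given below. It assumes that `datasets` is installed, that network access is available, and that this copy exposes a `3.0.0` configuration like the original dataset; depending on the installed library version, script-based repositories may need extra options or may not load at all. The helper names are illustrative.

```python
def load_cnn_dailymail(split: str = "validation"):
    """Load one split of ccdv/cnn_dailymail (requires `datasets` and network)."""
    # Imported lazily so that this module can be imported without the library.
    from datasets import load_dataset

    # "3.0.0" is assumed to be a valid configuration name for this copy.
    return load_dataset("ccdv/cnn_dailymail", "3.0.0", split=split)


def to_seq2seq_pair(example: dict) -> tuple[str, str]:
    """The article is the model input; the highlights are the target summary."""
    return example["article"], example["highlights"]
```

With the dataset loaded as `ds = load_cnn_dailymail()`, calling `to_seq2seq_pair(ds[0])` returns the source text and reference summary for the first validation example.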
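Background and challenges
Background overview
The CNN/DailyMail dataset was first released in 2015 by a research team at Google DeepMind to provide large-scale natural language training data for machine reading and comprehension. It initially contained about 313k news articles and nearly 1M Cloze-style questions. Later versions were restructured to support text summarization, both abstractive and extractive. The central research question is how neural models can generate high-quality summaries that help readers quickly grasp long news articles. The dataset is influential in natural language processing, particularly for summarization and text generation.
Current challenges
Summarization on the CNN/DailyMail dataset faces several challenges. First, news articles usually follow an "inverted pyramid" structure in which the important information is concentrated at the beginning, so models may rely too heavily on the opening and ignore later content. Second, the dataset contains biases, such as gender bias and cultural bias, which may affect the fairness and generalization of models. Third, construction involved technical changes, such as the switch between anonymized and non-anonymized versions and the conversion from a question answering format to a summarization format. These factors affect both how the dataset was built and how models trained on it should be evaluated.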
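
The lead bias described above is the reason a simple "take the first few sentences" baseline, commonly called lead-3, is a useful reference point for this kind of data. The sketch below is illustrative only: the `lead_n` helper and its naive punctuation-based sentence splitter are assumptions, and a real evaluation would use a proper sentence tokenizer.

```python
import re


def lead_n(article: str, n: int = 3) -> str:
    """Return the first n sentences of an article as an extractive summary."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:n])


article = (
    "An American woman died aboard a cruise ship. "
    "The ship docked at Rio de Janeiro on Tuesday. "
    "Police are investigating her death. "
    "The ship left New York 36 days ago."
)
print(lead_n(article))
```

If a trained model cannot clearly beat this baseline, it may mostly have learned to copy the opening of the article.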
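Common scenarios
Classic use cases
The CNN/DailyMail dataset is widely used for text summarization in natural language processing. It contains just over 300k news articles with their corresponding summaries and is suited to training and evaluating automatic summarization models. Researchers typically use it to develop models that extract the key information from long articles, for both abstractive and extractive summarization, with model performance evaluated using ROUGE scores.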
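
To make the ROUGE evaluation concrete, the sketch below computes a simplified ROUGE-1 F1 score as unigram overlap between a candidate summary and a reference. It is not the official ROUGE implementation: the `rouge1_f1` helper, whitespace tokenization, and lowercasing are assumptions, and stemming and other options of standard ROUGE tooling are omitted, so the numbers will not match published scores exactly.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The elderly woman suffered from diabetes and hypertension ."
candidate = "The woman suffered from diabetes ."
print(round(rouge1_f1(candidate, reference), 3))
```

The function returns a value between 0 and 1, whereas published results such as the 44.41 ROUGE-1 cited in the card are reported on a 0 to 100 scale.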
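Derived related work
Many well-known studies build on the CNN/DailyMail dataset. For example, the Pointer-Generator Networks proposed by Abigail See et al. achieved a notable performance improvement on it. The dataset has also prompted a large body of work on abstractive, extractive, and multi-document summarization.
Recent research on the dataset
Latest research directions
As an important benchmark for text summarization, the CNN/DailyMail dataset has attracted broad research attention in recent years. Researchers have worked on improving model performance on abstractive and extractive summarization, with a focus on improved sequence-to-sequence models, pointer-generator networks, and coverage mechanisms to make summaries more accurate and coherent. Researchers are also exploring debiasing techniques to address gender bias and language-style bias in the dataset, with the aim of improving fairness and generalization. This work supports both summarization research and practical automatic news summarization systems.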
The above content was collected and summarized by 遇见数据集.