NewsEdits
Source: arXiv. Updated: 2022-07-01. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2104.09647v2
Description:
NewsEdits, created by the Information Sciences Institute at the University of Southern California, is the first publicly available dataset of news article revision histories. It contains 1,278,804 articles and 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. The revisions comprise 10.9 million added sentences, 8.9 million modified sentences, 6.8 million deleted sentences, and 72 million atomic edits. The data was collected by monitoring article URLs and downloading the article text whenever a new version appeared. The dataset is suited to language modeling, event ordering, and computational journalism, and is intended to support research on how information is updated and how events are described as news articles change.
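As an illustration of the added / modified / deleted sentence categories mentioned above, the following is a minimal sketch that compares two versions of an article at sentence level. It is not the dataset authors' matching procedure: it swaps in Python's standard difflib.SequenceMatcher, assumes each version is already split into a list of sentences, and uses a hypothetical function name (classify_sentence_edits). Treating a "replace" region as a modification is also a simplifying assumption.

```python
from difflib import SequenceMatcher


def classify_sentence_edits(old_sentences, new_sentences):
    """Count sentence-level edits between two article versions.

    Hypothetical helper for illustration only; inputs are lists of
    sentence strings, already segmented by the caller.
    """
    matcher = SequenceMatcher(a=old_sentences, b=new_sentences, autojunk=False)
    counts = {"added": 0, "deleted": 0, "modified": 0, "unchanged": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_n, new_n = i2 - i1, j2 - j1
        if tag == "equal":
            counts["unchanged"] += old_n
        elif tag == "insert":
            counts["added"] += new_n
        elif tag == "delete":
            counts["deleted"] += old_n
        else:  # "replace": pair sentences up, surplus counts as add/delete
            paired = min(old_n, new_n)
            counts["modified"] += paired
            counts["added"] += new_n - paired
            counts["deleted"] += old_n - paired
    return counts


old_version = [
    "A fire broke out downtown.",
    "Two people were hurt.",
    "Police are investigating.",
]
new_version = [
    "A fire broke out downtown.",
    "Three people were hurt.",
    "Police are investigating.",
    "The cause is unknown.",
]
result = classify_sentence_edits(old_version, new_version)
print(result)
```

For the two toy versions here, the second sentence is counted as modified and the final sentence as added. A real pipeline would need a sentence segmenter and a fuzzier similarity measure to tell a rewritten sentence from an unrelated replacement.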
Provider:
Information Sciences Institute, University of Southern California
Created:
2021-04-20



