NewsEdits
Source: arXiv | Updated: 2022-06-15 | Indexed: 2024-06-21
Download link:
https://github.com/isi-nlp/NewsEdits.git
Description:
NewsEdits, developed by the Information Sciences Institute at the University of Southern California, is the first publicly available dataset of news revision histories. It is large and multilingual: 4.6 million versions of 1.2 million articles from 22 English- and French-language newspaper sources, covering 2006 to 2021. The dataset was created to support analysis of how the narrative and facts of a news article evolve across versions. Edit operations between versions are labeled as additions, deletions, edits, and refactors, and are identified with high-precision extraction algorithms. The dataset also introduces three new tasks that predict the operations performed during a version update; these tasks are challenging for large natural language processing models but feasible for expert human journalists. Application areas include event temporal relation extraction, article link prediction, fact-based updating, misinformation detection, headline generation, and author attribution, along with several research directions in computational journalism and communication studies.
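To make the four operation types concrete, here is a minimal sketch of how two versions of an article might be compared sentence by sentence. It is not the dataset's extraction algorithm, which is not described here; it uses Python's standard `difflib` under the assumptions that each version is already split into sentences, that a sentence replaced in place counts as an edit, and that a sentence removed from one position and re-inserted verbatim elsewhere counts as a refactor. The function name `label_operations` and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def label_operations(old, new):
    """Label sentence-level changes between two versions.

    `old` and `new` are lists of sentences. Returns a list of
    (operation, payload) tuples. This is an illustrative heuristic,
    not the NewsEdits extraction algorithm.
    """
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    removed, added, edits = [], [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # Pair replaced sentences positionally and treat them as edits;
            # any unpaired leftovers are plain deletions or additions.
            n = min(i2 - i1, j2 - j1)
            edits += list(zip(old[i1:i1 + n], new[j1:j1 + n]))
            removed += old[i1 + n:i2]
            added += new[j1 + n:j2]
        elif tag == "delete":
            removed += old[i1:i2]
        elif tag == "insert":
            added += new[j1:j2]
    # A sentence that was removed and re-inserted verbatim was moved.
    moved = [s for s in removed if s in added]
    ops = [("refactor", s) for s in moved]
    ops += [("deletion", s) for s in removed if s not in moved]
    ops += [("addition", s) for s in added if s not in moved]
    ops += [("edit", pair) for pair in edits]
    return ops


if __name__ == "__main__":
    v1 = ["A fire broke out downtown.",
          "Two people were injured.",
          "Police are investigating."]
    v2 = ["A fire broke out downtown.",
          "Three people were injured.",
          "Police are investigating.",
          "The mayor will speak at noon."]
    for op in label_operations(v1, v2):
        print(op)
```

Exact string matching keeps the sketch short; a realistic pipeline would likely need fuzzy sentence matching, since an edited sentence rarely lines up this cleanly with its predecessor.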
Provider:
Information Sciences Institute, University of Southern California
Created:
2022-06-15



