Wikipedia Current Events Portal (WCEP) dataset
arXiv | Updated: 2020-05-20 | Indexed: 2024-06-21
Download link:
https://github.com/complementizer/wcep-mds-dataset
Resource description:
The WCEP dataset was created by Aylien Ltd. and the Insight Centre for Data Analytics at University College Dublin to address large-scale multi-document summarization. It contains 10,200 news event clusters, each with an average of 235 articles, and pairs news event summaries extracted from the Wikipedia Current Events Portal with their associated articles. To increase the number of documents per event, the construction process extracts and augments articles from Wikipedia and Common Crawl. The dataset supports the training and evaluation of deep learning models, with applications such as news clustering, search result presentation, and timeline generation.
Provided by:
Aylien Ltd. and Insight Centre for Data Analytics, University College Dublin
Created:
2020-05-20
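
The description above characterizes the dataset as event clusters, each pairing one summary with many articles. As a minimal sketch of how a statistic such as the average cluster size could be checked, the snippet below assumes one JSON object per line, with a hypothetical `articles` list and `summary` string per cluster. These field names and the sample records are assumptions for illustration, not a confirmed schema of the released files.

```python
import json
from statistics import mean


def cluster_sizes(lines, articles_key="articles"):
    """Return the number of articles in each cluster (one JSON object per line).

    `articles_key` is an assumed field name; adjust it to the actual schema.
    """
    sizes = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cluster = json.loads(line)
        sizes.append(len(cluster.get(articles_key, [])))
    return sizes


# Tiny in-memory stand-in for a JSON Lines file (made-up records).
sample = [
    json.dumps({"summary": "Event A", "articles": [{"text": "a1"}, {"text": "a2"}, {"text": "a3"}]}),
    json.dumps({"summary": "Event B", "articles": [{"text": "b1"}]}),
]

sizes = cluster_sizes(sample)
average_size = mean(sizes)
```

With a real file, the same function could be passed an open file handle instead of the `sample` list, since it only iterates over lines.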



