NewsHomepages
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/alex2awesome/homepage-newsworthiness-with-internet-archive
Description:
NewsHomepages is a dataset of more than 3,000 news website homepages, captured twice daily over a three-year period, built to study how information is prioritized in news page layouts. Each snapshot includes the page's links and a full-page screenshot, and a subset of pages also comes with a compressed HTML snapshot. Collection is ongoing and supported by community contributions. In total, the dataset contains 363,340 snapshots from 3,489 news homepages. The core task of this work is to analyze homepage layouts and their editorial cues.
Provided by:
Contributing community of activists, developers, and journalists



