
NewsHomepages

Source: arXiv, indexed 2025-09-30
Download link:
https://github.com/alex2awesome/homepage-newsworthiness-with-internet-archive
Description:
NewsHomepages contains captures of more than 3,000 news website homepages, taken twice daily over a three-year period, with the goal of studying how news page layouts prioritize information. The dataset includes webpage links, full-page screenshots, and, for a subset of pages, compressed HTML snapshots. Collection is ongoing and supported by community contributions. In total, there are 363,340 snapshots from 3,489 news homepages. The core task of this work is to analyze homepage layouts and the editorial cues they carry.
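To make the contents described above concrete, here is a minimal Python sketch of what a single snapshot record might look like. It is illustrative only: the class name `HomepageSnapshot`, every field name, and the use of gzip for the compressed HTML are assumptions, not the dataset's actual schema. The two totals are taken from the description.

```python
import gzip
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class HomepageSnapshot:
    """Hypothetical record for one homepage capture (field names are assumed)."""
    homepage_url: str                # link to the news homepage
    captured_at: datetime            # time of the capture
    screenshot_path: str             # path to the full-page screenshot
    html_gz: Optional[bytes] = None  # compressed HTML, present only for a subset

    def html(self) -> Optional[str]:
        """Return the decompressed HTML, or None if no HTML was stored.

        Assumes gzip compression; the source only says "compressed".
        """
        if self.html_gz is None:
            return None
        return gzip.decompress(self.html_gz).decode("utf-8")


# Totals reported in the dataset description.
TOTAL_SNAPSHOTS = 363_340
TOTAL_HOMEPAGES = 3_489


def mean_snapshots_per_homepage() -> float:
    """Average number of snapshots per homepage, from the reported totals."""
    return TOTAL_SNAPSHOTS / TOTAL_HOMEPAGES
```

With the reported totals, `mean_snapshots_per_homepage()` comes to roughly 104 snapshots per homepage.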
Provided by:
Contributing community of activists, developers, and journalists