
One Week of Global News Feeds

Source: www.kaggle.com. Updated 2020-05-13. Indexed 2025-03-23.
Download link:
https://www.kaggle.com/therohk/global-news-week
Description:
# Context

This dataset is a snapshot of most of the new news content published online over one week. It covers the 7-day period of August 24 through August 30 for the years 2017 and 2018.

Year 2017: **1,398,431** ; Year 2018: **1,912,872**

It includes approximately **3.3 million** articles, with **20,000 news sources** and **20+ languages**.

This dataset has just four fields (as per the [column metadata](https://www.kaggle.com/therohk/global-news-week/data)):

- **publish_time** - earliest known time of the URL appearing online, in yyyyMMddHHmm format, IST timezone
- **feed_code** - unique identifier for the publisher or domain
- **source_url** - URL of the article
- **headline_text** - headline of the article (UTF-8, any language)

See the ["Basic Feed-Code Exploration"](https://www.kaggle.com/therohk/basic-feed-code-exploration) notebook for a quick look at the dataset contents.

# Inspiration

The sources include news feeds, news websites, government agencies, tech journals, company websites, blogs and Wikipedia updates.

The data has been collected by polling RSS feeds and by crawling other large news aggregators.

As of 2018, these 7-day slices were selected because there was no downtime or outage during the intervals. Publishers produce new content at this rate every day, throughout the year.

# Acknowledgements

This dataset is free to use with the following citation:

**Rohit Kulkarni** (2018), One Week of Global Feeds [News CSV Dataset], doi:10.7910/DVN/ILAT5B, Retrieved from: [this url]

Original paper by M Trampus, B Novak: Internals of An Aggregated Web News Feed

Hosted by: Jožef Stefan Institute, Slovenia (http://ailab.ijs.si/si/people)

Further exploration and live news: (eventregistry.org)
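
As a rough illustration of working with the four fields described above, the sketch below parses `publish_time` values and counts articles per `feed_code`. It rests on several assumptions that the description does not confirm: that "IST" means India Standard Time (UTC+05:30), that the CSV has a header row using the four field names, and that the sample rows (which are made up for this example) resemble the real data. The helper names `parse_publish_time` and `count_by_feed` are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: "IST" is India Standard Time, a fixed UTC+05:30 offset.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def parse_publish_time(value: str) -> datetime:
    """Parse a yyyyMMddHHmm string into a timezone-aware IST datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M").replace(tzinfo=IST)


def count_by_feed(rows) -> Counter:
    """Count how many articles each feed_code contributed."""
    return Counter(row["feed_code"] for row in rows)


# Made-up sample rows; the real files are assumed to share this header.
SAMPLE_CSV = """publish_time,feed_code,source_url,headline_text
201708240000,example-a,http://example.com/a1,First headline
201708240530,example-a,http://example.com/a2,Second headline
201808302359,example-b,http://example.org/b1,Third headline
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

for row in rows:
    local = parse_publish_time(row["publish_time"])
    # Convert to UTC so timestamps can be compared with other sources.
    print(local.isoformat(), "->", local.astimezone(timezone.utc).isoformat())

print(count_by_feed(rows))
```

A fixed offset is used here instead of a named zone so the example runs without a timezone database. If "IST" refers to a different zone in the actual data, only the `IST` constant needs to change.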

Provider:
Kaggle