
bitsinthesky/jsv_news

Hugging Face, updated 2024-05-09, indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/bitsinthesky/jsv_news
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: title
    dtype: string
  - name: time_published
    dtype: timestamp[us, tz=+03:00]
  - name: organization
    dtype: string
  - name: url
    dtype: string
  - name: body
    dtype: string
  splits:
  - name: train
    num_bytes: 327664848
    num_examples: 96601
  download_size: 196562266
  dataset_size: 327664848
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

---

# Dataset Details

~96,000 news articles I collected from the second half of 2019 through 2021, made for my own project. Articles were cleaned with the readability project in Python.

# Features:

1) ID number (int)
   1) (0, 96601)
2) Title of article (str)
   1) 13 words on average, min: 1, max: 46
3) Time published (timestamp)
   1) Collected between 2019 and 2021ish, NOT evenly distributed. There are large gaps with no collection.
   2) I don't know how time zones were handled, so take the timestamps with a grain of salt. I'd say you can trust them to the day.
4) Organization (str)
   1) Below are (org: count)
   2) ('www.rt.com', 22303),
   3) ('www.theepochtimes.com', 9930),
   4) ('www.nytimes.com', 9823),
   5) ('rssfeeds.usatoday.com', 9201),
   6) ('www.businessinsider.com', 7478),
   7) ('www.nationalreview.com', 4962),
   8) ('english.sina.com', 4635),
   9) ('abcnews.go.com', 4568),
   10) ('www.foxnews.com', 3755),
   11) ('www.theatlantic.com', 3482),
   12) ('www.oann.com', 3392),
   13) ('feeds.foxnews.com', 3307),
   14) ('foreignpolicy.com', 3234),
   15) ('www.washingtonpost.com', 2934),
   16) ('webfeeds.brookings.edu', 1484),
   17) ('meduza.io', 1223),
   18) ('markets.businessinsider.com', 886),
   19) ('businessinsider.com', 1),
   20) ('projects.fivethirtyeight.com', 1),
   21) ('www.foxbusiness.com', 1),
   22) ('fivethirtyeight.com', 1)
5) Url (str)
   1) Full URL as source
6) Body (str)
   1) mean 496 words, min: 2, max: 15272
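The organization column holds feed hostnames, so one outlet can appear under several values (for example `www.foxnews.com` and `feeds.foxnews.com`). A minimal sketch of merging them, using the counts listed in the card. The prefix list `FEED_PREFIXES` and the helper names `outlet` and `merge_by_outlet` are illustrative assumptions, not part of the dataset, and whether `markets.businessinsider.com` should be folded into `businessinsider.com` is a judgment call.

```python
from collections import Counter

# (org: count) pairs copied from the card's Organization list.
ORG_COUNTS = {
    "www.rt.com": 22303, "www.theepochtimes.com": 9930,
    "www.nytimes.com": 9823, "rssfeeds.usatoday.com": 9201,
    "www.businessinsider.com": 7478, "www.nationalreview.com": 4962,
    "english.sina.com": 4635, "abcnews.go.com": 4568,
    "www.foxnews.com": 3755, "www.theatlantic.com": 3482,
    "www.oann.com": 3392, "feeds.foxnews.com": 3307,
    "foreignpolicy.com": 3234, "www.washingtonpost.com": 2934,
    "webfeeds.brookings.edu": 1484, "meduza.io": 1223,
    "markets.businessinsider.com": 886, "businessinsider.com": 1,
    "projects.fivethirtyeight.com": 1, "www.foxbusiness.com": 1,
    "fivethirtyeight.com": 1,
}

# Assumption: these leading labels only mark a feed or section of the
# same outlet. Other subdomains (e.g. "english.", "abcnews.") are kept.
FEED_PREFIXES = ("www.", "feeds.", "rssfeeds.", "webfeeds.",
                 "markets.", "projects.")


def outlet(host: str) -> str:
    """Strip one known feed/section prefix from a hostname."""
    for prefix in FEED_PREFIXES:
        if host.startswith(prefix):
            return host[len(prefix):]
    return host


def merge_by_outlet(counts: dict) -> Counter:
    """Sum article counts over hostnames that map to the same outlet."""
    merged = Counter()
    for host, n in counts.items():
        merged[outlet(host)] += n
    return merged


merged = merge_by_outlet(ORG_COUNTS)
total = sum(ORG_COUNTS.values())  # should match num_examples in the metadata
for name, n in merged.most_common(5):
    print(f"{name}: {n} ({n / total:.1%})")
```

The same `outlet` function could be mapped over the `organization` column after loading the data, for instance with `load_dataset("bitsinthesky/jsv_news", split="train")` from the Hugging Face `datasets` library.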
Provider:
bitsinthesky
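Summary of original information

Dataset overview

Dataset features

  • id: integer, range (0, 96601)
  • title: string; 13 words on average, minimum 1, maximum 46
  • time_published: timestamp; collected from 2019 to 2021, unevenly distributed; time zone handling is unclear, so trust the timestamps only to the day
  • organization: string; the card lists per-outlet article counts, e.g. (www.rt.com, 22303)
  • url: string; full URL of the source article
  • body: string; 496 words on average, minimum 2, maximum 15272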
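Because the timestamps are only reliable to the day and collection has large gaps, it may help to reduce `time_published` to calendar dates before any time-based analysis. A minimal sketch under stated assumptions: the `+03:00` offset is taken from the schema, naive values are assumed to already be in that offset, and the helper names, the 7-day threshold, and the sample values are all made up for illustration.

```python
from datetime import date, datetime, timedelta, timezone

# Offset declared in the schema: timestamp[us, tz=+03:00].
SCHEMA_TZ = timezone(timedelta(hours=3))


def published_day(ts: datetime) -> date:
    """Reduce a timestamp to its calendar date in the schema offset."""
    if ts.tzinfo is None:
        # Assumption: naive values are already expressed in +03:00.
        ts = ts.replace(tzinfo=SCHEMA_TZ)
    return ts.astimezone(SCHEMA_TZ).date()


def collection_gaps(timestamps, min_gap_days: int = 7):
    """Return (last_day_before, first_day_after) pairs for long gaps."""
    days = sorted({published_day(ts) for ts in timestamps})
    gaps = []
    for prev, cur in zip(days, days[1:]):
        if (cur - prev).days >= min_gap_days:
            gaps.append((prev, cur))
    return gaps


# Made-up timestamps, not taken from the dataset.
sample = [
    datetime(2019, 8, 1, 23, 30, tzinfo=timezone.utc),
    datetime(2019, 8, 2, 9, 0, tzinfo=SCHEMA_TZ),
    datetime(2020, 1, 15, 12, 0, tzinfo=SCHEMA_TZ),
]
gaps = collection_gaps(sample)
print(gaps)
```

Working with dates rather than full timestamps keeps the analysis within the precision the card says can be trusted.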

Dataset splits

  • train: training split with 96601 examples, 327664848 bytes in total

Dataset size

  • Download size: 196562266 bytes
  • Total dataset size: 327664848 bytes
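The byte counts are easier to read after a little arithmetic. The sketch below derives the average size per example and the ratio of download size to in-memory size from the figures above; the variable and function names are illustrative only.

```python
# Figures copied from the dataset metadata.
NUM_BYTES = 327_664_848      # dataset_size / num_bytes of the train split
NUM_EXAMPLES = 96_601        # num_examples of the train split
DOWNLOAD_SIZE = 196_562_266  # download_size


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"dataset:  {to_mib(NUM_BYTES):.1f} MiB")
print(f"download: {to_mib(DOWNLOAD_SIZE):.1f} MiB")
print(f"average example: {bytes_per_example:.0f} bytes")
print(f"download / dataset: {download_ratio:.0%}")
```

So the download is roughly 60% of the dataset size, and an average article takes a little over 3 KB.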