CC-News (CommonCrawl News dataset)
OpenDataLab · Updated 2026-05-24 · Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/CC-News
Description:
We are pleased to announce the release of a new dataset containing news articles from news websites around the world.
The data is available on AWS S3 in the commoncrawl bucket under the path crawl-data/CC-NEWS/. WARC files are released daily and can be identified by their filename prefix, which includes the year and month. We provide lists of the released WARC files, organized by year and month, from 2016 to the present.
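As a minimal sketch of how the year-and-month organization could be used to select files, the Python helpers below build a key prefix for one month and filter a list of keys by filename prefix. The bucket name and base path come from the description above. The `YYYY/MM/` sub-path and the `CC-NEWS-YYYYMM` file name pattern are assumptions made for illustration, as are the function names and the sample keys; check them against the published file lists before relying on them.

```python
BUCKET = "commoncrawl"               # bucket name given in the description
BASE_PREFIX = "crawl-data/CC-NEWS/"  # path given in the description


def month_prefix(year: int, month: int) -> str:
    """Return the assumed S3 key prefix for one month of CC-News WARC files.

    Assumption: keys are laid out as crawl-data/CC-NEWS/YYYY/MM/...
    """
    if year < 2016:
        raise ValueError("the file lists start in 2016")
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return f"{BASE_PREFIX}{year:04d}/{month:02d}/"


def filter_by_month(keys, year: int, month: int):
    """Keep keys whose file name starts with the assumed CC-NEWS-YYYYMM prefix."""
    wanted = f"CC-NEWS-{year:04d}{month:02d}"
    return [k for k in keys if k.rsplit("/", 1)[-1].startswith(wanted)]


if __name__ == "__main__":
    # Hypothetical keys, made up for illustration only.
    sample = [
        "crawl-data/CC-NEWS/2016/08/CC-NEWS-20160826124520-00000.warc.gz",
        "crawl-data/CC-NEWS/2016/09/CC-NEWS-20160901000000-00001.warc.gz",
    ]
    print(month_prefix(2016, 8))
    print(filter_by_month(sample, 2016, 8))
```

The functions are pure string operations, so they can be paired with any S3 client or HTTP listing without changes.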
Provided by:
OpenDataLab
Created:
2022-11-02



