greenfish/ccnews-filtered
Hugging Face | Updated 2025-11-01 | Indexed 2025-11-15
Download link:
https://hf-mirror.com/datasets/greenfish/ccnews-filtered
Description:
The dataset contains web content from 2016 to 2021. It is provided in configurations for different years, and each configuration is divided into multiple subsets. Entries in each subset include the requested URL, plain text content, publication date, title, tags, categories, author, site name, image URL, language, language score, responded URL, publisher, WARC path, and crawl date. The size and number of examples in each subset are listed in the README file.
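As a rough illustration of how the per-record fields described above might be used, here is a minimal sketch that filters records by language and language score. The field names (`requested_url`, `plain_text`, `language_score`, and so on), the `NewsRecord` class, the `filter_by_language` helper, and the sample rows are all assumptions made for this example; the actual column names and configuration names are those listed in the dataset's README.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class NewsRecord:
    # Hypothetical field names that mirror the description above,
    # not the dataset's confirmed schema.
    requested_url: str
    plain_text: str
    published_date: Optional[str]
    title: Optional[str]
    language: str
    language_score: float
    crawl_date: str


def filter_by_language(
    records: Iterable[NewsRecord], language: str, min_score: float = 0.9
) -> Iterator[NewsRecord]:
    """Yield records in the given language whose language score is at least min_score."""
    for record in records:
        if record.language == language and record.language_score >= min_score:
            yield record


# Made-up sample rows, used only to exercise the filter.
sample = [
    NewsRecord("https://example.com/a", "Some text.", "2016-08-01", "A", "en", 0.98, "2016-08-02"),
    NewsRecord("https://example.com/b", "Etwas Text.", "2017-03-10", "B", "de", 0.95, "2017-03-11"),
    NewsRecord("https://example.com/c", "Mixed text.", None, None, "en", 0.40, "2018-01-05"),
]

kept = list(filter_by_language(sample, "en"))
print([r.requested_url for r in kept])
```

With the Hugging Face `datasets` library, real records would typically be obtained through `load_dataset` with the repository ID and one of the year configurations named in the README, then mapped onto whatever column names the dataset actually uses.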
Provider:
greenfish



