stanford-oval/ccnews
Source: Hugging Face. Updated 2024-08-31. Indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/stanford-oval/ccnews
Resource overview:
This dataset is the result of processing all WARC files in the [CCNews Corpus](https://commoncrawl.org/blog/news-dataset-available), from its beginning in 2016 through June 2024. The data has been cleaned and deduplicated, and the language of each article has been detected and added. It contains about 600 million news articles from around the world in more than 100 languages, and is suited to tasks such as text classification, question answering, text generation, and text-to-text generation. The dataset falls in the 100M to 1B size category, is tagged as news, and includes multiple configurations, each corresponding to the data files for a different year.
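Because the corpus is large and split into per-year configurations, streaming one year at a time is usually more practical than downloading everything. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed, that the configurations are named by year ("2016" through "2024", as the distribution names on this page suggest), and that each exposes a `train` split. The helper names `year_configs` and `stream_year` are hypothetical.

```python
from typing import List


def year_configs(first_year: int = 2016, last_year: int = 2024) -> List[str]:
    """Return the assumed per-year configuration names, e.g. '2016' ... '2024'."""
    return [str(year) for year in range(first_year, last_year + 1)]


def stream_year(year: int, repo_id: str = "stanford-oval/ccnews"):
    """Open one yearly configuration as a streaming dataset.

    Assumptions: the configuration name equals the year, and the split is
    called 'train'. Nothing is downloaded until records are iterated.
    """
    # Imported lazily so that year_configs() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name=str(year), split="train", streaming=True)


# Example usage (requires network access):
#     for record in stream_year(2016):
#         print(record["title"], record["language"])
#         break
```

Streaming keeps memory use flat, since records are fetched lazily as the loop advances rather than materialized on disk first.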
Provider:
stanford-oval
Summary of original information
Dataset overview
Dataset name
- stanford-oval/ccnews
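Dataset description
- The dataset contains multiple subsets, each corresponding to news data from a different year, converted to and stored in Parquet format.
Dataset distribution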
- repo:
  - Description: HF Mirror git repository.
  - Format: git+https
  - URL: https://hf-mirror.com/datasets/stanford-oval/ccnews/tree/refs%2Fconvert%2Fparquet
- parquet-files-for-config-default:
  - Description: Underlying Parquet files converted by HF Mirror.
  - Format: application/x-parquet
  - Contains: default//.parquet
- parquet-files-for-config-2016 through parquet-files-for-config-2024:
  - Description: Underlying Parquet files converted by HF Mirror for the corresponding year.
  - Format: application/x-parquet
  - Contains: Parquet files for the corresponding year
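The repository URL above points at the `refs/convert/parquet` git reference, with its slashes percent-encoded as `%2F`. As a small illustration, the sketch below rebuilds that URL with the standard library; the function name `parquet_branch_url` and the default `base` argument are hypothetical choices for this example, not part of any official API.

```python
from urllib.parse import quote


def parquet_branch_url(repo_id: str, base: str = "https://hf-mirror.com") -> str:
    """Build the browse URL for a dataset's Parquet conversion reference."""
    # safe="" forces "/" inside the ref name to be encoded as %2F, so the ref
    # stays a single path segment instead of being read as nested directories.
    ref = quote("refs/convert/parquet", safe="")
    return f"{base}/datasets/{repo_id}/tree/{ref}"


print(parquet_branch_url("stanford-oval/ccnews"))
```

The slashes in `repo_id` are left unencoded on purpose, since they are genuine path separators in the URL.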
Dataset fields
- default subset:
  - Description: stanford-oval/ccnews, default subset (first 5 GB)
  - Fields:
    - requested_url
    - plain_text
    - published_date
    - title
    - tags
    - categories
    - author
    - sitename
    - image_url
    - language
    - language_score
    - responded_url
    - publisher
    - warc_path
    - crawl_date
- 2016 subset:
  - Description: stanford-oval/ccnews, 2016 subset (first 5 GB)
  - Fields:
    - requested_url
    - plain_text
    - published_date
    - title
    - tags
    - categories
    - author
    - sitename
    - image_url
    - language
    - language_score
    - responded_url
    - publisher
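The `language` and `language_score` fields make it straightforward to keep only the articles detected in one language. The sketch below works on plain dictionaries keyed by the field names listed above. It assumes that `language` holds a short code such as "en" and that `language_score` is a detector confidence where higher is better; neither is confirmed by this page. The sample records are invented for illustration and are not taken from the dataset.

```python
from typing import Any, Iterable, Iterator, Mapping


def filter_by_language(
    records: Iterable[Mapping[str, Any]],
    language: str,
    min_score: float = 0.8,
) -> Iterator[Mapping[str, Any]]:
    """Yield records in `language` whose detection score is at least `min_score`."""
    for record in records:
        score = record.get("language_score")
        # Records with a missing score are skipped rather than trusted.
        if record.get("language") == language and score is not None and score >= min_score:
            yield record


# Made-up records for illustration only; real rows carry more fields.
SAMPLE_RECORDS = [
    {"title": "A", "language": "en", "language_score": 0.99},
    {"title": "B", "language": "en", "language_score": 0.42},
    {"title": "C", "language": "de", "language_score": 0.97},
    {"title": "D", "language": "en", "language_score": None},
]

kept_titles = [r["title"] for r in filter_by_language(SAMPLE_RECORDS, "en")]
print(kept_titles)
```

Because the function is a generator, it can be applied directly to a streamed dataset without loading every record into memory.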

The content above was collected and summarized by 遇见数据集.



