CCNEWS

Name: CCNEWS
Creator: CommonCrawl
License: No description available

Indexed from arXiv on 2025-09-30

Download link:

http://commoncrawl.org/2016/10/news-dataset-available

Description:

This dataset is a large, deduplicated subset of the English portion of the CommonCrawl news dataset, covering news articles published worldwide between September 2016 and February 2019. The subset is drawn from a larger parent corpus of approximately 1 billion sentences, or 27 billion words. For experimental purposes, we used the first 10% of the parent corpus. The resulting subset contains 100 million sentences and 2.7 billion words, and it was used for pretraining on dialogue response generation.
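The subset sizes follow directly from the 10% cut: 10% of 1 billion sentences is 100 million, and 10% of 27 billion words is 2.7 billion. A minimal Python sketch of that arithmetic and of taking the leading portion of a sentence stream is given here. The function names (`subset_size`, `first_percent`, `corpus_size`) are hypothetical, and counting words by whitespace splitting is an assumption; the source does not say how sentences or words were counted.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def subset_size(total: int, percent: int) -> int:
    """Number of items in the leading `percent` of a corpus (integer math)."""
    return total * percent // 100


def first_percent(sentences: Iterable[str], total: int, percent: int = 10) -> Iterator[str]:
    """Yield only the first `percent` of a stream known to hold `total` sentences."""
    return islice(sentences, subset_size(total, percent))


def corpus_size(sentences: Iterable[str]) -> Tuple[int, int]:
    """Return (sentence count, word count), splitting words on whitespace (assumed)."""
    n_sentences = 0
    n_words = 0
    for sentence in sentences:
        n_sentences += 1
        n_words += len(sentence.split())
    return n_sentences, n_words


if __name__ == "__main__":
    # Figures quoted in the description: 1 billion sentences, 27 billion words.
    print(subset_size(1_000_000_000, 10))
    print(subset_size(27_000_000_000, 10))
```

Because `first_percent` wraps the input in `islice`, it never reads past the cut-off, so it can be applied to a corpus that is streamed from disk rather than loaded into memory.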

Provider:

CommonCrawl
