CCNEWS
Source: arXiv. Indexed: 2025-09-30
Download link:
http://commoncrawl.org/2016/10/news-dataset-available
Description:
This is a large, deduplicated subset of the English portion of the CommonCrawl news dataset, covering news articles published worldwide between September 2016 and February 2019. It is drawn from a larger parent dataset of approximately 1 billion sentences, or 27 billion words. For experimental purposes, we used the first 10% of that parent dataset. The resulting data comprise 100 million sentences and 2.7 billion words, and were used as pretraining data for dialogue response generation.
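The subsetting described above (deduplicate, then keep the first 10%) can be sketched as follows. This is a minimal illustration only: the listing does not say how deduplication was performed, so exact-match deduplication on whitespace-stripped sentences is an assumption, and the helper names `dedupe` and `first_percent` are hypothetical.

```python
def dedupe(sentences):
    """Yield each distinct non-empty sentence once, in first-seen order.

    Assumption: exact string match after stripping whitespace. The actual
    deduplication method used for the dataset is not stated in the listing.
    """
    seen = set()
    for sentence in sentences:
        key = sentence.strip()
        if key and key not in seen:
            seen.add(key)
            yield key


def first_percent(items, percent=10):
    """Return the leading `percent` percent of items (integer arithmetic)."""
    items = list(items)
    keep = len(items) * percent // 100
    return items[:keep]


# Toy corpus: 25 sentences, 20 of them distinct.
corpus = [f"s{i % 20}" for i in range(25)]
subset = first_percent(dedupe(corpus), percent=10)

# The same 10% arithmetic applied to the sizes quoted in the description.
parent_sentences = 1_000_000_000
parent_words = 27_000_000_000
subset_sentences = parent_sentences * 10 // 100
subset_words = parent_words * 10 // 100
```

Integer arithmetic is used for the cut-off so that the subset size does not depend on floating-point rounding.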
Provided by:
CommonCrawl



