common-pile/cccc
Source: Hugging Face | Updated: 2025-06-06 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/common-pile/cccc
Description:
The Creative Commons Common Crawl dataset contains text from 52 Common Crawl snapshots, which is more than half of the snapshots available to date and spans every year Common Crawl has operated. The text was cleaned through a series of processing steps, including license verification, deduplication, and removal of low-quality content. The dataset holds over 51 million documents with a total size of 260 GB.
Provider:
common-pile



