BramVanroy/CommonCrawl-CreativeCommons-strict

Name: BramVanroy/CommonCrawl-CreativeCommons-strict
Creator: BramVanroy
Published: 2025-08-28 09:20:02
License: No description available

Source: Hugging Face. Updated: 2025-08-28. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/BramVanroy/CommonCrawl-CreativeCommons-strict

Description:

The Common Crawl Creative Commons Corpus Strict (C5s) is a filtered version of the Common Crawl Creative Commons Corpus that retains only samples meeting all of the following criteria: (1) the sample is also present in the FineWeb or FineWeb-2 datasets; (2) there is no license disagreement, meaning every license found for the sample is of the same type, although version numbers may differ; (3) the license is not non-commercial, that is, it does not contain nc; (4) the license is not cc-unknown; and (5) the name does not contain wiki. A less strict version, C5f, is also provided and is based solely on the FineWeb data. The dataset covers multiple languages and is offered in several configurations, each with its own data files and language. Its features include text, id, dump, url, date, file_path, and various license-related fields. The README includes a table listing the number of documents and tokens per language in the strict release, which shows how much data the filtering removes.
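The filtering criteria described above can be read as a single predicate applied to each sample. The following is a minimal sketch of that idea, not the author's actual pipeline. The field names `found_in_fw`, `license_disagreement`, and `license_abbr` are assumptions made for illustration (the description only mentions "various license-related fields"), and checking for wiki in the `url` field is likewise an assumption about what "name" refers to.

```python
from typing import Any, Mapping


def passes_strict_filter(record: Mapping[str, Any]) -> bool:
    """Return True if a record meets the five criteria listed in the description.

    All field names used here except `url` are hypothetical and chosen
    for illustration; they are not confirmed by the listing.
    """
    license_abbr = str(record.get("license_abbr", "")).lower()
    url = str(record.get("url", "")).lower()
    return (
        bool(record.get("found_in_fw", False))            # present in FineWeb / FineWeb-2
        and not record.get("license_disagreement", True)  # all detected license types agree
        and "nc" not in license_abbr.split("-")           # not a non-commercial license
        and license_abbr != "cc-unknown"                  # license type is known
        and "wiki" not in url                             # assumption: "name" is checked via the URL
    )


# Invented sample records, each failing at most one criterion.
records = [
    {"id": "a", "found_in_fw": True, "license_disagreement": False,
     "license_abbr": "by-sa", "url": "https://example.org/post"},
    {"id": "b", "found_in_fw": True, "license_disagreement": False,
     "license_abbr": "by-nc-sa", "url": "https://example.org/post"},
    {"id": "c", "found_in_fw": True, "license_disagreement": False,
     "license_abbr": "cc-unknown", "url": "https://example.org/post"},
    {"id": "d", "found_in_fw": True, "license_disagreement": False,
     "license_abbr": "by", "url": "https://wiki.example.org/page"},
    {"id": "e", "found_in_fw": False, "license_disagreement": False,
     "license_abbr": "by", "url": "https://example.org/post"},
    {"id": "f", "found_in_fw": True, "license_disagreement": True,
     "license_abbr": "by", "url": "https://example.org/post"},
]

kept = [r["id"] for r in records if passes_strict_filter(r)]
print(kept)  # ['a']
```

Splitting the license abbreviation on hyphens, rather than searching for the substring nc, avoids false matches inside longer tokens. Missing fields default to the failing side, so an incomplete record is dropped rather than kept.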

Provider:

BramVanroy
