BramVanroy/CommonCrawl-CreativeCommons

Name: BramVanroy/CommonCrawl-CreativeCommons
Creator: BramVanroy
Published: 2025-08-28 09:23:12
License: No description available

Hugging Face | Updated 2025-08-28 | Indexed 2025-08-30

Download link:

https://hf-mirror.com/datasets/BramVanroy/CommonCrawl-CreativeCommons

Description:

The Common Crawl Creative Commons Corpus (C5) is a corpus of web data annotated with Creative Commons license information. It is suited to tasks such as text generation and language modeling. The dataset covers multiple languages, with separate data files and configurations for each language and crawl period. It is intended for academic research and other applications that require Creative Commons licensed data. The README explains how to use the dataset and reports its progress, the languages included, the quantity of data, and the available fields. It also gives usage recommendations and caveats, information on the handling of personal and sensitive information, citation guidelines, and acknowledgments.
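Since the description mentions one configuration per language and crawl period, loading a single slice might look like the sketch below. It assumes the Hugging Face `datasets` library is installed. The configuration naming scheme (`<crawl>-<language>`) and the example values `CC-MAIN-2019-30` and `nld` are illustrative assumptions, not taken from this listing; check the dataset README for the actual configuration names.

```python
from itertools import islice

REPO_ID = "BramVanroy/CommonCrawl-CreativeCommons"


def config_name(crawl: str, lang: str) -> str:
    """Build a configuration name from a crawl ID and a language code.

    ASSUMPTION: the '<crawl>-<lang>' pattern is hypothetical and used
    only for illustration; the README defines the real names.
    """
    return f"{crawl}-{lang}"


def stream_samples(repo_id: str, config: str, n: int = 3) -> list:
    """Stream the first n records of one configuration without a full download."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    ds = load_dataset(repo_id, config, split="train", streaming=True)
    return list(islice(ds, n))


if __name__ == "__main__":
    # Hypothetical crawl ID and language code; replace with real values.
    cfg = config_name("CC-MAIN-2019-30", "nld")
    for record in stream_samples(REPO_ID, cfg):
        # Field names are not listed here, so only the keys are printed.
        print(sorted(record.keys()))
```

Streaming avoids downloading a whole crawl just to inspect the available fields, which is useful before committing to a full download.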

Provider:

BramVanroy
