CommonCrawl-CreativeCommons
Source: ModelScope community (updated 2026-01-06, indexed 2025-05-10)
Download link: https://modelscope.cn/datasets/AI-ModelScope/CommonCrawl-CreativeCommons
# The Common Crawl Creative Commons Corpus (C5)
> **Raw CommonCrawl crawls, annotated with Creative Commons license information**
C5 is an effort to collect Creative Commons-licensed web data in one place.
The licensing information is extracted from the web pages based on whether they link to Creative Commons licenses either overtly in `a` tags (like in the footer of Wikipedia) or in metadata fields indicating deliberate Creative Commons publication. **However, false positives may occur! See Recommendations and Caveats below!** Also see [Personal and Sensitive Information](#personal-and-sensitive-information).
## Code
I am very grateful to the Flemish Supercomputer Center for providing the compute needed to create this dataset, but as you can tell there is still a lot of data left to process. I am therefore happy to collaborate to process as many Common Crawl crawls as possible. [Shoot me a message](mailto:bram.vanroy@kuleuven.be) if you want to sponsor this project with compute! You can also simply run the code yourself if you'd like. You can find the whole code base, based on `datatrove`, on [GitHub](https://github.com/BramVanroy/CommonCrawl-CreativeCommons). If you use the code, please [reference my work](https://github.com/BramVanroy/CommonCrawl-CreativeCommons?tab=readme-ov-file#citation) accordingly and share your processed crawls with the rest of the world (or get in touch with me so I can add them to this repo).
The approach to creating this dataset differs from similar endeavors such as the awesome [common-pile/dolma-cccc](https://huggingface.co/datasets/common-pile/dolma-cccc) and [C4Corpus](https://data.commoncrawl.org/contrib/c4corpus/CC-MAIN-2016-07/index.html) datasets. They rely on intricately crafted regular expressions to quickly extract potential licenses from a web page (string-based matching). However, doing so makes it hard to retrieve any structural meta information about the license, such as where it was found on the page. In C5, the whole web page is parsed into a programmatic structure, allowing for an iterative search through this parsed "tree". That makes it possible to track where licenses were found (in the head of a document, for instance). Such information is crucial to minimise false positives: if a license is referred to in a `meta` tag in the `head` of an HTML page, it is more trustworthy than a "random link" to a copyright license in the middle of a web page, which might just be discussing the license in general or providing a license for a picture on the website. Metadata *about* the license makes it possible to attach confidence to the extracted licenses, enabling robust filtering to avoid false positives. While I strongly believe this approach is valuable, it is also very *slow* compared to a regex search!
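To make the difference with plain string matching concrete, here is a minimal sketch of a tree-aware search. It is not the code used to build C5: it relies on Python's standard `html.parser` instead of the project's actual parser, the names `LicenseLinkFinder` and `find_license_links` are hypothetical, and JSON-LD metadata is not handled. It only illustrates how walking the parsed structure lets you record *where* a Creative Commons link occurs.

```python
from html.parser import HTMLParser

CC_DOMAIN = "creativecommons.org"
# Elements without a closing tag; they are never pushed onto the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LicenseLinkFinder(HTMLParser):
    """Collect Creative Commons links together with where they occur."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, is_footer_like)
        self.found = []  # one dict per Creative Commons link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # "Footer-like": a footer element, or "footer" in the id or class.
        footer_like = (
            tag == "footer"
            or "footer" in (attrs.get("id") or "")
            or "footer" in (attrs.get("class") or "")
        )
        in_head = any(open_tag == "head" for open_tag, _ in self.stack)
        in_footer = footer_like or any(flag for _, flag in self.stack)
        target = attrs.get("href") or attrs.get("content") or ""
        if tag in {"a", "link", "meta"} and CC_DOMAIN in target:
            self.found.append({
                "location": f"{tag}_tag",
                "in_head": in_head,
                "in_footer": in_footer,
                "url": target,
            })
        if tag not in VOID_TAGS:
            self.stack.append((tag, footer_like))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; tolerate sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_license_links(html):
    """Return all Creative Commons links in `html` with their position."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.found


if __name__ == "__main__":
    page = (
        '<html><head><link rel="license" '
        'href="https://creativecommons.org/licenses/by-sa/4.0/"></head>'
        '<body><footer><a href="https://creativecommons.org/licenses/by/4.0/">'
        "CC BY</a></footer></body></html>"
    )
    for hit in find_license_links(page):
        print(hit)
```

A regex could find both URLs in this page just as well, but it would not know that one sits in the `head` and the other in a footer, which is the information used later to rank licenses.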
## Usage
```python
from datasets import load_dataset
# Everything, most recent -- massive, you will need streaming
ds = load_dataset("BramVanroy/CommonCrawl-CreativeCommons", streaming=True)
# v1 (2019-30, 2020-05, 2022-05, 2023-06, 2024-51, 2025-05, 2024-46)
ds = load_dataset("BramVanroy/CommonCrawl-CreativeCommons", "v1", streaming=True)
# Single dump, all languages -- large, you may need streaming on non-server hardware
ds = load_dataset("BramVanroy/CommonCrawl-CreativeCommons", "CC-MAIN-2019-30")
# Single language, all dumps -- very large, you will likely need streaming
ds = load_dataset("BramVanroy/CommonCrawl-CreativeCommons", "nld", streaming=True)
# Single language, single dump
ds = load_dataset("BramVanroy/CommonCrawl-CreativeCommons", "CC-MAIN-2019-30-nld")
```
## Progress
In the `v1` release, the following crawls are included:
- CC-MAIN-2019-30
- CC-MAIN-2020-05
- CC-MAIN-2023-06
- CC-MAIN-2024-51
- CC-MAIN-2024-46
- CC-MAIN-2025-05
- CC-MAIN-2022-05
## Languages
The following languages are included. This is a limited set due to computational and storage limitations.
- Afrikaans: afr
- German: deu
- English: eng
- French: fra
- Frisian: fry
- Italian: ita
- Dutch: nld
- Spanish: spa
## Quantity
Detailed number of tokens (Llama 3.3 tokenizer) and number of documents are given in the [counts.json](https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons/blob/main/counts.json) file.
| Language | Number of Documents | Number of Tokens |
| --------- | -------------------:| -------------------:|
| afr | 312,262 | 358,873,448 |
| deu | 9,530,746 | 11,362,859,534 |
| eng | 92,635,372 | 87,537,859,958 |
| fra | 9,234,900 | 12,366,480,025 |
| fry | 230,910 | 197,430,774 |
| ita | 10,734,597 | 11,913,669,333 |
| nld | 2,827,636 | 2,757,074,705 |
| spa | 22,226,944 | 22,515,709,432 |
| **Total** | **147,733,367** | **149,009,957,209** |
## Fields
In some cases, multiple licenses are found on a single page. All of them are collected in `potential_licenses` and then sorted on the three criteria below. For each criterion, the first option is the most preferred and the last the least preferred: a license found in a `meta` tag is more trustworthy than one in an `a` tag, and a license in a footer is more trustworthy than one outside the footer.
1. location_preference_order: meta_tag, json-ld, link_tag, a_tag
2. head_preference_order: True, False
3. footer_preference_order: True, False
Based on these criteria, the "best" license is picked as the one in the `license_*` columns. Potential disagreement between multiple licenses is given in `license_disagreement`.
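The ranking described above can be sketched as a sort key. This is an illustration under assumptions, not the project's implementation: it assumes the three criteria are applied lexicographically in the listed order, it represents each candidate license as a plain dict, and `LOCATION_ORDER`, `preference_key` and `pick_best` are hypothetical names.

```python
# Most preferred location first, as listed in location_preference_order.
LOCATION_ORDER = ["meta_tag", "json-ld", "link_tag", "a_tag"]


def preference_key(lic):
    """Lower tuples sort first: location rank, then in_head, then in_footer."""
    return (
        LOCATION_ORDER.index(lic["location"]),
        not lic["in_head"],    # True is preferred, so it maps to False (0)
        not lic["in_footer"],  # True is preferred, so it maps to False (0)
    )


def pick_best(licenses):
    """Return the most trustworthy candidate, or None if there are none."""
    return min(licenses, key=preference_key) if licenses else None


if __name__ == "__main__":
    candidates = [
        {"abbr": "by-nc", "version": "4.0", "location": "a_tag",
         "in_head": False, "in_footer": False},
        {"abbr": "by-sa", "version": "4.0", "location": "a_tag",
         "in_head": False, "in_footer": True},
        {"abbr": "by", "version": "4.0", "location": "meta_tag",
         "in_head": True, "in_footer": False},
    ]
    print(pick_best(candidates)["abbr"])
```

Because `min` returns the first of several equally ranked items, ties keep the order in which the licenses were found.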
- text: the extracted text (unmodified)
- id: WARC-Record-ID
- dump: Common Crawl crawl
- url: original url for document
- date: crawl date
- file_path: file path on the S3 bucket
- license_abbr: the license type. Possible values: "cc-unknown" (recommended to filter this one out), "by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd", "zero", "certification", "mark". If multiple licenses were found (see `potential_licenses`), this is the abbreviation of the best one
- license_version: the license version, e.g. "4.0"
- license_location: the location where the license was found. Possible values: "meta_tag", "json-ld", "link_tag", "a_tag"
- license_in_head: whether the license was found inside a `head` HTML element
- license_in_footer: whether the license was found inside a `footer` HTML element, or an HTML element that had `footer` in the ID or class name
- potential_licenses:
- abbr: list of all found license abbreviations
- version: list of all found license versions
- location: list of all found license locations
- in_head: list of whether licenses were found in the head
- in_footer: list of whether licenses were found in a footer
- license_parse_error: whether there was a problem when trying to extract the license, e.g. an unparseable HTML document
- license_disagreement: whether the `potential_licenses["abbr"]` disagree, i.e., different types of licenses were found. License *versions* are not included in the comparison!
- language: the language, as detected by glotlid
- language_score: the language identification confidence score
- found_in_fw: whether this sample was found in FineWeb(-2). For non-English, samples from crawls that are more recent than FW2 (everything after 2024-18) are marked as None. For English, samples from crawls that are more recent than FW v1.3 (everything after 2024-51) are marked as None.
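As a small illustration of how the nested `potential_licenses` field relates to `license_disagreement`, the sketch below assumes the field arrives as a dict of parallel lists (as in the field list above) and recomputes the flag by comparing abbreviations only. The helper names and the sample record are hypothetical.

```python
KEYS = ["abbr", "version", "location", "in_head", "in_footer"]


def rows_from_potential(potential):
    """Turn the dict-of-lists layout into one dict per found license."""
    columns = [potential[key] for key in KEYS]
    return [dict(zip(KEYS, values)) for values in zip(*columns)]


def has_disagreement(potential):
    """True if more than one license *type* was found; versions are ignored."""
    return len(set(potential["abbr"])) > 1


if __name__ == "__main__":
    # Hypothetical record: two licenses found on the same page.
    potential = {
        "abbr": ["by-sa", "by-sa"],
        "version": ["3.0", "4.0"],
        "location": ["link_tag", "a_tag"],
        "in_head": [True, False],
        "in_footer": [False, True],
    }
    print(rows_from_potential(potential))
    print(has_disagreement(potential))
```

In this example the two versions differ but the abbreviation is the same, so the record does not count as a disagreement.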
## Recommendations and Caveats
- Raw CommonCrawl data is processed in an attempt to extract licensing information. No quality filtering is done!! It is **highly** recommended to filter this data further on quality, fluency, toxicity, etc.
- Similarly, the data has **not been deduplicated**.
- The licenses include all possible Creative Commons licenses, including non-commercial ones. Take care about what kind of data you wish to use, and filter out non-commercial licenses when needed.
- The column `license_disagreement` indicates whether multiple licenses were found that do not share the same abbreviation, e.g. `cc-by` and `cc-by-nc`. It is recommended to filter these out.
- The column `license_parse_error` indicates whether an error occurred when parsing the license. You probably want to filter out documents where this was the case, though this should be extremely rare.
- Unsurprisingly, the data contains a lot of Wikipedia/Wikimedia content. Depending on what you need, you may wish to filter those out. For Wikipedia specifically, you may opt to use the more thoroughly parsed (but potentially more outdated) [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) set.
- In exceptional cases, a link to creativecommons.org is found but the exact license could not be found. These are under `license_abbr="cc-unknown"` which you may wish to filter out.
Based on these recommendations, two subsets are available:
- [BramVanroy/CommonCrawl-CreativeCommons-fine](https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-fine): retains only samples that are also in FineWeb(-2)
- [BramVanroy/CommonCrawl-CreativeCommons-strict](https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-strict): additional filtering that also removes Wikipedia and non-commercial data
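
If you prefer to filter the full dataset yourself, the recommendations above can be sketched as a predicate over a single sample. This is not the filter used to build the `-fine` or `-strict` subsets: `keep_sample` and `is_wikimedia` are hypothetical names, and the list of Wikimedia domains is an assumption that is likely incomplete.

```python
from urllib.parse import urlparse

NON_COMMERCIAL = {"by-nc", "by-nc-sa", "by-nc-nd"}
# Assumption: an illustrative, probably incomplete list of Wikimedia domains.
WIKIMEDIA_DOMAINS = ("wikipedia.org", "wikimedia.org")


def is_wikimedia(url):
    """True if the URL's host is one of WIKIMEDIA_DOMAINS or a subdomain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in WIKIMEDIA_DOMAINS)


def keep_sample(sample, allow_non_commercial=False, allow_wikimedia=True):
    """Apply the license-related recommendations to one sample (a dict)."""
    if sample["license_abbr"] == "cc-unknown":
        return False
    if sample["license_disagreement"] or sample["license_parse_error"]:
        return False
    if not allow_non_commercial and sample["license_abbr"] in NON_COMMERCIAL:
        return False
    if not allow_wikimedia and is_wikimedia(sample["url"]):
        return False
    return True


if __name__ == "__main__":
    sample = {
        "url": "https://nl.wikipedia.org/wiki/Voorbeeld",
        "license_abbr": "by-sa",
        "license_disagreement": False,
        "license_parse_error": False,
    }
    print(keep_sample(sample))
    print(keep_sample(sample, allow_wikimedia=False))
```

With the `datasets` library, such a predicate can be passed to `.filter(keep_sample)` on a loaded (streaming) dataset. Quality filtering and deduplication are separate steps that this sketch does not cover.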
## Personal and Sensitive Information
C5 is a heavily filtered version of the Common Crawl dataset. Common Crawl respects robots.txt and will not include a website if its robots.txt says so. Even so, if you find that your website was included, you can submit a [removal request](https://docs.google.com/forms/d/e/1FAIpQLSddAIuUui5xnAzBqft6MnzPYihr-AaS-Nj8x01Y6AM8NQ0YLQ/viewform?usp=sharing) indicating that you are the owner of the website.
Take-down notices on other Common Crawl-based datasets such as FineWeb are considered. Domains specified and verified in those take-down notices are not included in this dataset.
In this dataset, measures are taken to anonymise email addresses and public IP addresses following the [FineWeb-2 approach](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2#personal-and-sensitive-information-and-opt-out). Email addresses matching a regular expression are replaced with `firstname.lastname@example.org`. Similarly, IP addresses allocated for [public networks](https://www.iana.org/assignments/iana-ipv4-special-registry/iana-ipv4-special-registry.xhtml) are replaced by unused IP addresses. Despite these best efforts on such large volumes of text, you may still find your personal information in the dataset. In that case, you can submit a [removal request](https://docs.google.com/forms/d/e/1FAIpQLSddAIuUui5xnAzBqft6MnzPYihr-AaS-Nj8x01Y6AM8NQ0YLQ/viewform?usp=sharing).
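For readers who want a feel for what this kind of anonymisation looks like, here is a minimal sketch. It is not the FineWeb-2 or C5 implementation: the regular expressions are simplified assumptions, only IPv4 is handled, and the replacement address `192.0.2.1` (from a range reserved for documentation) is a placeholder chosen for this example.

```python
import ipaddress
import re

# Simplified, illustrative patterns; the real pipeline may use different ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

EMAIL_PLACEHOLDER = "firstname.lastname@example.org"
IP_PLACEHOLDER = "192.0.2.1"  # assumption: reserved documentation address


def _replace_public_ip(match):
    """Replace globally routable IPv4 addresses; leave everything else."""
    candidate = match.group(0)
    try:
        address = ipaddress.ip_address(candidate)
    except ValueError:
        return candidate  # looked like an IP (e.g. 999.1.1.1) but is not one
    return IP_PLACEHOLDER if address.is_global else candidate


def anonymise(text):
    """Mask email addresses and public IPv4 addresses in `text`."""
    text = EMAIL_RE.sub(EMAIL_PLACEHOLDER, text)
    return IPV4_RE.sub(_replace_public_ip, text)


if __name__ == "__main__":
    print(anonymise("Mail jane.doe@example.com from 8.8.8.8 or 192.168.1.10"))
```

Private addresses such as `192.168.1.10` are left untouched because they do not identify a host on the public internet.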
## Citation
In the current absence of a publication, please cite [the dataset](https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons) as follows. Including a footnote url to this page is also appreciated!
```bibtex
@misc{vanroy2025C5,
author = { Bram Vanroy },
title = { CommonCrawl CreativeCommons Corpus (C5) },
year = 2025,
url = { https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons },
doi = { 10.57967/hf/5340 },
publisher = { Hugging Face }
}
```
If you use or modify [the software](https://github.com/BramVanroy/CommonCrawl-CreativeCommons), please cite:
```bibtex
@software{Vanroy_CommonCrawl-CreativeCommons_2025,
author = {Vanroy, Bram},
license = {GPL-3.0},
month = feb,
title = {{CommonCrawl-CreativeCommons}},
url = {https://github.com/BramVanroy/CommonCrawl-CreativeCommons},
version = {1.3.0},
year = {2025}
}
```
## Acknowledgments
- The [Common Crawl](https://commoncrawl.org/) non-profit organization.
- The computational resources and services used in this work were provided by the [VSC (Flemish Supercomputer Center)](https://www.vscentrum.be/), funded by the Research Foundation Flanders (FWO) and the Flemish Government – department EWI under grant 2024-107.
- Guilherme Penedo ([@guipenedo](https://huggingface.co/guipenedo)) and the rest of the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [datatrove](https://github.com/huggingface/datatrove) team for the help and insights.
- [TNO](https://www.tno.nl/nl/), who funded the working hours needed to write this code. They intend to use (parts of) [the generated material](https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons) for the [GPT-NL project](https://gpt-nl.nl/).
- ML6 and specifically Robin Van Craenenbroek for their [Fondant Creative Commons](https://github.com/ml6team/fondant-usecase-filter-creative-commons/tree/add-fondant-usecase-cc-image-extraction) filter for image datasets. While my approach is different, their code did serve as inspiration.
Provider: maas
Created: 2025-05-07



