monology/c5_2022-05_nofalsepositives
Source: Hugging Face. Updated 2025-08-28; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/monology/c5_2022-05_nofalsepositives
Description:
This is a filtered version of the 2022-05 snapshot of the CommonCrawl-CreativeCommons dataset, with likely false positives removed. Every row whose license location is an <a> tag is discarded unless the license is also found in the document head or footer. This seems to get rid of pages where the CC license applies only to an image rather than to the entire document. The dataset retains approximately 65% of the original documents, about 2 million of which are also found in FineWeb.
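The filtering rule in the description can be sketched in a few lines of Python. This is only an illustration, not the code used to build the dataset: the field names `license_location`, `license_in_head`, and `license_in_footer`, and the value `"a_tag"`, are assumptions about the row schema and should be checked against the actual dataset columns.

```python
from typing import Any, Iterable, Iterator, Mapping


def keep_row(row: Mapping[str, Any]) -> bool:
    """Return True if a row survives the filter described above.

    Field names and the "a_tag" value are assumed for illustration.
    """
    # Rows whose license was not found in an <a> tag are always kept.
    if row.get("license_location") != "a_tag":
        return True
    # <a>-tag rows are kept only if the license is also in the head or footer.
    return bool(row.get("license_in_head")) or bool(row.get("license_in_footer"))


def filter_rows(rows: Iterable[Mapping[str, Any]]) -> Iterator[Mapping[str, Any]]:
    """Lazily yield only the rows that pass keep_row."""
    return (row for row in rows if keep_row(row))


# Hypothetical sample rows, made up to exercise each branch of the rule.
sample = [
    {"id": 1, "license_location": "a_tag", "license_in_head": False, "license_in_footer": False},
    {"id": 2, "license_location": "a_tag", "license_in_head": True, "license_in_footer": False},
    {"id": 3, "license_location": "a_tag", "license_in_head": False, "license_in_footer": True},
    {"id": 4, "license_location": "meta_tag", "license_in_head": False, "license_in_footer": False},
]

kept_ids = [row["id"] for row in filter_rows(sample)]
print(kept_ids)
```

Only the first sample row is dropped: its license appears solely in an <a> tag, which is the pattern the description associates with a license that covers an image rather than the whole page.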
Provider:
monology



