
TiWu-Lab/CC-zh

Source: Hugging Face. Updated 2025-10-20; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/TiWu-Lab/CC-zh
Description:
This is a high-quality, cleaned Chinese text dataset sourced from Common Crawl. Cleaning removed documents with a high proportion of characters that are neither Chinese nor English, of digits, or of uppercase letters, and used a fastText model to select Chinese documents. All Traditional Chinese text was converted to Simplified Chinese. Documents were then annotated for language quality with the Qwen2.5-32B-Instruct model, and an XLM-RoBERTa-large classifier based on these annotations filtered for higher-quality documents. The final dataset contains 623,807,180 samples, with a total size of 1.1 TB.
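The character-ratio step in the description can be illustrated with a minimal sketch. This is not the dataset authors' code: the function names `char_ratios` and `keep_document`, the 0.3 thresholds, and the choice to treat whitespace and punctuation as neutral are all assumptions made for illustration, since the listing does not state the actual cut-offs or character classes.

```python
import unicodedata

# Hypothetical thresholds: the listing does not give the real cut-offs.
MAX_OTHER = 0.3   # share of characters that are neither Chinese nor English
MAX_DIGIT = 0.3   # share of ASCII digits
MAX_UPPER = 0.3   # share of uppercase ASCII letters


def char_ratios(text):
    """Return the share of 'other', digit, and uppercase characters in text."""
    counts = {"other": 0, "digit": 0, "upper": 0}
    for ch in text:
        if "0" <= ch <= "9":
            counts["digit"] += 1
        elif "A" <= ch <= "Z":
            counts["upper"] += 1
        elif "a" <= ch <= "z" or "\u4e00" <= ch <= "\u9fff":
            continue  # lowercase English letter or CJK unified ideograph
        elif ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue  # assumption: whitespace and punctuation are neutral
        else:
            counts["other"] += 1
    total = max(len(text), 1)  # avoid division by zero on empty input
    return {name: n / total for name, n in counts.items()}


def keep_document(text):
    """Keep a document only if all three ratios are within the thresholds."""
    r = char_ratios(text)
    return (
        r["other"] <= MAX_OTHER
        and r["digit"] <= MAX_DIGIT
        and r["upper"] <= MAX_UPPER
    )
```

Under these assumed thresholds, a mostly Chinese sentence with a few English words passes, while a document dominated by digits, uppercase letters, or another script is dropped. The fastText language filter and the classifier-based quality filter described above are separate stages and are not covered by this sketch.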
Provider:
TiWu-Lab