TiWu-Lab/CC-zh
Source: Hugging Face. Updated 2025-10-20; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/TiWu-Lab/CC-zh
Description:
This is a high-quality, cleaned Chinese text dataset sourced from Common Crawl. Cleaning removed documents with a high proportion of characters that are neither Chinese nor English, of digits, or of uppercase letters, and used a fastText model to select Chinese documents. All Traditional Chinese text was converted to Simplified Chinese. Documents were then annotated for language quality with the Qwen2.5-32B-Instruct model, and an XLM-RoBERTa-large classifier based on these annotations was used to keep the higher-quality documents. The final dataset contains 623,807,180 samples, with a total size of 1.1 TB.
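The character-ratio step described above can be illustrated with a minimal Python sketch. It is not the dataset authors' implementation: the function names, the 0.3 thresholds, and the choice to count only letters outside CJK ideographs and ASCII as "non-Chinese and non-English" are all assumptions made for illustration, since the listing does not state the actual rules or cutoffs.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: "Chinese" means the basic CJK Unified Ideographs block
    # plus Extension A; rarer extension blocks are ignored in this sketch.
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def char_ratios(text: str):
    """Return (foreign, digit, upper) ratios over non-whitespace characters,
    or None if the text has no such characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return None
    foreign = digits = upper = 0
    for c in chars:
        if c.isdigit():
            digits += 1
        elif c.isalpha() and not is_cjk(c) and not c.isascii():
            # A letter that is neither a CJK ideograph nor an ASCII letter.
            foreign += 1
        elif c.isascii() and c.isupper():
            upper += 1
    n = len(chars)
    return foreign / n, digits / n, upper / n


def keep_document(text: str,
                  max_foreign: float = 0.3,   # hypothetical threshold
                  max_digit: float = 0.3,     # hypothetical threshold
                  max_upper: float = 0.3) -> bool:  # hypothetical threshold
    """Keep a document only if all three ratios stay under their limits."""
    ratios = char_ratios(text)
    if ratios is None:
        return False
    foreign, digit, upper = ratios
    return foreign <= max_foreign and digit <= max_digit and upper <= max_upper
```

In a pipeline like the one described, a filter of this kind would presumably run before the fastText language check and the model-based quality scoring, because it is cheap and needs no model inference.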
Provider:
TiWu-Lab



