Geralt-Targaryen/CC-zh
Source: Hugging Face | Updated: 2025-04-17 | Indexed: 2025-08-30
Download link:
https://hf-mirror.com/datasets/Geralt-Targaryen/CC-zh
Description:
This dataset consists of high-quality Chinese text extracted from Common Crawl and passed through several cleaning and preprocessing steps: removing documents with a high proportion of characters that are neither Chinese nor English, or of digits and uppercase letters; converting Traditional Chinese to Simplified Chinese; removing low-quality documents and duplicates; and scoring document quality with the Qwen2.5-32B-Instruct model.
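The character-ratio filtering mentioned in the description could look roughly like the sketch below. It is only an illustration: the thresholds `MAX_OTHER_RATIO` and `MAX_DIGIT_UPPER_RATIO`, the helper names, and the exact Unicode ranges treated as "Chinese" are assumptions, not values taken from the dataset authors.

```python
# Hypothetical thresholds, chosen only for illustration; the dataset's
# actual cut-offs are not stated in the description.
MAX_OTHER_RATIO = 0.1        # share of chars that are neither Chinese nor ASCII
MAX_DIGIT_UPPER_RATIO = 0.3  # share of chars that are ASCII digits or uppercase


def _is_chinese(c: str) -> bool:
    """Assumed definition: CJK ideographs plus CJK/fullwidth punctuation."""
    return (
        "\u4e00" <= c <= "\u9fff"      # CJK Unified Ideographs
        or "\u3000" <= c <= "\u303f"   # CJK symbols and punctuation
        or "\uff00" <= c <= "\uffef"   # halfwidth and fullwidth forms
    )


def char_ratios(text: str) -> tuple[float, float]:
    """Return (other_ratio, digit_upper_ratio) over non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Treat empty documents as maximally bad so they are dropped.
        return 1.0, 1.0
    other = sum(1 for c in chars if not (_is_chinese(c) or c.isascii()))
    digit_upper = sum(
        1 for c in chars if c.isascii() and (c.isdigit() or c.isupper())
    )
    return other / len(chars), digit_upper / len(chars)


def keep_document(text: str) -> bool:
    """Keep a document only if both ratios stay under the assumed thresholds."""
    other_ratio, digit_upper_ratio = char_ratios(text)
    return (
        other_ratio <= MAX_OTHER_RATIO
        and digit_upper_ratio <= MAX_DIGIT_UPPER_RATIO
    )
```

Under these assumed thresholds, a mostly Chinese document with a few English words is kept, while a string of serial numbers or a document in another script is dropped.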
Provider:
Geralt-Targaryen



