Geralt-Targaryen/C4-zh
Source: Hugging Face. Updated 2025-03-29. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/Geralt-Targaryen/C4-zh
Description:
This is a Chinese text dataset cleaned from the C4 dataset. Cleaning removes documents that contain text in languages other than Chinese or English, as well as documents that are more than 30% English. All Traditional Chinese text is converted to Simplified Chinese, and low-quality text such as boilerplate and advertisements is removed. The resulting dataset contains 32,485,463 samples, totaling 61G in Parquet files. A model-filtered version is also available, with 16,751,263 samples totaling 33G in Parquet files. For that version, language-quality annotations (on a scale of 1-5) were generated for 398K Chinese samples and 250K English samples using the Qwen2.5-32B-Instruct model, and an XLM-RoBERTa-large classifier trained with regression is used to remove documents scoring 1 or 2.
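The two numeric filtering rules in the description (the 30% English cap and the removal of documents scoring 1 or 2) can be illustrated with a minimal Python sketch. This is not the dataset author's pipeline: the description does not say how the English share is measured or how continuous regression outputs map to the 1-5 scale, so the character-level ratio, the `min_score` cutoff of 2.5, and the function names `english_ratio` and `keep_document` are all assumptions made for illustration.

```python
def english_ratio(text: str) -> float:
    """Share of non-whitespace characters that are ASCII letters.

    Assumption: "English" is approximated by ASCII letters; the
    original cleaning may measure it differently (e.g. by tokens).
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    english = sum(1 for c in chars if c.isascii() and c.isalpha())
    return english / len(chars)


def keep_document(
    text: str,
    quality_score: float,
    max_english: float = 0.30,  # from the description: drop if more than 30% English
    min_score: float = 2.5,     # assumed cutoff standing in for "scoring 1 or 2"
) -> bool:
    """Return True if a document passes both illustrative filters."""
    if english_ratio(text) > max_english:
        return False
    return quality_score >= min_score


if __name__ == "__main__":
    samples = [
        ("今天天气很好,我们去公园散步。", 4.2),
        ("Click here to buy now 立即购买", 3.8),
        ("今天天气很好", 1.7),
    ]
    for text, score in samples:
        print(keep_document(text, score), text)
```

In this sketch `quality_score` stands in for the prediction of the regression-trained classifier; producing that score requires the trained model, which is outside the scope of the example.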
Provider:
Geralt-Targaryen



