Geralt-Targaryen/C4-zh

Name: Geralt-Targaryen/C4-zh
Creator: Geralt-Targaryen
Published: 2025-03-29 04:29:21
License: No description provided

Source: Hugging Face. Updated: 2025-03-29. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/Geralt-Targaryen/C4-zh

Description:

This is a Chinese text dataset cleaned from the C4 dataset. Cleaning removes documents containing text that is neither Chinese nor English, as well as documents that are more than 30% English. All Traditional Chinese text is converted to Simplified Chinese. Low-quality text such as boilerplate and advertisements is also removed. The dataset contains 32,485,463 samples, totaling 61 GB in Parquet files. There is also a model-filtered version with 16,751,263 samples, totaling 33 GB in Parquet files. To build it, language quality annotations (on a scale of 1-5) were generated for 398K Chinese samples and 250K English samples using the Qwen2.5-32B-Instruct model; an XLM-RoBERTa-large classifier trained on these annotations with a regression objective was then used to remove documents scoring 1 or 2.
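The two thresholds described above (at most 30% English, quality score above 2) can be illustrated with a minimal sketch. The description does not say how the English share was measured or how regression outputs were mapped to the 1-5 scale, so everything here is an assumption made for illustration: the character-counting heuristic, the function names `english_ratio` and `keep_document`, and the premise that each document already carries an integer quality score.

```python
def english_ratio(text: str) -> float:
    """Share of ASCII letters among ASCII letters plus CJK ideographs.

    Assumption: a character-level heuristic; the dataset authors'
    actual measure of "percent English" is not documented here.
    """
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # Basic CJK Unified Ideographs block only (U+4E00..U+9FFF).
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = latin + cjk
    return latin / total if total else 0.0


def keep_document(text: str, quality_score: int,
                  max_english: float = 0.30, min_score: int = 3) -> bool:
    """Keep a document unless it is more than 30% English or scores 1 or 2.

    Assumption: quality_score is an integer from 1 to 5 that was
    assigned beforehand (hypothetical input, e.g. a rounded model output).
    """
    if english_ratio(text) > max_english:
        return False
    return quality_score >= min_score


# Hypothetical (text, score) pairs, not taken from the dataset.
docs = [("你好世界", 4), ("hello world 你好", 5), ("今天天气很好", 2)]
kept = [text for text, score in docs if keep_document(text, score)]
```

Only the first document survives in this toy run: the second is mostly English and the third falls below the score cutoff.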

Provider:

Geralt-Targaryen