TiWu-Lab/C4-zh
Source: Hugging Face. Updated 2025-03-29; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/TiWu-Lab/C4-zh
Description:
---
license: odc-by
---
Chinese text cleaned from [C4](https://huggingface.co/datasets/allenai/c4) with the following steps:
- documents containing non-Chinese, non-English text are removed
- documents containing more than 30% English text are removed
- all text in Traditional Chinese is converted into Simplified Chinese using [zhconv](https://github.com/gumblex/zhconv)
- low-quality text (e.g. boilerplate, advertisements) is heuristically removed
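
The first two steps above can be sketched as a character-level filter. This is only an illustration, not the authors' pipeline: the card does not say how the English share is measured, so the sketch assumes it is the fraction of ASCII letters among all alphabetic characters, and the helper names (`is_cjk`, `is_ascii_letter`, `keep_document`) are hypothetical.

```python
def is_cjk(ch: str) -> bool:
    """True for CJK Unified Ideographs (main block and Extension A)."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def is_ascii_letter(ch: str) -> bool:
    """True for the letters a-z and A-Z."""
    return ch.isascii() and ch.isalpha()


def keep_document(text: str, max_english_ratio: float = 0.30) -> bool:
    """Return True if the document passes the two language rules.

    Assumption: "English share" is the fraction of ASCII letters among
    all alphabetic characters; the dataset card does not define it.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False  # nothing to judge, so drop the document
    # Rule 1: any letter that is neither Chinese nor English removes the document.
    if any(not (is_cjk(ch) or is_ascii_letter(ch)) for ch in letters):
        return False
    # Rule 2: more than 30% English removes the document.
    english = sum(is_ascii_letter(ch) for ch in letters)
    return english / len(letters) <= max_english_ratio


# Traditional-to-Simplified conversion would follow for the kept documents,
# e.g. zhconv.convert(text, "zh-cn"), which needs the third-party zhconv package.
```

The heuristic removal of boilerplate and advertisements is not described in enough detail to sketch here.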
### Statistics
Number of samples: 32,485,463.
Size of parquet files: 61G.
### Filtered Version
A model-filtered version is available in the `filter` branch, containing 16,751,263 samples (33G of parquet files).
Qwen2.5-32B-Instruct is used to generate language quality annotations (on a scale of 1-5) for 398K Chinese samples and 250K English samples. An XLM-RoBERTa-large classifier is trained with regression on these annotations. Any document receiving a score of 1 or 2 from the classifier is removed. The remaining documents are accompanied by their scores.
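
Because the remaining documents carry their scores, a stricter subset can be selected downstream. The following is a minimal sketch under stated assumptions: the split name `train` and the column name `score` are guesses that should be checked against the actual parquet schema, and both helper functions are hypothetical.

```python
def stream_filtered(repo_id="TiWu-Lab/C4-zh", revision="filter", split="train"):
    """Stream the model-filtered branch without downloading it in full.

    Requires the `datasets` package and network access. The split name
    is an assumption; the card does not list the available splits.
    """
    from datasets import load_dataset  # lazy import keeps keep_by_score dependency-free

    return load_dataset(repo_id, revision=revision, split=split, streaming=True)


def keep_by_score(records, min_score=3.0, score_field="score"):
    """Yield only records whose quality score is at least `min_score`.

    `score_field` is an assumed column name for the classifier score.
    """
    for record in records:
        if record[score_field] >= min_score:
            yield record
```

For example, `keep_by_score(stream_filtered(), min_score=4.0)` would lazily iterate over documents scored 4 or higher, provided the assumed column name matches.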
Provider:
TiWu-Lab



