chinese-babylm-org/babylm-zho-100M
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/chinese-babylm-org/babylm-zho-100M
Resource description:
# babylm-zho-100M
A filtered version of [BabyLM-community/babylm-zho](https://huggingface.co/datasets/BabyLM-community/babylm-zho), a Chinese-language corpus designed for the BabyLM Challenge.
This is the official training data for the [Chinese BabyLM Challenge](https://chinese-babylm.github.io/).
## Size
The filtered dataset contains approximately **101,343,320 tokens** (tokenized with [jieba](https://github.com/fxsjy/jieba)).
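As a rough illustration of how a jieba-based token count could be computed, the sketch below sums the number of segments per text. It assumes jieba's default accurate mode via `jieba.lcut`; the card does not say which mode was used or whether punctuation and whitespace segments were counted, so the result may not reproduce the figure above exactly. The helper names `jieba_segment` and `count_tokens` are hypothetical.

```python
from typing import Callable, Iterable, List


def jieba_segment(text: str) -> List[str]:
    """Segment text with jieba's default (accurate) mode.

    Assumption: the dataset card does not state which jieba mode was used.
    """
    import jieba  # third-party; imported lazily so count_tokens works without it

    return jieba.lcut(text)


def count_tokens(texts: Iterable[str], segment: Callable[[str], List[str]]) -> int:
    """Sum the number of segments that `segment` produces over all texts."""
    return sum(len(segment(t)) for t in texts)


# Example call (requires `pip install jieba`):
#   count_tokens(["我喜欢读书。"], jieba_segment)
```

Passing the segmenter as an argument keeps the counting logic independent of jieba, so a different tokenizer can be swapped in for comparison.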
## Modifications
The original `babylm-zho` dataset was filtered to reduce the proportion of speech-derived text. Specifically, half of the entries sourced from **WenetSpeech** were removed; all other entries are retained unchanged.
| Source | Original entries | Entries removed | Entries kept |
|--------|-----------------|-----------------|--------------|
| WenetSpeech | 40,586 | 20,293 | 20,293 |
| All other sources | — | 0 | unchanged |
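The filtering described above can be sketched as follows. This is not the actual `filter_and_upload_babylm_zho.py` script: the card does not say how the removed half was chosen, so the sketch assumes uniform random sampling with a fixed seed. The function name `drop_fraction_of_source` and its parameters are hypothetical; only the `data-source` field and the `WenetSpeech` value come from the card.

```python
import random
from typing import Dict, List


def drop_fraction_of_source(
    entries: List[Dict],
    source: str = "WenetSpeech",
    fraction: float = 0.5,
    seed: int = 0,
) -> List[Dict]:
    """Remove `fraction` of the entries whose `data-source` equals `source`.

    Assumption: entries to remove are sampled uniformly at random; the real
    selection rule is not documented. All other entries keep their order.
    """
    rng = random.Random(seed)
    # Positions of all entries that come from the targeted source.
    candidates = [i for i, e in enumerate(entries) if e["data-source"] == source]
    n_remove = int(len(candidates) * fraction)
    removed = set(rng.sample(candidates, n_remove))
    return [e for i, e in enumerate(entries) if i not in removed]
```

With 40,586 WenetSpeech entries and `fraction=0.5`, this removes 20,293 entries and keeps 20,293, matching the counts in the table.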
## Dataset Fields
| Field | Description |
|-------|-------------|
| `text` | The text content |
| `doc-id` | Document identifier |
| `category` | Content category |
| `data-source` | Original data source (e.g. WenetSpeech, Wikipedia) |
| `script` | Writing script |
| `age-estimate` | Estimated target age |
| `license` | License information |
| `misc` | Miscellaneous metadata |
| `num-tokens` | Token count |
| `language` | Language tag |
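To show how these fields might be used together, the sketch below totals `num-tokens` per `data-source`. It assumes the Hugging Face `datasets` library, a split named `train`, and that `num-tokens` is stored as a value convertible to an integer; none of these is confirmed by the card. The helper names `tokens_per_source` and `load_records` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def tokens_per_source(records: Iterable[Mapping]) -> Dict[str, int]:
    """Sum the `num-tokens` field for each distinct `data-source` value."""
    totals: Counter = Counter()
    for record in records:
        totals[record["data-source"]] += int(record["num-tokens"])
    return dict(totals)


def load_records(split: str = "train"):
    """Load the dataset from the Hugging Face Hub.

    Assumption: the split is called "train"; check the dataset page if this fails.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("chinese-babylm-org/babylm-zho-100M", split=split)


# Example call (downloads the data):
#   print(tokens_per_source(load_records()))
```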
## Source Dataset
- **Original dataset:** [BabyLM-community/babylm-zho](https://huggingface.co/datasets/BabyLM-community/babylm-zho)
- **Filtering script:** `filter_and_upload_babylm_zho.py`
## Cite
Please cite the following paper if you use this dataset.
```
@inproceedings{jumelet2026babybabellm,
title={{BabyBabelLM}: A multilingual benchmark of developmentally plausible training data},
author={Jumelet, Jaap and Fourtassi, Abdellah and Haga, Akari and Bunzeck, Bastian and Shandilya, Bhargav and Galvan-Sosa, Diana and Haznitrama, Faiz Ghifari and Padovani, Francesca and Meyer, Francois and Hu, Hai and others},
booktitle={Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={3297--3329},
year={2026}
}
```
Provided by:
chinese-babylm-org



