
chinese-babylm-org/babylm-zho-100M

Hugging Face | Updated 2026-04-15 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/chinese-babylm-org/babylm-zho-100M
Resource description:
# babylm-zho-100M

A filtered version of [BabyLM-community/babylm-zho](https://huggingface.co/datasets/BabyLM-community/babylm-zho), a Chinese-language corpus designed for the BabyLM Challenge. This is the official training data for [Chinese BabyLM Challenge](https://chinese-babylm.github.io/).

## Size

The filtered dataset contains approximately **101,343,320 tokens** (tokenized with [jieba](https://github.com/fxsjy/jieba)).

## Modifications

The original `babylm-zho` dataset was filtered to reduce the proportion of speech-derived text. Specifically, 1/2 of the entries sourced from **WenetSpeech** were removed. All other entries are retained unchanged.

| Source | Original entries | Entries removed | Entries kept |
|--------|------------------|-----------------|--------------|
| WenetSpeech | 40,586 | 20,293 | 20,293 |
| All other sources | - | 0 | unchanged |

## Dataset Fields

| Field | Description |
|-------|-------------|
| `text` | The text content |
| `doc-id` | Document identifier |
| `category` | Content category |
| `data-source` | Original data source (e.g. WenetSpeech, Wikipedia) |
| `script` | Writing script |
| `age-estimate` | Estimated target age |
| `license` | License information |
| `misc` | Miscellaneous metadata |
| `num-tokens` | Token count |
| `language` | Language tag |

## Source Dataset

- **Original dataset:** [BabyLM-community/babylm-zho](https://huggingface.co/datasets/BabyLM-community/babylm-zho)
- **Filtering script:** `filter_and_upload_babylm_zho.py`

## Cite

Please cite the following paper if you are using this dataset.

```
@inproceedings{jumelet2026babybabellm,
  title={Babybabellm: A multilingual benchmark of developmentally plausible training data},
  author={Jumelet, Jaap and Fourtassi, Abdellah and Haga, Akari and Bunzeck, Bastian and Shandilya, Bhargav and Galvan-Sosa, Diana and Haznitrama, Faiz Ghifari and Padovani, Francesca and Meyer, Francois and Hu, Hai and others},
  booktitle={Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={3297--3329},
  year={2026}
}
```
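
The filtering described under Modifications can be illustrated with a minimal sketch. The contents of `filter_and_upload_babylm_zho.py` are not shown in this card, so the selection method here (a seeded random sample) is an assumption, and the function names `drop_fraction_of_source` and `total_tokens` are hypothetical. Only the field names `data-source` and `num-tokens` come from the Dataset Fields table; the toy records are invented for illustration.

```python
import random


def drop_fraction_of_source(entries, source="WenetSpeech", fraction=0.5, seed=0):
    """Return entries with `fraction` of those from `source` removed.

    Assumption: the removed entries are chosen by seeded random sampling.
    The real filtering script may select them differently.
    """
    # Positions of the entries that come from the targeted source.
    candidates = [i for i, e in enumerate(entries) if e["data-source"] == source]
    n_drop = int(len(candidates) * fraction)
    dropped = set(random.Random(seed).sample(candidates, n_drop))
    # Entries from every other source are kept unchanged.
    return [e for i, e in enumerate(entries) if i not in dropped]


def total_tokens(entries):
    """Sum the per-entry `num-tokens` field."""
    return sum(e["num-tokens"] for e in entries)


# Toy records (invented), using field names from the Dataset Fields table.
toy = [
    {"doc-id": f"w{i}", "data-source": "WenetSpeech", "num-tokens": 10}
    for i in range(4)
] + [
    {"doc-id": f"k{i}", "data-source": "Wikipedia", "num-tokens": 25}
    for i in range(2)
]

kept = drop_fraction_of_source(toy)
print(len(kept), total_tokens(kept))
```

With the counts reported in the table, `int(40586 * 0.5)` gives 20,293 entries removed, which leaves 20,293 WenetSpeech entries.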
Provider:
chinese-babylm-org