chinese-babylm-org/babylm-zho-100M

Name: chinese-babylm-org/babylm-zho-100M
Creator: chinese-babylm-org
Published: 2026-04-15 05:22:39
License: No description available

Source: Hugging Face. Updated 2026-04-15. Indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/chinese-babylm-org/babylm-zho-100M


Description:

# babylm-zho-100M

A filtered version of [BabyLM-community/babylm-zho](https://huggingface.co/datasets/BabyLM-community/babylm-zho), a Chinese-language corpus designed for the BabyLM Challenge. This is the official training data for [Chinese BabyLM Challenge](https://chinese-babylm.github.io/).

## Size

The filtered dataset contains approximately **101,343,320 tokens** (tokenized with [jieba](https://github.com/fxsjy/jieba)).

## Modifications

The original `babylm-zho` dataset was filtered to reduce the proportion of speech-derived text. Specifically, 1/2 of the entries sourced from **WenetSpeech** were removed. All other entries are retained unchanged.

| Source | Original entries | Entries removed | Entries kept |
|--------|-----------------|-----------------|--------------|
| WenetSpeech | 40,586 | 20,293 | 20,293 |
| All other sources | — | 0 | unchanged |

## Dataset Fields

| Field | Description |
|-------|-------------|
| `text` | The text content |
| `doc-id` | Document identifier |
| `category` | Content category |
| `data-source` | Original data source (e.g. WenetSpeech, Wikipedia) |
| `script` | Writing script |
| `age-estimate` | Estimated target age |
| `license` | License information |
| `misc` | Miscellaneous metadata |
| `num-tokens` | Token count |
| `language` | Language tag |

## Source Dataset

- **Original dataset:** [BabyLM-community/babylm-zho](https://huggingface.co/datasets/BabyLM-community/babylm-zho)
- **Filtering script:** `filter_and_upload_babylm_zho.py`

## Cite

Please cite the following paper if you are using this dataset.

```
@inproceedings{jumelet2026babybabellm,
  title={Babybabellm: A multilingual benchmark of developmentally plausible training data},
  author={Jumelet, Jaap and Fourtassi, Abdellah and Haga, Akari and Bunzeck, Bastian and Shandilya, Bhargav and Galvan-Sosa, Diana and Haznitrama, Faiz Ghifari and Padovani, Francesca and Meyer, Francois and Hu, Hai and others},
  booktitle={Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={3297--3329},
  year={2026}
}
```
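
## Filtering sketch

The Modifications section says that half of the WenetSpeech entries were removed, but the card does not say how `filter_and_upload_babylm_zho.py` chooses which half. The sketch below is one possible reading, not the actual script. It assumes a seeded random sample, and assumes that the `data-source` field holds the literal string `WenetSpeech`. The function name `drop_half_of_source` and the toy records are hypothetical.

```python
import random
from typing import Dict, List


def drop_half_of_source(
    records: List[Dict[str, str]],
    source: str = "WenetSpeech",
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Remove half of the records whose `data-source` equals `source`.

    Assumption: the half to drop is a seeded random sample. The real
    filtering script may select entries differently.
    Records from every other source are kept, in their original order.
    """
    # Positions of all records that come from the targeted source.
    idx = [i for i, r in enumerate(records) if r.get("data-source") == source]
    rng = random.Random(seed)
    # Drop floor(n / 2) of them, e.g. 20,293 out of 40,586.
    drop = set(rng.sample(idx, len(idx) // 2))
    return [r for i, r in enumerate(records) if i not in drop]


# Toy records that reuse the field names from the Dataset Fields table.
toy = [{"text": f"utt {i}", "data-source": "WenetSpeech"} for i in range(6)]
toy += [{"text": f"doc {i}", "data-source": "Wikipedia"} for i in range(3)]
filtered = drop_half_of_source(toy)
```

On the real corpus, the same function could be applied to rows of the original `babylm-zho` dataset once they are loaded as dictionaries, provided the rows expose the `data-source` field as listed above.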

Provided by:

chinese-babylm-org
