
BabyLM-community/BabyLM-2026-Strict

Source: Hugging Face | Updated: 2026-04-08 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/BabyLM-community/BabyLM-2026-Strict
Description:
---
license: mit
---

# Detoxified 100M BabyLM Training Dataset (BabyLM Turns 4, 2026 BabyLM)

BabyLM 2026 strict training set. Total: **100M tokens**.

Please cite the following:

```
@misc{choshen2026babylmturns4papers,
  title={BabyLM Turns 4: Call for Papers for the 2026 BabyLM Workshop},
  author={Leshem Choshen and Ryan Cotterell and Mustafa Omer Gul and Jaap Jumelet and Tal Linzen and Aaron Mueller and Suchir Salhan and Raj Sanjay Shah and Alex Warstadt and Ethan Gotlieb Wilcox},
  year={2026},
  eprint={2602.20092},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.20092},
}
```

## Token Counts Breakdown

| File | Tokens |
|------|-------:|
| bnc_spoken.train.txt | 7,620,671 |
| childes.train.txt | 28,410,878 |
| gutenberg.train.txt | 25,576,896 |
| open_subtitles.train.txt | 22,828,747 |
| simple_wiki.train.txt | 15,314,317 |
| switchboard.train.txt | 248,491 |
| **Total** | **100,000,000** |

## Data Decontamination

All training data was subjected to precorpus debiasing using a pipeline for detecting and removing toxic content from naturalistic language corpora. The pipeline applies hate speech detection, sentiment scoring, and emotion analysis to flag problematic sentences, which are then filtered using gender and race word lists alongside an explicit slur lexicon. The aim is to ensure that demographic mentions and identity-related language in the corpus do not carry harmful associations that could be learned by models trained on this data. The training data decontamination was motivated by efforts on the Interaction Track (e.g., Salhan et al., 2025; Trhlik, Caines & Buttery, 2026).

```
@misc{salhan2025teacherdemonstrationsbabylmszone,
  title={Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction},
  author={Suchir Salhan and Hongyi Gu and Donya Rooein and Diana Galvan-Sosa and Gabrielle Gaudeau and Andrew Caines and Zheng Yuan and Paula Buttery},
  year={2025},
  eprint={2510.20411},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.20411},
}
```

```
@misc{trhlik2026biasdynamicsbabylmscomputeefficient,
  title={Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing},
  author={Filip Trhlik and Andrew Caines and Paula Buttery},
  year={2026},
  eprint={2601.09421},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.09421},
}
```
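
As a quick sanity check on the Token Counts Breakdown, the per-file figures can be summed and turned into corpus shares. The sketch below only hard-codes the numbers listed in the card; the names `TOKEN_COUNTS` and `corpus_shares` are illustrative and not part of the dataset.

```python
# Per-file token counts copied from the Token Counts Breakdown in the card.
TOKEN_COUNTS = {
    "bnc_spoken.train.txt": 7_620_671,
    "childes.train.txt": 28_410_878,
    "gutenberg.train.txt": 25_576_896,
    "open_subtitles.train.txt": 22_828_747,
    "simple_wiki.train.txt": 15_314_317,
    "switchboard.train.txt": 248_491,
}


def corpus_shares(counts):
    """Return each file's fraction of the total token count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(TOKEN_COUNTS.values())
shares = corpus_shares(TOKEN_COUNTS)

if __name__ == "__main__":
    print(f"total tokens: {total:,}")
    # List files from largest to smallest share of the corpus.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:28s} {share:6.2%}")
```

The listed counts add up to exactly 100,000,000, with `childes.train.txt` as the largest single source.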

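
The flag-then-filter logic described under Data Decontamination can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the card does not give the actual models, thresholds, word lists, or the exact rule for combining them. `toy_toxicity_score` is a hypothetical stand-in for the hate speech, sentiment, and emotion models, the word lists are placeholders, and the combination rule (drop any sentence containing a lexicon slur; drop a flagged sentence only if it also mentions an identity term) is one plausible reading, not the authors' implementation.

```python
import re
from typing import Callable, Iterable, List, Set

# Hypothetical placeholder lists; the real lexicons are not given in the card.
IDENTITY_TERMS: Set[str] = {"women", "men", "girl", "boy"}
SLUR_LEXICON: Set[str] = {"slurword"}  # placeholder token, not a real entry


def tokenize(sentence: str) -> Set[str]:
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def toy_toxicity_score(sentence: str) -> float:
    """Toy stand-in (assumption) for the detection and scoring models."""
    hostile = {"hate", "stupid", "disgusting"}
    words = tokenize(sentence)
    return len(words & hostile) / max(len(words), 1)


def keep_sentence(
    sentence: str,
    score_fn: Callable[[str], float] = toy_toxicity_score,
    threshold: float = 0.1,  # hypothetical cut-off
) -> bool:
    """Assumed rule: drop slurs outright; drop flagged identity mentions."""
    words = tokenize(sentence)
    if words & SLUR_LEXICON:
        return False
    flagged = score_fn(sentence) >= threshold
    return not (flagged and bool(words & IDENTITY_TERMS))


def filter_corpus(sentences: Iterable[str]) -> List[str]:
    """Keep only the sentences that pass the assumed filter rule."""
    return [s for s in sentences if keep_sentence(s)]


if __name__ == "__main__":
    sample = [
        "I hate rainy days",
        "I hate women drivers",
        "The girl read a book",
        "you slurword",
    ]
    print(filter_corpus(sample))
```

Under this reading, a negative sentence with no identity term is kept, and a neutral sentence that mentions an identity term is kept as well; only the combination of the two, or a lexicon hit, removes a sentence.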
Provider:
BabyLM-community