ROOTS dataset
Dataset overview
Dataset name
ROOTS dataset
Dataset purpose
Used to train the BLOOM model and to build its tokenizer.
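Dataset preparation process
- Data cleaning, filtering, and deduplication: see Sections 2 and 3 of the paper for the detailed procedure.
- Data processing pipeline: the source includes a flowchart showing how the raw data is transformed into the final dataset.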
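As a rough illustration of what a clean, filter, and deduplicate pass looks like in general, the sketch below chains the three steps over a list of strings. It is not the ROOTS pipeline: the function names, the whitespace normalization, the minimum-word threshold, and the exact-match hashing are all assumptions made for this example, and the actual procedure is the one described in Sections 2 and 3 of the paper.

```python
import hashlib
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace (illustrative only)."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality filter: drop documents with fewer than min_words words."""
    return len(text.split()) >= min_words


def prepare(docs):
    """Clean each document, filter it, then drop exact duplicates of the cleaned text."""
    seen = set()
    result = []
    for raw in docs:
        text = clean(raw)
        if not keep(text):
            continue
        # Exact deduplication: two documents match only if their cleaned text is identical.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        result.append(text)
    return result
```

Calling `prepare` on a list of raw strings returns the cleaned documents that pass the filter, in their original order, with later exact duplicates removed.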
Key resources
Citation
@inproceedings{bigscience-roots:2022,
  title={The BigScience {ROOTS} Corpus: A 1.6{TB} Composite Multilingual Dataset},
  author={Hugo Lauren{\c{c}}on and Lucile Saulnier and Thomas Wang and Christopher Akiki and Albert Villanova del Moral and Teven Le Scao and Leandro Von Werra and Chenghao Mou and Eduardo Gonz{\'a}lez Ponferrada and Huu Nguyen and J{\"o}rg Frohberg and Mario {\v{S}}a{\v{s}}ko and Quentin Lhoest and Angelina McMillan-Major and G{\'e}rard Dupont and Stella Biderman and Anna Rogers and Loubna Ben allal and Francesco De Toni and Giada Pistilli and Olivier Nguyen and Somaieh Nikpoor and Maraim Masoud and Pierre Colombo and Javier de la Rosa and Paulo Villegas and Tristan Thrush and Shayne Longpre and Sebastian Nagel and Leon Weber and Manuel Romero Mu{\~n}oz and Jian Zhu and Daniel Van Strien and Zaid Alyafeai and Khalid Almubarak and Vu Minh Chien and Itziar Gonzalez-Dios and Aitor Soroa and Kyle Lo and Manan Dey and Pedro Ortiz Suarez and Aaron Gokaslan and Shamik Bose and David Ifeoluwa Adelani and Long Phan and Hieu Tran and Ian Yu and Suhas Pai and Jenny Chim and Violette Lepercq and Suzana Ilic and Margaret Mitchell and Sasha Luccioni and Yacine Jernite},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=UoEw6KigkUn}
}




