sail/regmix-data-sample

Source: Hugging Face. Updated 2024-07-11. Indexed 2024-07-06.
Download link:
https://hf-mirror.com/datasets/sail/regmix-data-sample
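
The dataset's README is described as recommending a download of the whole snapshot rather than individual files. A minimal sketch of how that might look with the `huggingface_hub` library is shown here; the function name `download_regmix_sample` and the default `local_dir` are illustrative assumptions, not taken from the README.

```python
REPO_ID = "sail/regmix-data-sample"  # dataset identifier from this listing


def download_regmix_sample(local_dir: str = "regmix-data-sample") -> str:
    """Download the full dataset snapshot and return the local path.

    The function name and default directory are hypothetical. The import is
    done lazily so this module can be loaded without huggingface_hub installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # required: the repo is a dataset, not a model
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Roughly 20 GB will be written to disk, per the dataset description.
    print(download_regmix_sample())
```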
Overview:
RegMix Data Sample is a curated dataset derived from Pile-Uncopyrighted and built for the RegMix paper, which frames the search for high-performing data mixtures for language model pre-training as a regression task. The sample takes about 20 GB on disk and contains roughly 5B tokens. Its distribution follows the natural token distribution of the domain examples, and examples from different domains are kept in separate files. The data is split into train and valid directories, each holding domain-specific JSONL files. The README recommends downloading the entire dataset snapshot and includes sample download code. It also describes preprocessing: the domain files are converted to a binary format, after which examples can be randomly sampled according to a user-defined data mixture. Acknowledgements and citation information for the original dataset and the RegMix paper are included as well.
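
To make the idea of sampling by a user-defined data mixture concrete, here is a toy sketch. It is not the preprocessing code from the README: it works on in-memory lists instead of binary files, and the names `sample_mixture`, `domain_a`, and `domain_b` are hypothetical placeholders.

```python
import random


def sample_mixture(domain_examples, weights, n, seed=0):
    """Draw n (domain, example) pairs according to per-domain mixture weights.

    domain_examples: dict mapping a domain name to a list of its examples.
    weights: dict mapping a domain name to a non-negative mixture weight.
    """
    rng = random.Random(seed)  # seeded for reproducible draws
    domains = [d for d, w in weights.items() if w > 0]
    if not domains:
        raise ValueError("at least one domain needs a positive weight")
    probs = [weights[d] for d in domains]
    # First choose a domain per draw, then an example within that domain.
    picks = rng.choices(domains, weights=probs, k=n)
    return [(d, rng.choice(domain_examples[d])) for d in picks]


# Placeholder data standing in for two domain-specific JSONL files.
toy_data = {
    "domain_a": ["a1", "a2", "a3"],
    "domain_b": ["b1", "b2"],
}
mixed = sample_mixture(toy_data, {"domain_a": 0.7, "domain_b": 0.3}, n=10)
```

Weights do not need to sum to 1, since `random.choices` normalizes them; a weight of 0 excludes a domain entirely.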
Provider:
sail