instruction-pretrain/general-instruction-augmented-corpora
Source: Hugging Face. Updated 2024-07-15. Indexed 2024-06-25.
Download link:
https://hf-mirror.com/datasets/instruction-pretrain/general-instruction-augmented-corpora
Description:
This dataset contains 200M instruction-response pairs covering more than 40 task categories, built for the Instruction Pre-Training framework. The pairs are generated by an efficient instruction synthesizer built on open-source models. The project also reports scaling the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M. The dataset is intended to verify the effectiveness of Instruction Pre-Training and is released alongside a series of pre-trained models and related dataset resources.
Provider:
instruction-pretrain
Summary of original information
Dataset license
- License type: Open Data Commons Attribution License (ODC-BY)



