five

H-D-T/Buzz-slice-4-10-V1.2

收藏
Hugging Face2024-09-02 更新2024-06-29 收录
下载链接:
https://hf-mirror.com/datasets/H-D-T/Buzz-slice-4-10-V1.2
下载链接
链接失效反馈
官方服务:
资源简介:
Buzz数据集是一个高质量的预训练助手数据集,结合了强化学习(RL)和监督微调(SFT)技术。该数据集包含了435个高质量的指令跟随和对话数据集,去重后格式化为与当前本地生态系统兼容的结构。数据集包含了多种类型的数据,如指令跟随、对话、故事讲述和编码数据集,新增了超过500万行数据,并重新增强了数百万行数据。数据集的总对话轮次约为8500万次,包含单轮和多轮对话。数据集的格式与Axolotl和lmsys的FastChat兼容,结构包含source、stack、question_index和conversations等字段。

The Buzz dataset is a highly curated pretraining scale assistant dataset that unifies reinforcement learning (RL) and supervised fine-tuning (SFT). It contains 435 high-quality instruction following and conversational datasets, deduplicated and formatted to maintain compatibility with the current local ecosystem. The dataset includes various types of data such as instruction following, conversational, storytelling, and coding datasets, with over 5 million new rows of data and several million reaugmented rows of data. The total number of conversation turns is approximately 85 million, including both single and multiturn rows. The dataset is compatible with Axolotl and lmsys FastChat, and its structure includes fields such as source, stack, question_index, and conversations.
提供机构:
H-D-T
原始信息汇总

数据集概述

基本信息

  • 许可证: CC BY 4.0
  • 语言: 英语
  • 标签: 合成数据、代码、Orca、Alignment-Lab-AI、DPO、强化学习、RLHF、ShareGPT、ChatML、文本生成、指令
  • 名称: Select Stack
  • 大小: 1B < n < 10B

数据集描述

  • 名称: Buzz
  • 类型: 预训练规模助手数据集
  • 特点: 包含435个高质量的指令跟随和对话数据集,去重处理,格式化以保持和扩展训练类型与当前本地生态系统的兼容性。
  • 数据量: 包含超过500万条新数据和数百万条重新增强的数据,总计约8500万次对话。

数据结构

json { "source": "string containing the source dataset", "stack": "chosen/rejected for RL techniques", "question_index": "optional row, only contained in DPO specific dataset to match dpo pairs - int64", "conversations": [ { "from": "system", "value": "an initial system prompt or user query, may or may not be present depending on the row" }, { "from": "human or system", "value": "an initial human query" }, { "from": "gpt", "value": "a response to the previous turn, may be followed by additional human/gpt alternations" } ] }

数据来源

  • 总对话次数: 81,167,793
  • 总行数: 31,249,070
序号 来源 百分比 对话次数 行数
1 Flan: English 20.33% 16,500,966 8,250,483
2 Flan: Non English 18.47% 14,995,714 7,497,857
3 sodey 9.71% 7,883,090 917,016
4 OIG soda_dialog 7.93% 6,436,873 1,191,582
5 various orca style reaugmentations 3.62% 2,934,794 878,547
6 Select Stack 3.59% 2,911,650 1,455,825
7 sft-distil 3.59% 2,911,634 1,455,817
8 OIG abstract_infill 3.52% 2,858,795 232,188
9 medical_meadow_cord19 2.79% 2,265,654 755,218
10 EverythingIsAllYouNeed0.25 2.39% 1,941,198 970,599
... ... ... ... ...
300 chapel 0.00% 60 20
301 sparql 0.00% 60 23
302 coldfusion-cfc 0.00% 58 20
303 applescript 0.00% 57 19
304 parrot-internal-representation 0.00% 56 20
305 logos 0.00% 55 19
306 mistral-7b-instruct-v0.2 0.00% 54 27
307 literate-coffeescript 0.00% 53 18
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作