H-D-T/Buzz-slice-4-10-V1.2
收藏Hugging Face2024-09-02 更新2024-06-29 收录
下载链接:
https://hf-mirror.com/datasets/H-D-T/Buzz-slice-4-10-V1.2
下载链接
链接失效反馈官方服务:
资源简介:
Buzz数据集是一个高质量的预训练助手数据集,结合了强化学习(RL)和监督微调(SFT)技术。该数据集包含了435个高质量的指令跟随和对话数据集,去重后格式化为与当前本地生态系统兼容的结构。数据集包含了多种类型的数据,如指令跟随、对话、故事讲述和编码数据集,新增了超过500万行数据,并重新增强了数百万行数据。数据集的总对话轮次约为8500万次,包含单轮和多轮对话。数据集的格式与Axolotl和lmsys的FastChat兼容,结构包含source、stack、question_index和conversations等字段。
The Buzz dataset is a highly curated pretraining scale assistant dataset that unifies reinforcement learning (RL) and supervised fine-tuning (SFT). It contains 435 high-quality instruction following and conversational datasets, deduplicated and formatted to maintain compatibility with the current local ecosystem. The dataset includes various types of data such as instruction following, conversational, storytelling, and coding datasets, with over 5 million new rows of data and several million reaugmented rows of data. The total number of conversation turns is approximately 85 million, including both single and multiturn rows. The dataset is compatible with Axolotl and lmsys FastChat, and its structure includes fields such as source, stack, question_index, and conversations.
提供机构:
H-D-T
原始信息汇总
数据集概述
基本信息
- 许可证: CC BY 4.0
- 语言: 英语
- 标签: 合成数据、代码、Orca、Alignment-Lab-AI、DPO、强化学习、RLHF、ShareGPT、ChatML、文本生成、指令
- 名称: Select Stack
- 大小: 1B < n < 10B
数据集描述
- 名称: Buzz
- 类型: 预训练规模助手数据集
- 特点: 包含435个高质量的指令跟随和对话数据集,去重处理,格式化以保持和扩展训练类型与当前本地生态系统的兼容性。
- 数据量: 包含超过500万条新数据和数百万条重新增强的数据,总计约8500万次对话。
数据结构
json { "source": "string containing the source dataset", "stack": "chosen/rejected for RL techniques", "question_index": "optional row, only contained in DPO specific dataset to match dpo pairs - int64", "conversations": [ { "from": "system", "value": "an initial system prompt or user query, may or may not be present depending on the row" }, { "from": "human or system", "value": "an initial human query" }, { "from": "gpt", "value": "a response to the previous turn, may be followed by additional human/gpt alternations" } ] }
数据来源
- 总对话次数: 81,167,793
- 总行数: 31,249,070
| 序号 | 来源 | 百分比 | 对话次数 | 行数 |
|---|---|---|---|---|
| 1 | Flan: English | 20.33% | 16,500,966 | 8,250,483 |
| 2 | Flan: Non English | 18.47% | 14,995,714 | 7,497,857 |
| 3 | sodey | 9.71% | 7,883,090 | 917,016 |
| 4 | OIG soda_dialog | 7.93% | 6,436,873 | 1,191,582 |
| 5 | various orca style reaugmentations | 3.62% | 2,934,794 | 878,547 |
| 6 | Select Stack | 3.59% | 2,911,650 | 1,455,825 |
| 7 | sft-distil | 3.59% | 2,911,634 | 1,455,817 |
| 8 | OIG abstract_infill | 3.52% | 2,858,795 | 232,188 |
| 9 | medical_meadow_cord19 | 2.79% | 2,265,654 | 755,218 |
| 10 | EverythingIsAllYouNeed0.25 | 2.39% | 1,941,198 | 970,599 |
| ... | ... | ... | ... | ... |
| 300 | chapel | 0.00% | 60 | 20 |
| 301 | sparql | 0.00% | 60 | 23 |
| 302 | coldfusion-cfc | 0.00% | 58 | 20 |
| 303 | applescript | 0.00% | 57 | 19 |
| 304 | parrot-internal-representation | 0.00% | 56 | 20 |
| 305 | logos | 0.00% | 55 | 19 |
| 306 | mistral-7b-instruct-v0.2 | 0.00% | 54 | 27 |
| 307 | literate-coffeescript | 0.00% | 53 | 18 |



