H-D-T/Buzz-slice-4-10-V1.2

Name: H-D-T/Buzz-slice-4-10-V1.2
Creator: H-D-T
Published: 2024-09-02 13:34:14
License: No description available

Hugging Face | Updated: 2024-09-02 | Indexed: 2024-06-29

Download link:

https://hf-mirror.com/datasets/H-D-T/Buzz-slice-4-10-V1.2

Resource overview:

The Buzz dataset is a highly curated, pretraining-scale assistant dataset that unifies reinforcement learning (RL) and supervised fine-tuning (SFT). It contains 435 high-quality instruction-following and conversational datasets, deduplicated and formatted for compatibility with the current local ecosystem. It covers instruction-following, conversational, storytelling, and coding data, with over 5 million new rows and several million reaugmented rows. The total number of conversation turns is approximately 85 million, across both single-turn and multi-turn rows. The format is compatible with Axolotl and lmsys FastChat, and each row includes fields such as source, stack, question_index, and conversations.

Provider:

H-D-T

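Summary of original information

Dataset overview

Basic information

License: CC BY 4.0
Language: English
Tags: synthetic data, code, Orca, Alignment-Lab-AI, DPO, reinforcement learning, RLHF, ShareGPT, ChatML, text generation, instruction
Name: Select Stack
Size: 1B < n < 10B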

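Dataset description

Name: Buzz
Type: Pretraining-scale assistant dataset
Features: Contains 435 high-quality instruction-following and conversational datasets, deduplicated and formatted to maintain and extend the compatibility of training types with the current local ecosystem.
Data volume: More than 5 million new rows and several million reaugmented rows, totaling approximately 85 million conversation turns.

Data structure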

{
  "source": "string containing the source dataset",
  "stack": "chosen/rejected for RL techniques",
  "question_index": "optional row, only contained in DPO specific dataset to match dpo pairs - int64",
  "conversations": [
    {
      "from": "system",
      "value": "an initial system prompt or user query, may or may not be present depending on the row"
    },
    {
      "from": "human or system",
      "value": "an initial human query"
    },
    {
      "from": "gpt",
      "value": "a response to the previous turn, may be followed by additional human/gpt alternations"
    }
  ]
}
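The schema above can be exercised with a small sketch. It makes several assumptions that the source does not state: that the `stack` field holds the literal strings "chosen" and "rejected", that DPO pairs share both `source` and `question_index`, and that the speakers map to chat roles in the usual ShareGPT way (human to user, gpt to assistant). The helper names and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Any

# Assumed role mapping (ShareGPT-style speakers to chat roles); not stated in the source.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_messages(row: dict[str, Any]) -> list[dict[str, str]]:
    """Turn a row's `conversations` list into role/content messages."""
    messages = []
    for turn in row["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            raise ValueError(f"unexpected speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages


def pair_by_question_index(rows: list[dict[str, Any]]) -> list[tuple[dict, dict]]:
    """Match chosen/rejected rows that share `source` and `question_index`."""
    groups: dict[tuple[str, int], dict[str, dict]] = defaultdict(dict)
    for row in rows:
        index = row.get("question_index")  # optional field, DPO-specific rows only
        if index is not None and row.get("stack") in ("chosen", "rejected"):
            groups[(row["source"], index)][row["stack"]] = row
    return [
        (group["chosen"], group["rejected"])
        for _, group in sorted(groups.items())
        if "chosen" in group and "rejected" in group
    ]


# Hypothetical sample rows shaped like the schema above.
rows = [
    {"source": "demo-dpo", "stack": "chosen", "question_index": 0,
     "conversations": [{"from": "human", "value": "Name a prime."},
                       {"from": "gpt", "value": "7"}]},
    {"source": "demo-dpo", "stack": "rejected", "question_index": 0,
     "conversations": [{"from": "human", "value": "Name a prime."},
                       {"from": "gpt", "value": "8"}]},
    {"source": "demo-sft", "stack": "chosen",
     "conversations": [{"from": "human", "value": "Hi"},
                       {"from": "gpt", "value": "Hello!"}]},
]

pairs = pair_by_question_index(rows)
print(len(pairs))                    # 1
print(to_messages(pairs[0][0])[-1])  # {'role': 'assistant', 'content': '7'}
```

Rows without a `question_index` are left out of the pairing, which matches the schema's note that the field is only present in DPO-specific data.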

Data sources

Total conversation turns: 81,167,793
Total rows: 31,249,070

No.	Source	Percentage	Turns	Rows
1	Flan: English	20.33%	16,500,966	8,250,483
2	Flan: Non English	18.47%	14,995,714	7,497,857
3	sodey	9.71%	7,883,090	917,016
4	OIG soda_dialog	7.93%	6,436,873	1,191,582
5	various orca style reaugmentations	3.62%	2,934,794	878,547
6	Select Stack	3.59%	2,911,650	1,455,825
7	sft-distil	3.59%	2,911,634	1,455,817
8	OIG abstract_infill	3.52%	2,858,795	232,188
9	medical_meadow_cord19	2.79%	2,265,654	755,218
10	EverythingIsAllYouNeed0.25	2.39%	1,941,198	970,599
...	...	...	...	...
300	chapel	0.00%	60	20
301	sparql	0.00%	60	23
302	coldfusion-cfc	0.00%	58	20
303	applescript	0.00%	57	19
304	parrot-internal-representation	0.00%	56	20
305	logos	0.00%	55	19
306	mistral-7b-instruct-v0.2	0.00%	54	27
307	literate-coffeescript	0.00%	53	18
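As a quick sanity check on the table, the sketch below recomputes each source's share from the totals listed above. The percentage column appears to be the share of total conversation turns rather than of rows; that reading is an inference from the arithmetic, not something the source states. Only the first four rows are copied in, and the `summarize` helper is hypothetical.

```python
# Totals and the first four rows, copied from the table above.
TOTAL_TURNS = 81_167_793
TOTAL_ROWS = 31_249_070

sources = [
    # (source, listed percentage, turns, rows)
    ("Flan: English", 20.33, 16_500_966, 8_250_483),
    ("Flan: Non English", 18.47, 14_995_714, 7_497_857),
    ("sodey", 9.71, 7_883_090, 917_016),
    ("OIG soda_dialog", 7.93, 6_436_873, 1_191_582),
]


def summarize(turns: int, rows: int) -> tuple[float, float]:
    """Return (share of all turns in percent, average turns per row)."""
    share = 100 * turns / TOTAL_TURNS
    return round(share, 2), round(turns / rows, 2)


for name, listed, turns, rows in sources:
    share, turns_per_row = summarize(turns, rows)
    print(f"{name}: {share}% of turns (listed {listed}%), "
          f"{turns_per_row} turns per row")

# Average conversation length across the whole dataset.
overall = round(TOTAL_TURNS / TOTAL_ROWS, 2)
print(f"overall: {overall} turns per row")
```

The turns-per-row figure also separates the single-exchange sources (both Flan rows come out at exactly 2 turns per row) from the longer multi-turn dialogue sources such as sodey.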
