H-D-T/Buzz-slice-10-10-V1.2
Hugging Face. Updated 2024-09-02; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/H-D-T/Buzz-slice-10-10-V1.2
Overview:
The Buzz dataset is a highly curated, pretraining-scale assistant dataset that unifies reinforcement learning (RL) and supervised fine-tuning (SFT) data. It draws on 435 high-quality instruction-following and conversational datasets, deduplicated and formatted for compatibility with the current local ecosystem. It covers instruction-following, conversational, storytelling, and coding data, and adds over 5 million new rows plus several million reaugmented rows. In total it comprises approximately 85 million conversation turns across single-turn and multi-turn rows. The dataset's format is compatible with Axolotl and FastChat, and it is primarily used for training and optimizing language models.
Provider:
H-D-T
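Summary of original information
Buzz dataset overview
Basic information
- License: CC BY 4.0
- Language: English
- Tags: synthetic, code, Orca, Alignment-Lab-AI, DPO, reinforcement learning, RLHF, ShareGPT, ChatML, text generation, instruction
- Dataset name: Select Stack
- Dataset size: 1B < n < 10B
Dataset introduction
- Dataset type: high-quality instruction-following and conversational dataset
- Data sources: 435 high-quality instruction-following, conversational, storytelling, and coding datasets
- Data volume: over 5 million new rows and several million reaugmented rows, totaling approximately 85 million conversation turns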
Data structure

    {
      "source": "string containing the source dataset",
      "stack": "chosen/rejected for RL techniques",
      "question_index": "optional row, only contained in DPO specific dataset to match dpo pairs - int64",
      "conversations": [
        {
          "from": "system",
          "value": "an initial system prompt or user query, may or may not be present depending on the row"
        },
        {
          "from": "human or system",
          "value": "an initial human query"
        },
        {
          "from": "gpt",
          "value": "a response to the previous turn, may be followed by additional human/gpt alternations"
        }
      ]
    }
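To make the row layout above concrete, here is a minimal Python sketch that checks one row against the documented fields and counts its turns. The sample row's values are invented for illustration, the helper names `validate_row` and `count_turns` are hypothetical, and counting only `human` and `gpt` messages as turns is an assumption rather than something the dataset card states.

```python
import json
from typing import Any

# Roles named in the documented schema.
ALLOWED_ROLES = {"system", "human", "gpt"}


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a row; an empty list means it looks well-formed."""
    problems = []
    if not isinstance(row.get("source"), str):
        problems.append("'source' must be a string")
    conversations = row.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        problems.append("'conversations' must be a non-empty list")
        return problems
    for i, message in enumerate(conversations):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        if message.get("from") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {message.get('from')!r}")
        if not isinstance(message.get("value"), str):
            problems.append(f"message {i} has a non-string 'value'")
    return problems


def count_turns(row: dict[str, Any]) -> int:
    """Count human and gpt messages (assumption: system prompts are not turns)."""
    return sum(1 for m in row["conversations"] if m["from"] in {"human", "gpt"})


# Hypothetical sample row; the values are made up and not taken from the dataset.
sample = json.loads("""
{
  "source": "example-source",
  "stack": "chosen",
  "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "2 + 2 equals 4."}
  ]
}
""")

print(validate_row(sample))  # list of problems, empty when the row is well-formed
print(count_turns(sample))   # number of human/gpt messages
```

Rows from DPO-specific subsets would additionally carry `question_index`; the sketch leaves it unchecked because the field is optional.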
Dataset sources
- Total turns: 81,167,793
- Total rows: 31,249,070
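As a quick sanity check, the percentage column in the source table appears to be each source's turn count divided by the total turn count. That reading is inferred from the numbers, not stated by the card. The sketch below recomputes a few shares and the average turns per row from the listed figures; the function name `turn_share` is hypothetical.

```python
# Totals as listed for the dataset.
TOTAL_TURNS = 81_167_793
TOTAL_ROWS = 31_249_070

# (source name, turns) pairs copied from the first rows of the source table.
SOURCES = [
    ("Flan: English", 16_500_966),
    ("Flan: Non English", 14_995_714),
    ("sodey", 7_883_090),
]


def turn_share(turns: int, total: int = TOTAL_TURNS) -> float:
    """Share of all turns contributed by one source, in percent."""
    return 100.0 * turns / total


for name, turns in SOURCES:
    print(f"{name}: {turn_share(turns):.2f}%")

# Average number of turns per row across the whole dataset.
print(f"average turns per row: {TOTAL_TURNS / TOTAL_ROWS:.2f}")
```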
| No. | Source | Percentage | Turns | Rows |
|---|---|---|---|---|
| 1 | Flan: English | 20.33% | 16,500,966 | 8,250,483 |
| 2 | Flan: Non English | 18.47% | 14,995,714 | 7,497,857 |
| 3 | sodey | 9.71% | 7,883,090 | 917,016 |
| 4 | OIG soda_dialog | 7.93% | 6,436,873 | 1,191,582 |
| 5 | various orca style reaugmentations | 3.62% | 2,934,794 | 878,547 |
| 6 | Select Stack | 3.59% | 2,911,650 | 1,455,825 |
| 7 | sft-distil | 3.59% | 2,911,634 | 1,455,817 |
| 8 | OIG abstract_infill | 3.52% | 2,858,795 | 232,188 |
| 9 | medical_meadow_cord19 | 2.79% | 2,265,654 | 755,218 |
| 10 | EverythingIsAllYouNeed0.25 | 2.39% | 1,941,198 | 970,599 |
| 11 | MATH-plus | 2.04% | 1,658,976 | 829,488 |
| 12 | OIG unifiedskg_instructions | 1.14% | 927,267 | 214,793 |
| 13 | OIG nq | 1.03% | 836,194 | 307,373 |
| 14 | MetaMath_DPO_FewShot | 0.97% | 787,998 | 393,999 |
| 15 | MetaMathQA | 0.95% | 770,166 | 385,083 |
| 16 | OpenHermes-2.5 | 0.95% | 769,503 | 367,336 |
| 17 | wildchat-sharegpt | 0.94% | 764,896 | 123,596 |
| 18 | hotdog-gpt | 0.73% | 591,467 | 190,543 |
| 19 | Tess-Coder-v1.0 | 0.72% | 585,038 | 117,008 |
| 20 | OIG canadian_parliament | 0.72% | 581,708 | 290,854 |
| 21 | openhermes | 0.66% | 536,782 | 240,894 |
| 22 | Text-to-sql-v1 | 0.65% | 524,412 | 262,206 |
| 23 | MathInstruct | 0.61% | 491,666 | 245,833 |
| 24 | OIG unnatural_instructions | 0.59% | 476,087 | 238,035 |
| 25 | OIG openai_summarize_tldr | 0.58% | 466,796 | 233,398 |
| 26 | OIG chip2 | 0.52% | 420,564 | 210,282 |
| 27 | orcamath-sharegpt | 0.49% | 399,414 | 199,707 |
| 28 | OIG xp3_sample | 0.46% | 376,276 | 188,138 |
| 29 | anthropic-hh-nectar | 0.43% | 346,892 | 73,687 |
| 30 | reasoningData_200k | 0.41% | 334,004 | 167,002 |
| 31 | OpenCodeInterpreterData | 0.41% | 331,715 | 36,836 |
| 32 | Synthia-v1.3 | 0.41% | 329,115 | 118,841 |
| 33 | yaml | 0.40% | 321,755 | 110,572 |
| 34 | GPTscience_maths_csml | 0.37% | 297,310 | 148,655 |
| 35 | OIG squad_v2 | 0.32% | 260,638 | 19,585 |
| 36 | OIG squad_v2_more_neg | 0.32% | 259,902 | 13,946 |
| 37 | OIG rallio_safety_and_prosocial | 0.31% | 250,534 | 125,235 |
| 38 | MIMIC-medical-report | 0.31% | 250,362 | 83,454 |
| 39 | OIG mathqa_flanv2_kojma_cot | 0.30% | 243,420 | 107,564 |
| 40 | openai_summarize_tldr | 0.29% | 233,336 | 116,668 |
| 41 | OIG sqlv2 | 0.28% | 224,270 | 24,546 |
| 42 | ruby | 0.24% | 197,135 | 68,086 |
| 43 | RPGuild-sharegpt-filtered | 0.24% | 196,309 | 27,053 |
| 44 | OIG multi_news | 0.22% | 179,888 | 89,944 |
| 45 | markdown | 0.22% | 174,608 | 61,260 |
| 46 | javascript | 0.19% | 156,109 | 52,289 |
| 47 | python | 0.19% | 151,866 | 55,045 |
| 48 | know_sql | 0.18% | 148,368 | 49,456 |
| 49 | text | 0.16% | 133,033 | 44,926 |
| 50 | saraswati_stem_formatted | 0.15% | 119,750 | 59,875 |
| 51 | know_saraswati_cot_formatted | 0.14% | 116,408 | 58,204 |
| 52 | json | 0.14% | 115,682 | 39,124 |
| 53 | OIG hc3_human | 0.14% | 112,112 | 56,056 |
| 54 | medical_meadow_medical_flashcards | 0.12% | 100,575 | 33,527 |
| 55 | lmsys-chat-1m-nectar | 0.11% | 86,770 | 43,385 |
| 56 | shell | 0.11% | 85,901 | 30,327 |
| 57 | cogstack-opengpt-sharegpt | 0.10% | 81,667 | 31,532 |
| 58 | Quanta | 0.10% | 78,096 | 26,032 |
| 59 | php | 0.08% | 68,256 | 24,302 |
| 60 | know_logic | 0.08% | 68,208 | 34,104 |
| 61 | html | 0.07% | 57,384 | 19,750 |
| 62 | OIG plot_screenplay_books_dialog | 0.07% | 54,981 | 7,924 |
| 63 | java | 0.07% | 53,574 | 20,150 |
| 64 | Open-Platypus | 0.07% | 53,373 | 24,109 |
| 65 | RFT-GSM-28K | 0.06% | 51,092 | 25,546 |
| 66 | OIG conv_finqa | 0.06% | 50,472 | 9,102 |
| 67 | sharegpt-nectar | 0.06% | 49,896 | 24,948 |
| 68 | OIG cuad | 0.05% | 41,390 | 510 |
| 69 | OpenCerebrum-dpo | 0.05% | 40,534 | 17,013 |
| 70 | Tested-22k-Python-Alpaca | 0.04% | 36,224 | 18,112 |
| 71 | OIG sqlv1 | 0.04% | 34,174 | 17,087 |
| 72 | MedQuad-MedicalQnADataset | 0.04% | 32,718 | 16,359 |
| 73 | piqa | 0.04% | 32,212 | 16,106 |
| 74 | html+erb | 0.04% | 31,679 | 10,708 |
| 75 | OIG image_prompts_instructions | 0.04% | 30,932 | 15,466 |
| 76 | medical_meadow_medqa | 0.04% | 30,534 | 10,178 |
| 77 | ini | 0.04% | 30,461 | 10,396 |
| 78 | medical_meadow_wikidoc | 0.04% | 29,998 | 10,000 |
| 79 | c# | 0.03% | 26,796 | 9,220 |
| 80 | xml | 0.03% | 26,054 | 9,085 |
| 81 | medical_meadow_health_advice | 0.03% | 25,995 | 8,665 |
| 82 | OIG poetry_2_song | 0.03% | 25,462 | 12,731 |
| 83 | flan_v2_niv2-nectar | 0.03% | 24,036 | 12,018 |
| 84 | c | 0.03% | 23,203 | 8,250 |
| 85 | scss | 0.02% | 20,156 | 6,730 |
| 86 | evol_instruct-nectar | 0.02% | 19,930 | 9,965 |
| 87 | ultrachat-nectar | 0.02% | 19,822 | 9,911 |
| 88 | restructuredtext | 0.02% | 18,901 | 6,481 |
| 89 | OpenCerebrum-2.0-SFT | 0.02% | 18,793 | 4,382 |
| 90 | gpteacher-role-play-chatml | 0.02% | 18,222 | 9,111 |
| 91 | OIG grade_school_math_instructions | 0.02% | 17,584 | 8,792 |
| 92 | OIG essays | 0.02% | 17,581 | 2,064 |
| 93 | medical_meadow_wikidoc_patient_information | 0.02% | 17,550 | 5,850 |
| 94 | typescript | 0.02% | 16,912 | 5,816 |
| 95 | coffeescript | 0.02% | 15,836 | 5,403 |
| 96 | go | 0.02% | 14,814 | 4,939 |
| 97 | css | 0.02% | 14,654 | 4,979 |
| 98 | scala | 0.02% | 14,184 | 4,988 |
| 99 | c++ | 0.02% | 13,391 | 4,838 |
| 100 | swift | 0.02% | 13,361 | 4,724 |
| 101 | haml |



