Replete-AI/OpenHermes-2.5-Uncensored
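Dataset Description
The Open Hermes 2/2.5 and Nous Hermes 2 models have made significant advances over the past few months. These advances rest on the careful compilation and curation of many open-source datasets and custom-generated synthetic datasets.
The Open Hermes 2.5 dataset is a continuation of the Open Hermes 1 dataset at a much larger scale, with greater diversity and higher quality. It consists mainly of 1 million synthetically generated instruction and chat samples.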
Dataset Sources
Primary Sources
- Airoboros 2.2
  - Author: Jon Durbin
  - Link: https://huggingface.co/datasets/jondurbin/airoboros-2.2
- CamelAI Domain Expert Datasets (Physics, Math, Chemistry, and Biology)
  - Author: CamelAI
  - Link: https://huggingface.co/camel-ai
- ChatBot Arena (GPT-4 only)
  - Author: LMSys
  - Link: https://huggingface.co/datasets/lmsys/lmsys-chat-1m
- Collective Cognition (2023-09-22)
  - Author: Teknium
  - Link: https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-09-22
- CoT Alpaca GPT4
  - Author: unknown
- Evol Instruct 70K and 140K
  - Author: WizardLM
  - Links:
    - 70K: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
    - 140K: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
- Glaive Code Assistant
  - Author: Sahil & Glaive
  - Link: https://huggingface.co/datasets/glaiveai/glaive-code-assistant
- GPT4-LLM
  - Authors: Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao
  - Link: https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned
- GPTeacher
  - Authors: Teknium & Kuruminha
  - Link: https://github.com/teknium1/GPTeacher
- Medical Tasks
  - Author: CogStack
  - Link: https://github.com/CogStack/OpenGPT
- MetaMath 40k
  - Author: MetaMath
  - Link: https://huggingface.co/datasets/meta-math/MetaMathQA
- SlimOrca 550K
  - Authors: Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, Teknium
  - Link: https://huggingface.co/datasets/Open-Orca/SlimOrca
- Platypus
  - Author: Garage-bAInd
  - Link: https://huggingface.co/datasets/garage-bAInd/Open-Platypus
- ShareGPT (GPT-4 only)
  - Author: LMSys
  - Link: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
- Unnatural Instructions GPT4
  - Authors: Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao
  - Link: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
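Dataset Structure
The dataset follows the ShareGPT structure: it is a list of dictionaries, each of which contains a list named conversations. Each turn in a conversation is itself a dictionary with two fields: "from", which gives the role speaking in that turn, and "value", which holds the actual text.
Example: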
```json
{
  "conversations": [
    {
      "from": "system",
      "value": ""
    },
    {
      "from": "human",
      "value": "Every day, a tree drops 7 leaves. How many leaves would it drop in a month of February in a non-leap year? Include your logic."
    },
    {
      "from": "gpt",
      "value": "Here's the logic behind this:\n- We know that February has 28 days in a non-leap year.\n- If the tree drops 7 leaves every day, then over the course of February, it would drop: Leaves dropped in February = Leaves per day * Days in February = 7 leaves * 28 days = 196 leaves\nSo, the tree would drop 196 leaves in February in a non-leap year."
    }
  ],
  "source": "airoboros2.2",
  "category": "orca"
}
```
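A record in this layout can be walked turn by turn. The following is a minimal sketch, not part of any official tooling for this dataset: `ROLE_MAP` and `to_messages` are hypothetical names, and mapping "human" to "user" and "gpt" to "assistant" is an assumed convention borrowed from common chat formats, not something the dataset card specifies. The message texts in the sample record are abbreviated.

```python
# Assumed mapping from ShareGPT "from" values to chat-style roles.
# The dataset card does not define this; adjust it to your training format.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_messages(record, drop_empty_system=True):
    """Convert one ShareGPT-style record into a list of role/content dicts."""
    messages = []
    for turn in record["conversations"]:
        # Unknown "from" values are passed through unchanged.
        role = ROLE_MAP.get(turn["from"], turn["from"])
        text = turn["value"]
        # The example record carries an empty system prompt; optionally skip it.
        if drop_empty_system and role == "system" and not text:
            continue
        messages.append({"role": role, "content": text})
    return messages


# Abbreviated copy of the example record shown above.
record = {
    "conversations": [
        {"from": "system", "value": ""},
        {"from": "human", "value": "Every day, a tree drops 7 leaves. ..."},
        {"from": "gpt", "value": "So, the tree would drop 196 leaves ..."},
    ],
    "source": "airoboros2.2",
    "category": "orca",
}

for message in to_messages(record):
    print(message["role"], "->", message["content"])
```

Passing `drop_empty_system=False` keeps the empty system turn, which may be preferable if a downstream chat template expects a system slot in every conversation.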
Some examples carry metadata about the sample, such as a "category" of "orca", and many examples also record the source dataset they came from.
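Because these metadata fields are present on only some examples, code that groups records by them should tolerate missing keys. The sketch below counts records per "source" and per "category". The helper name `count_by_field`, the `"(none)"` placeholder, and the three toy records are all illustrative assumptions, not data taken from the dataset.

```python
from collections import Counter


def count_by_field(records, field, missing="(none)"):
    """Count records by an optional metadata field, bucketing absent values."""
    # rec.get(field) returns None when the key is absent; empty strings are
    # also treated as missing.
    return Counter(rec.get(field) or missing for rec in records)


# Toy records for illustration only; real records hold full conversations.
records = [
    {"conversations": [], "source": "airoboros2.2", "category": "orca"},
    {"conversations": [], "source": "airoboros2.2"},
    {"conversations": []},
]

by_source = count_by_field(records, "source")
by_category = count_by_field(records, "category")

print(by_source.most_common())
print(by_category.most_common())
```

The same counts can serve as a quick sanity check before filtering the data down to a subset of sources.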
Citation
```bibtex
@misc{OpenHermes 2.5,
  title = {OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants},
  author = {Teknium},
  year = {2023},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/teknium/OpenHermes-2.5}
}
```




