OpenHermes-2.5
Source: ModelScope community (魔搭社区). Updated 2026-05-23; indexed 2024-06-08.
Download link:
https://modelscope.cn/datasets/swift/OpenHermes-2.5
Resource description:

# Dataset Card for OpenHermes-2.5
This is the dataset used to train the OpenHermes 2.5 and Nous Hermes 2 series of models.
Support me on GitHub sponsors <3 : https://github.com/sponsors/teknium1
## Dataset Details
### Dataset Description
The Open Hermes 2/2.5 and Nous Hermes 2 models have advanced the state of the art among LLMs in recent months, and they are underpinned by this exact compilation and curation of many open-source datasets and custom-created synthetic datasets.
The Open Hermes 2.5 dataset is a continuation of the Open Hermes 1 dataset: a much larger, more diverse, and higher-quality compilation that reaches 1M primarily synthetically generated instruction and chat samples.
## Lilac Integration
This dataset has been pushed to the live HuggingFace Spaces of Lilac, a data curation and exploration platform. The Spaces host many popular open-source datasets for exploration and curation, and support text-embedding search and clustering over those datasets.
Check it out here: https://lilacai-lilac.hf.space/datasets#lilac/OpenHermes-2.5
## Dataset Sources
### Airoboros 2.2
By Jon Durbin: https://huggingface.co/datasets/jondurbin/airoboros-2.2
### CamelAI Domain Expert Datasets (Physics, Math, Chemistry & Biology)
By CamelAI: https://huggingface.co/camel-ai
### ChatBot Arena (GPT-4 Only)
By LMSys: https://huggingface.co/datasets/lmsys/lmsys-chat-1m
### Collective Cognition (09-11-2023)
By Teknium: https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-09-22
### CoT Alpaca GPT4
I have lost the source page for this dataset, sorry
### Evol Instruct 70K and 140K
By WizardLM:
70K: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
140k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
### Glaive Code Assistant
By Sahil & Glaive: https://huggingface.co/datasets/glaiveai/glaive-code-assistant
### GPT4-LLM
By Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao
https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned
### GPTeacher
By Teknium & Kuruminha: https://github.com/teknium1/GPTeacher
### Medical Tasks
By CogStack: https://github.com/CogStack/OpenGPT
### MetaMath 40k
By MetaMath: https://huggingface.co/datasets/meta-math/MetaMathQA
### SlimOrca 550K
By Wing Lian and Guan Wang and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and Teknium:
https://huggingface.co/datasets/Open-Orca/SlimOrca
### Platypus
By Garage-bAInd: https://huggingface.co/datasets/garage-bAInd/Open-Platypus
### ShareGPT (GPT4-Only)
By LMSys: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
### Unnatural Instructions GPT4
By Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao:
https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
## Dataset Structure
The dataset follows the ShareGPT structure. It is a list of dictionaries, and each dictionary contains a list of dictionaries called "conversations". Each turn in a conversation is one dictionary with two fields: a "from" field, which denotes the role of that turn, and a "value" field, which contains the actual text.
Here is an example of an entry:
```json
{
  "conversations": [
    {
      "from": "system",
      "value": ""
    },
    {
      "from": "human",
      "value": "Every day, a tree drops 7 leaves. How many leaves would it drop in a month of February in a non-leap year? Include your logic."
    },
    {
      "from": "gpt",
      "value": "Here's the logic behind this:\n\n1. We know that February has 28 days in a non-leap year.\n2. If the tree drops 7 leaves every day, then over the course of February, it would drop:\n Leaves dropped in February = Leaves per day * Days in February\n = 7 leaves * 28 days\n = 196 leaves\n\nSo, the tree would drop 196 leaves in February in a non-leap year."
    }
  ],
  "source": "airoboros2.2",
  "category": "orca"
}
```
Some examples, like the one above, contain the metadata that came with the source dataset, such as a "category" of "orca", and many record the source dataset they were curated from.
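
As a rough illustration of working with this structure, the sketch below converts one ShareGPT-style entry into a list of role/content messages and tallies entries by their "source" field. The helper names `to_messages` and `count_sources`, the target role names "user" and "assistant", and the "unknown" fallback label are assumptions made for this example and are not defined by the dataset. The second sample entry is hypothetical and only demonstrates an entry without metadata.

```python
from collections import Counter

# Assumed mapping from ShareGPT "from" values to chat roles. The names on
# the right are a common convention, not something this card specifies.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_messages(entry):
    """Convert one ShareGPT-style entry into a list of role/content dicts."""
    messages = []
    for turn in entry["conversations"]:
        role = ROLE_MAP.get(turn["from"], turn["from"])
        text = turn["value"]
        # Skip empty system prompts such as the one in the example entry.
        if role == "system" and not text:
            continue
        messages.append({"role": role, "content": text})
    return messages


def count_sources(entries):
    """Tally entries by "source"; entries without that key count as "unknown"."""
    return Counter(entry.get("source", "unknown") for entry in entries)


# First entry: the example above with the "gpt" answer shortened to its last
# sentence. Second entry: hypothetical, with no "source" or "category".
entries = [
    {
        "conversations": [
            {"from": "system", "value": ""},
            {"from": "human", "value": "Every day, a tree drops 7 leaves. How many leaves would it drop in a month of February in a non-leap year? Include your logic."},
            {"from": "gpt", "value": "So, the tree would drop 196 leaves in February in a non-leap year."},
        ],
        "source": "airoboros2.2",
        "category": "orca",
    },
    {"conversations": [{"from": "human", "value": "Hello"}]},
]

messages = to_messages(entries[0])
source_counts = count_sources(entries)
```

Reading metadata with `dict.get` rather than direct indexing reflects the note above that only some entries carry "source" or "category".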
## Citation
```bibtex
@misc{OpenHermes-2.5,
title = {OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants},
author = {Teknium},
year = {2023},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/teknium/OpenHermes-2.5}
}
```

Provider: maas
Created: 2024-06-06
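OpenHermes-2.5 is a high-quality dataset of about 1M synthetic instruction and chat samples, used to train the OpenHermes 2.5 and Nous Hermes 2 series of models. It combines multiple open-source datasets and organizes the data in the ShareGPT structure.
The summary above was collected and generated by 遇见数据集.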



