Locutusque/hercules-v1.0
Source: Hugging Face | Updated 2024-01-29 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Locutusque/hercules-v1.0
Resource description:
---
language:
- en
- code
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- conversational
- question-answering
tags:
- biology
- math
- chemistry
- code
- not-for-all-audiences
---
# hercules-v1.0 dataset

The Hercules-v1.0 dataset is a turbo-charged version of teknium/openhermes, built by augmenting and updating its data sources. Some of the datasets used in teknium/openhermes are older versions; Hercules-v1.0 addresses this by switching to newer releases of sources such as airoboros and WizardLM. Additionally, Hercules-v1.0 uses ise-uiuc/Magicoder-Evol-Instruct-110K instead of sahil2801/CodeAlpaca-20k as its primary code dataset.
Furthermore, I have removed the Unnatural Instructions dataset, as it may contain "outlier" examples.
The following is a list of data sources used to generate this dataset:
- GPTeacher by teknium
- ise-uiuc/Magicoder-Evol-Instruct-110K
- jondurbin/airoboros-3.2
- WizardLM/WizardLM_evol_instruct_V2_196k
- camel-ai/math
- camel-ai/chemistry
- camel-ai/physics
- camel-ai/biology
- teknium/GPT4-LLM-Cleaned

Just like the original openhermes, this dataset underwent cleaning to eliminate RLHF refusals. This removed approximately 50,000 examples from the dataset.

Example count: 462,912
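
The refusal-cleaning step described above can be illustrated with a minimal sketch. The card does not document the actual filter, so everything here is an assumption: the phrase list in `REFUSAL_PATTERNS`, the `instruction`/`output` field names, and the helper names `is_refusal` and `drop_refusals` are hypothetical and only show the general idea of pattern-based filtering.

```python
import re

# Hypothetical refusal markers (assumption): the phrases actually used to
# clean Hercules-v1.0 are not documented in the dataset card.
REFUSAL_PATTERNS = [
    r"\bas an ai language model\b",
    r"\bi'm sorry, but i (?:cannot|can't)\b",
    r"\bi cannot (?:assist|help|comply) with\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the refusal markers."""
    return _REFUSAL_RE.search(response) is not None


def drop_refusals(examples):
    """Keep only examples whose 'output' field (assumed name) is not a refusal."""
    return [ex for ex in examples if not is_refusal(ex["output"])]


# Toy records for illustration only; they are not taken from the dataset.
sample = [
    {"instruction": "Add 2 and 3.", "output": "2 + 3 = 5."},
    {"instruction": "Write a limerick.",
     "output": "As an AI language model, I cannot do that."},
    {"instruction": "Explain osmosis.",
     "output": "I'm sorry, but I can't help with that request."},
]

cleaned = drop_refusals(sample)
print(len(sample) - len(cleaned), "examples removed")
```

A phrase-based filter like this is cheap to run over hundreds of thousands of examples, but it can miss paraphrased refusals and can drop legitimate answers that happen to quote one of the phrases, so the pattern list would need tuning against real data.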
# disclaimer
This dataset contains jondurbin/airoboros-3.2, which is reported to contain toxic examples. As a result, you must acknowledge/agree to the following to use this data:
- a small sampling of the data contained within is "toxic"/"harmful", and contains profanity and other types of sensitive content
- none of the content or views contained in the dataset necessarily align with my personal beliefs or opinions; they are simply text generated by LLMs without a great amount of validation
- you are able to use the dataset lawfully, particularly in locations with less-than-free speech laws
- you, and you alone, are responsible for having downloaded and used the dataset, and I am completely indemnified from any and all liabilities
Provider:
Locutusque
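Summary of original information
Hercules-v1.0 dataset
Dataset overview
The Hercules-v1.0 dataset is an enhanced version of the teknium/openhermes dataset, produced by updating and replacing some of its data sources. It is intended mainly for text generation, conversational, and question-answering tasks.
Dataset characteristics
- Language: English
- Size: 100K<n<1M
- Task types: text generation, conversational, question answering
- Tags: biology, math, chemistry, code, not-for-all-audiences
Data sources
The Hercules-v1.0 dataset includes the following data sources:
- GPTeacher by teknium
- ise-uiuc/Magicoder-Evol-Instruct-110K
- jondurbin/airoboros-3.2
- WizardLM/WizardLM_evol_instruct_V2_196k
- camel-ai/math
- camel-ai/chemistry
- camel-ai/physics
- camel-ai/biology
- teknium/GPT4-LLM-Cleaned
Data processing
- The Unnatural Instructions dataset was removed because it may contain "outlier" examples.
- The data was cleaned to eliminate RLHF refusals, which removed approximately 50,000 examples.
Dataset size
- Example count: 462,912
Disclaimer
- The included jondurbin/airoboros-3.2 data is reported to contain toxic examples and may include sensitive content.
- The dataset content does not represent the author's views; it is simply text generated by LLMs.
- Users must ensure their use is lawful, particularly in regions where speech is restricted.
- Users bear sole responsibility for using the dataset; the author accepts no liability.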



