
Locutusque/hercules-v1.0

Source: Hugging Face. Updated 2024-01-29; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Locutusque/hercules-v1.0
Resource description:
---
language:
- en
- code
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- conversational
- question-answering
tags:
- biology
- math
- chemistry
- code
- not-for-all-audiences
---

# hercules-v1.0 dataset

![Futuristic City](https://th.bing.com/th/id/OIG2.bVF4ufrWlwPjo7VIHIVD?pid=ImgGn)

The Hercules-v1.0 dataset is a turbo-charged version of teknium/openhermes, achieved by augmenting its data sources. Some of the datasets used in teknium/openhermes are older versions. Hercules-v1.0 addresses this issue by updating the data sources such as airoboros and WizardLM. Additionally, Hercules-v1.0 uses ise-uiuc/Magicoder-Evol-Instruct-110K instead of sahil2801/CodeAlpaca-20k as the primary code dataset.

Furthermore, I have removed the Unnatural Instructions dataset, as it may contain "outlier" examples.

The following is a list of data sources used to generate this dataset:

- GPTeacher by teknium
- ise-uiuc/Magicoder-Evol-Instruct-110K
- jondurbin/airoboros-3.2
- WizardLM/WizardLM_evol_instruct_V2_196k
- camel-ai/math
- camel-ai/chemistry
- camel-ai/physics
- camel-ai/biology
- teknium/GPT4-LLM-Cleaned

Just like the original openhermes, this dataset underwent cleaning to eliminate RLHF refusals. This removed approximately 50,000 examples from the dataset.

example count: 462,912

# disclaimer

This dataset contains jondurbin/airoboros-3.2, which is said to have toxic examples. As a result, you must acknowledge/agree to the following to use this data:

- a small sampling of the data contained within is "toxic"/"harmful", and contains profanity and other types of sensitive content
- none of the content or views contained in the dataset necessarily align with my personal beliefs or opinions, they are simply text generated by LLMs without a great amount of validation
- you are able to use the dataset lawfully, particularly in locations with less-than-free speech laws
- you, and you alone are responsible for having downloaded and used the dataset, and I am completely indemnified from any and all liabilities
Provider:
Locutusque
Summary of original information

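Hercules-v1.0 dataset

Dataset overview

The Hercules-v1.0 dataset is an enhanced version of the teknium/openhermes dataset, produced by updating and replacing some of its data sources. It is intended mainly for text generation, conversational, and question-answering tasks.

Dataset characteristics

  • Language: English
  • Size category: 100K<n<1M
  • Task types: text generation, conversational, question answering
  • Tags: biology, math, chemistry, code, not-for-all-audiences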

Data sources

The Hercules-v1.0 dataset draws on the following data sources:

  • GPTeacher by teknium
  • ise-uiuc/Magicoder-Evol-Instruct-110K
  • jondurbin/airoboros-3.2
  • WizardLM/WizardLM_evol_instruct_V2_196k
  • camel-ai/math
  • camel-ai/chemistry
  • camel-ai/physics
  • camel-ai/biology
  • teknium/GPT4-LLM-Cleaned

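Data processing

  • The Unnatural Instructions dataset was removed because it may contain "outlier" examples.
  • The data was cleaned to eliminate RLHF refusals, which removed approximately 50,000 examples.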
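
The dataset card does not say how RLHF refusals were detected. As a rough illustration only, the sketch below filters responses with a keyword match. The phrase list, the `output` field name, and the helper names `is_refusal` and `drop_refusals` are all assumptions made for this example, not the author's actual cleaning procedure.

```python
import re

# Hypothetical refusal markers: the dataset card does not list the
# phrases that were actually used during cleaning.
REFUSAL_PATTERNS = [
    r"\bas an ai\b",
    r"\bas a language model\b",
    r"\bi cannot fulfill\b",
    r"\bi'm sorry, but\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_refusal(response: str) -> bool:
    """Return True if the response contains any assumed refusal phrase."""
    return _REFUSAL_RE.search(response) is not None


def drop_refusals(examples, response_key="output"):
    """Keep only examples whose response does not look like a refusal.

    `response_key` is an assumed field name; adjust it to the real schema.
    """
    return [ex for ex in examples if not is_refusal(ex[response_key])]


# Tiny made-up sample, not taken from the dataset.
sample = [
    {"instruction": "Name a noble gas.", "output": "Helium is a noble gas."},
    {
        "instruction": "Write a poem.",
        "output": "I'm sorry, but as an AI language model I cannot do that.",
    },
]
cleaned = drop_refusals(sample)
print(len(cleaned))  # prints 1
```

A keyword filter like this is cheap but blunt: it can drop legitimate answers that happen to quote one of the phrases, so inspecting a sample of the removed rows before discarding them is a sensible precaution.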

Dataset size

  • Example count: 462,912

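Disclaimer

  • The included jondurbin/airoboros-3.2 data is said to contain toxic examples and may include sensitive content.
  • The dataset's content does not represent the author's views; it is simply text generated by LLMs.
  • Users must make sure they can use the dataset lawfully, particularly in locations with restricted freedom of speech.
  • Users alone are responsible for their use of the dataset; the author accepts no liability.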