
Replete-AI/OpenHermes-2.5-Uncensored

Hugging Face | Updated 2024-06-13 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/Replete-AI/OpenHermes-2.5-Uncensored
Resource description:
---
license: apache-2.0
language:
- eng
pretty_name: OpenHermes 2.5
tags:
- synthetic
- GPT-4
- Distillation
- Compilation
---

## This is the teknium/OpenHermes-2.5 dataset with 2,697 censored lines removed using my uncensoring code, found below.

- https://huggingface.co/datasets/Replete-AI/data_processing_code

## Thank you teknium for the original dataset, which you can find below.

- https://huggingface.co/datasets/teknium/OpenHermes-2.5

## This is the same version of Open-Hermes-2.5 that was used in code_bagel_hermes-2.5, found below:

- https://huggingface.co/datasets/Replete-AI/code_bagel_hermes-2.5

___

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/S1OhWCy0EWcvFda4w5w_o.png)

# Dataset Card for Dataset Name

This is the dataset behind the OpenHermes 2.5 and Nous Hermes 2 series of models.

Support me on GitHub sponsors <3 : https://github.com/sponsors/teknium1

## Dataset Details

### Dataset Description

The Open Hermes 2/2.5 and Nous Hermes 2 models have advanced the state of the art among LLMs in recent months, and are underpinned by this exact compilation and curation of many open-source datasets and custom-created synthetic datasets.

The Open Hermes 2.5 dataset continues the Open Hermes 1 dataset as a much larger, more diverse, and higher-quality compilation, reaching 1M primarily synthetically generated instruction and chat samples.
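The note at the top of this card says censored lines were removed with code kept in the linked `data_processing_code` repository. That code is not reproduced here, so the following is only a minimal sketch of how such a filter could work, assuming records in the sharegpt layout this card describes under Dataset Structure. The marker phrases and the function names `is_censored` and `drop_censored` are hypothetical and are not taken from the original code.

```python
# Hypothetical marker phrases, chosen only for illustration; the phrase list
# actually used for this dataset lives in the linked data_processing_code repo.
REFUSAL_MARKERS = (
    "as an ai language model",
    "i cannot fulfill",
)


def is_censored(record, markers=REFUSAL_MARKERS):
    """Return True if any "gpt" turn contains one of the marker phrases."""
    for turn in record.get("conversations", []):
        if turn.get("from") != "gpt":
            continue
        text = turn.get("value", "").lower()
        if any(marker in text for marker in markers):
            return True
    return False


def drop_censored(records, markers=REFUSAL_MARKERS):
    """Return (kept_records, number_removed)."""
    kept = [r for r in records if not is_censored(r, markers)]
    return kept, len(records) - len(kept)


sample = [
    {"conversations": [
        {"from": "human", "value": "What is 7 * 28?"},
        {"from": "gpt", "value": "7 * 28 = 196."},
    ]},
    {"conversations": [
        {"from": "human", "value": "Tell me a story."},
        {"from": "gpt", "value": "As an AI language model, I cannot do that."},
    ]},
]

kept, removed = drop_censored(sample)
print(f"kept {len(kept)} record(s), removed {removed}")
```

Matching is case-insensitive and only inspects `"gpt"` turns, so a human prompt that happens to quote a marker phrase does not cause the record to be dropped.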
## Lilac Integration

This dataset has been pushed to the live HuggingFace Space of Lilac (a data curation and exploration platform), which hosts many popular open-source datasets for exploration and curation and supports text-embedding search and clustering over them.

Check it out here: https://lilacai-lilac.hf.space/datasets#lilac/OpenHermes-2.5

## Dataset Sources

### Airoboros 2.2

By Jon Durbin: https://huggingface.co/datasets/jondurbin/airoboros-2.2

### CamelAI Domain Expert Datasets (Physics, Math, Chemistry & Biology)

By CamelAI: https://huggingface.co/camel-ai

### ChatBot Arena (GPT-4 Only)

By LMSys: https://huggingface.co/datasets/lmsys/lmsys-chat-1m

### Collective Cognition (09-11-2023)

By Teknium: https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-09-22

### CoT Alpaca GPT4

I have lost the source page for this dataset, sorry.

### Evol Instruct 70K && 140K

By WizardLM:

70K: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k

140K: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k

### Glaive Code Assistant

By Sahil & Glaive: https://huggingface.co/datasets/glaiveai/glaive-code-assistant

### GPT4-LLM

By Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao: https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned

### GPTeacher

By Teknium & Kuruminha: https://github.com/teknium1/GPTeacher

### Medical Tasks

By CogStack: https://github.com/CogStack/OpenGPT

### MetaMath 40k

By MetaMath: https://huggingface.co/datasets/meta-math/MetaMathQA

### SlimOrca 550K

By Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and Teknium: https://huggingface.co/datasets/Open-Orca/SlimOrca

### Platypus

By Garage-bAInd: https://huggingface.co/datasets/garage-bAInd/Open-Platypus

### ShareGPT (GPT4-Only)

By LMSys: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

### Unnatural Instructions GPT4

By Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM

## Dataset Structure

The dataset follows a sharegpt structure. This means it is a list of dictionaries, with each dictionary containing a list of dicts called conversations. Each turn in a conversation is a dictionary with two fields: a "from" field, which denotes the role of that turn, and a "value" field, which contains the actual text.

Here is an example of an entry:

```json
{
  "conversations": [
    {
      "from": "system",
      "value": ""
    },
    {
      "from": "human",
      "value": "Every day, a tree drops 7 leaves. How many leaves would it drop in a month of February in a non-leap year? Include your logic."
    },
    {
      "from": "gpt",
      "value": "Here's the logic behind this:\n\n1. We know that February has 28 days in a non-leap year.\n2. If the tree drops 7 leaves every day, then over the course of February, it would drop:\n Leaves dropped in February = Leaves per day * Days in February\n = 7 leaves * 28 days\n = 196 leaves\n\nSo, the tree would drop 196 leaves in February in a non-leap year."
    }
  ],
  "source": "airoboros2.2",
  "category": "orca"
}
```

Some examples, like the one above, contain the metadata that came with the original dataset, such as "category" being "orca", and many record the source dataset they were curated from.

## Citation

```bibtex
@misc{OpenHermes 2.5,
  title = {OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants},
  author = {Teknium},
  year = {2023},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/teknium/OpenHermes-2.5}
}
```
Provider:
Replete-AI
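Summary of original information

Dataset card

Dataset description

The Open Hermes 2/2.5 and Nous Hermes 2 models have made significant progress over the past several months, built on the exact compilation and curation of many open-source datasets and custom-generated synthetic datasets.

The Open Hermes 2.5 dataset is a continuation of the Open Hermes 1 dataset at a larger scale, with greater diversity and higher quality, consisting mainly of 1 million synthetically generated instruction and chat samples.

Dataset sources

Main sources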

  • Airoboros 2.2

    • Author: Jon Durbin
    • Link: https://huggingface.co/datasets/jondurbin/airoboros-2.2
  • CamelAI Domain Expert Datasets (Physics, Math, Chemistry and Biology)

    • Author: CamelAI
    • Link: https://huggingface.co/camel-ai
  • ChatBot Arena (GPT-4 only)

    • Author: LMSys
    • Link: https://huggingface.co/datasets/lmsys/lmsys-chat-1m
  • Collective Cognition (2023-09-22)

    • Author: Teknium
    • Link: https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-09-22
  • CoT Alpaca GPT4

    • Author: unknown
  • Evol Instruct 70K and 140K

    • Author: WizardLM
    • Links:
      • 70K: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
      • 140K: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
  • Glaive Code Assistant

    • Author: Sahil & Glaive
    • Link: https://huggingface.co/datasets/glaiveai/glaive-code-assistant
  • GPT4-LLM

    • Author: Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao
    • Link: https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned
  • GPTeacher

    • Author: Teknium & Kuruminha
    • Link: https://github.com/teknium1/GPTeacher
  • Medical Tasks

    • Author: CogStack
    • Link: https://github.com/CogStack/OpenGPT
  • MetaMath 40k

    • Author: MetaMath
    • Link: https://huggingface.co/datasets/meta-math/MetaMathQA
  • SlimOrca 550K

    • Author: Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, Teknium
    • Link: https://huggingface.co/datasets/Open-Orca/SlimOrca
  • Platypus

    • Author: Garage-bAInd
    • Link: https://huggingface.co/datasets/garage-bAInd/Open-Platypus
  • ShareGPT (GPT-4 only)

    • Author: LMSys
    • Link: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
  • Unnatural Instructions GPT4

    • Author: Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao
    • Link: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM

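Dataset structure

The dataset follows the sharegpt structure: a list of dictionaries, each containing a list named conversations. Each turn in a conversation is a dictionary with two fields: a "from" field, which gives the role of that turn, and a "value" field, which holds the actual text.

Example: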

    {
      "conversations": [
        {
          "from": "system",
          "value": ""
        },
        {
          "from": "human",
          "value": "Every day, a tree drops 7 leaves. How many leaves would it drop in a month of February in a non-leap year? Include your logic."
        },
        {
          "from": "gpt",
          "value": "Here's the logic behind this:\n\n1. We know that February has 28 days in a non-leap year.\n2. If the tree drops 7 leaves every day, then over the course of February, it would drop:\n Leaves dropped in February = Leaves per day * Days in February\n = 7 leaves * 28 days\n = 196 leaves\n\nSo, the tree would drop 196 leaves in February in a non-leap year."
        }
      ],
      "source": "airoboros2.2",
      "category": "orca"
    }
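To make the sharegpt layout concrete, here is a minimal sketch that parses one record and pairs each "human" turn with the "gpt" turn that follows it. The record is a shortened copy of the example entry, and the helper name `prompt_response_pairs` is an assumption for illustration, not part of the dataset or any library.

```python
import json

# A shortened copy of the example entry, kept inline so the sketch is self-contained.
RECORD_JSON = """
{
  "conversations": [
    {"from": "system", "value": ""},
    {"from": "human", "value": "Every day, a tree drops 7 leaves. How many leaves would it drop in February of a non-leap year?"},
    {"from": "gpt", "value": "7 leaves * 28 days = 196 leaves"}
  ],
  "source": "airoboros2.2",
  "category": "orca"
}
"""


def prompt_response_pairs(record):
    """Pair each "human" turn with the next "gpt" turn, in order."""
    pairs = []
    pending_prompt = None
    for turn in record["conversations"]:
        if turn["from"] == "human":
            pending_prompt = turn["value"]
        elif turn["from"] == "gpt" and pending_prompt is not None:
            pairs.append((pending_prompt, turn["value"]))
            pending_prompt = None
    return pairs


record = json.loads(RECORD_JSON)
pairs = prompt_response_pairs(record)
for prompt, response in pairs:
    print("PROMPT:  ", prompt)
    print("RESPONSE:", response)
```

The "system" turn is skipped here. A training pipeline would normally carry it along as well, but how it is used depends on the chat template of the target model.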

Some examples carry the metadata of the original dataset, such as "category" being "orca", and many record the source dataset they came from.
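Because only some records carry "source" and "category", code that reads these fields should treat them as optional. The sketch below tallies both fields over a few hand-written stand-in records. The helper `count_field` and the `"(none)"` placeholder label are assumptions made for illustration.

```python
from collections import Counter

# Hand-written stand-in records; real entries also have non-empty "conversations".
records = [
    {"conversations": [], "source": "airoboros2.2", "category": "orca"},
    {"conversations": [], "source": "airoboros2.2"},
    {"conversations": []},
]


def count_field(records, field, missing="(none)"):
    """Count the values of an optional metadata field, bucketing absent or empty values."""
    return Counter(record.get(field) or missing for record in records)


source_counts = count_field(records, "source")
category_counts = count_field(records, "category")
print(source_counts.most_common())
print(category_counts.most_common())
```

Using `record.get(field) or missing` covers both a missing key and a key stored as `None` or an empty string, so the tally does not fail on records without metadata.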

Citation

    @misc{OpenHermes 2.5,
      title = {OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants},
      author = {Teknium},
      year = {2023},
      publisher = {HuggingFace},
      url = {https://huggingface.co/datasets/teknium/OpenHermes-2.5}
    }

The content above was collected and summarized by 遇见数据集.