totally-not-an-llm/EverythingLM-data
Source: Hugging Face. Updated 2023-08-03; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/totally-not-an-llm/EverythingLM-data
Resource description:
---
license: mit
---
# EverythingLM Dataset
**EverythingLM** is a diverse instruct dataset consisting of ~1k sets of system prompts, instructions, and corresponding responses. These sets were generated using principles from both evol-instruct and Orca. The dataset encompasses a wide array of topics and interactions.
### Categories:
- Reasoning
- Creative Writing
- General Knowledge
- Brainstorming
- Search Query
- Coding
- Basic Instruct
We also leverage various system prompts for evol-instruct and for responding to prompts.
This dataset has also been filtered to remove OpenAI alignment.
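The README does not say how the alignment filtering was done. As a minimal sketch, assuming it amounts to dropping records whose response contains typical refusal or disclaimer phrases, a filter could look like the following. The marker phrases, the function names, and the `output` field name are all assumptions for illustration, not taken from the repository's script.

```python
from typing import Dict, Iterable, List

# Hypothetical marker phrases; the README does not list what was actually filtered.
ALIGNMENT_MARKERS = (
    "as an ai language model",
    "as an ai developed by openai",
    "i cannot fulfill",
)


def looks_aligned(text: str) -> bool:
    """Return True if the text contains any marker phrase (case-insensitive)."""
    lowered = text.lower()
    return any(marker in lowered for marker in ALIGNMENT_MARKERS)


def filter_records(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose "output" field contains no marker phrase."""
    return [r for r in records if not looks_aligned(r.get("output", ""))]
```

A plain substring match like this is easy to audit but can over-filter, for example a response that legitimately discusses OpenAI, so the phrase list would need tuning against real data.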
### How it stands out:
- Long, detailed outputs
- Humanlike creativity
- CoT reasoning
- Complex & challenging tasks
### Plans:
- Train Llama 7b & 13b models
- Train Llama 70b QLoRA
- Generate V2 of the dataset, with more categories and GPT-4
### How does it work?
1. Generate a list of categories, prompts, sysprompts, etc. (human)
2. Generate seed prompts (GPT)
3. Evolve prompts (GPT)
4. Generate responses (GPT)
5. Convert to Alpaca dataset format
Included in this repo is the script to generate the dataset. However, it is buggy and probably not the best implementation possible.
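The repository's script is not reproduced here. As a rough illustration of steps 2 to 5 only, the sketch below runs one category through seeding, evolution, and response generation, then emits an Alpaca-style record. Everything in it is an assumption: the prompt templates, the `generate` callable (a stand-in for whatever GPT client the real script uses), the number of evolution rounds, and the choice to place the system prompt in `instruction` and the evolved prompt in `input`.

```python
from typing import Callable, Dict

# Hypothetical templates; the real script's wording is not given in the README.
SEED_TEMPLATE = "Write one {category} task for an AI assistant."
EVOLVE_TEMPLATE = (
    "Rewrite the following task so that it is more complex and challenging:\n"
    "{prompt}"
)


def build_record(
    category: str,
    system_prompt: str,
    generate: Callable[[str], str],
    evolve_rounds: int = 2,
) -> Dict[str, str]:
    """Run one category through the pipeline and return an Alpaca-style record.

    `generate` maps a prompt string to model output; any text-generation
    client can be wrapped to fit this signature.
    """
    # Step 2: ask the model for a seed prompt in the given category.
    prompt = generate(SEED_TEMPLATE.format(category=category))

    # Step 3: evolve the prompt a fixed number of times (evol-instruct style).
    for _ in range(evolve_rounds):
        prompt = generate(EVOLVE_TEMPLATE.format(prompt=prompt))

    # Step 4: generate the response, conditioned on the system prompt.
    response = generate(f"{system_prompt}\n\n{prompt}")

    # Step 5: Alpaca format uses the keys "instruction", "input", "output".
    # Assumption: system prompt -> "instruction", evolved prompt -> "input".
    return {"instruction": system_prompt, "input": prompt, "output": response}
```

Passing the model as a plain callable keeps the sketch testable offline: a stub such as `lambda p: "stub"` exercises the control flow without any API access.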
Provided by:
totally-not-an-llm
Summary of original information
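EverythingLM Dataset overview
Dataset description
EverythingLM is a diverse instruction dataset of about 1,000 sets of system prompts, instructions, and corresponding responses. It was generated using principles from evol-instruct and Orca, and covers a wide range of topics and interaction types.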
Dataset categories
- Reasoning
- Creative Writing
- General Knowledge
- Brainstorming
- Search Query
- Coding
- Basic Instruct
Dataset features
- Long, detailed outputs
- Humanlike creativity
- Chain-of-thought (CoT) reasoning
- Complex and challenging tasks
Future plans
- Train Llama 7b and 13b models
- Train Llama 70b QLoRA
- Generate V2 of the dataset, with more categories and GPT-4
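Dataset generation process
- Generate a list of categories, prompts, system prompts, etc. (human)
- Generate seed prompts (GPT)
- Evolve prompts (GPT)
- Generate responses (GPT)
- Convert to Alpaca dataset format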
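Dataset usage
This repository includes the script used to generate the dataset, but it is buggy and may not be the best implementation.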



