netmouse/PromisedChat_Instruction

Name: netmouse/PromisedChat_Instruction
Creator: netmouse
Published: 2024-07-14 12:15:58
License: no description available

Source: Hugging Face. Updated: 2024-07-14. Indexed: 2024-07-22.

Download link:

https://hf-mirror.com/datasets/netmouse/PromisedChat_Instruction

Overview:

The Alpaca-Cleaned dataset is a cleaned version of the original Alpaca dataset. It addresses several issues present in the original release: hallucinations, merged instructions, empty outputs, missing code examples, instructions to generate images, N/A outputs, inconsistent input fields, wrong answers, nonsensical instructions, and extraneous escape characters. The dataset contains 52,000 instructions and demonstrations for instruction-tuning language models. The data was generated with OpenAI's `text-davinci-003` engine, using more aggressive batch decoding to reduce cost. Each record has four fields: instruction, input, output, and formatted text. The dataset is licensed under CC BY-NC 4.0 and is intended for non-commercial use.

Provider:

netmouse

Summary of Original Information

Dataset Overview

Dataset Description

Dataset Name

Alpaca-Cleaned

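Dataset Source

Original dataset: the Alpaca dataset released by Stanford University.
Cleaned version: this dataset is a cleaned version of the original Alpaca dataset that fixes several problems found in it.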

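Issues Fixed in the Dataset

Hallucinations: many instructions in the original dataset referenced data on the internet, which led GPT-3 to hallucinate answers.
Merged instructions: many instructions in the original dataset were merged together.
Empty outputs: some entries in the original dataset had empty outputs.
Missing code examples: some descriptions in the original dataset lacked code examples.
Instructions to generate images: the original dataset contained instructions to generate images, which a text-only model cannot carry out.
N/A outputs: some code snippets in the original dataset had N/A as their output.
Inconsistent input field: the original dataset used the input field inconsistently.
Wrong answers: some instructions/questions in the original dataset had incorrect answers, especially math problems.
Nonsensical/unclear instructions: many instructions in the original dataset were unclear or nonsensical.
Extraneous characters: the original dataset contained extraneous escape and control characters.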
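Some of the issues listed above can be detected mechanically. The following is a minimal sketch of such checks, not the procedure actually used to produce Alpaca-Cleaned, which this page does not describe. The function names, the URL heuristic, and the set of strings treated as "N/A" are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Assumption: a URL in the instruction is a hint that it relies on internet data.
URL_PATTERN = re.compile(r"https?://")
# Assumption: these spellings count as an "N/A" output.
NA_OUTPUTS = {"N/A", "NA"}


def strip_control_chars(text: str) -> str:
    """Remove extraneous control characters, keeping ordinary whitespace."""
    return CONTROL_CHARS.sub("", text)


def find_issues(record: Dict[str, str]) -> List[str]:
    """Return labels for the heuristic problems found in one record."""
    issues = []
    instruction = record.get("instruction", "")
    output = record.get("output", "").strip()

    if not output:
        issues.append("empty output")
    elif output.upper() in NA_OUTPUTS:
        issues.append("N/A output")
    if URL_PATTERN.search(instruction):
        issues.append("references a URL")
    if CONTROL_CHARS.search(instruction) or CONTROL_CHARS.search(output):
        issues.append("control characters")
    return issues
```

Checks like these only flag candidates. Problems such as wrong answers or merged instructions need manual review or stronger tooling.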

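Overview of the Original Alpaca Dataset

Data generation: generated by OpenAI's text-davinci-003 engine.
Size: 52,000 instructions and demonstrations.
Purpose: instruction-tuning language models so that they follow instructions better.
Generation method: based on the Self-Instruct framework, with the following modifications:
- The text-davinci-003 engine was used to generate the instruction data.
- A new prompt template was written that explicitly states the requirements for instruction generation.
- More aggressive batch decoding was used, generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by dropping the distinction between classification and non-classification instructions.
- Only one instance was generated per instruction, instead of 2 to 3.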
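To make the cost argument for batch decoding concrete, the sketch below counts how many generation requests are needed when each request asks for 20 instructions rather than one. It also shows one way a batched prompt could be laid out. The prompt wording and both function names are hypothetical and are not taken from the Alpaca pipeline. The request count is a lower bound, since it assumes every generated instruction is kept.

```python
import math
from typing import List

BATCH_SIZE = 20  # instructions requested per call, as stated in the list above


def requests_needed(total_instructions: int, batch_size: int = BATCH_SIZE) -> int:
    """Lower bound on generation calls, assuming every generated item is kept."""
    return math.ceil(total_instructions / batch_size)


def build_batch_prompt(seed_instructions: List[str], batch_size: int = BATCH_SIZE) -> str:
    """Lay out seed tasks as a numbered list for the model to continue.

    The header wording is a hypothetical placeholder, not the real template.
    """
    lines = [f"Come up with {batch_size} diverse task instructions.", ""]
    for index, seed in enumerate(seed_instructions, start=1):
        lines.append(f"{index}. {seed}")
    # Leave the next number open so the model continues the list.
    lines.append(f"{len(seed_instructions) + 1}.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(requests_needed(52000))      # batched: 20 instructions per request
    print(requests_needed(52000, 1))   # unbatched: one instruction per request
```

Under these assumptions, batching divides the number of requests by the batch size, which is where the saving comes from.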

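Supported Tasks and Leaderboards

Task: instruction training for pretrained language models.

Languages

Language: English (BCP-47 en).

Dataset Structure

Data Instances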

Example (JSON):

{
  "instruction": "Create a classification task by clustering the given list of items.",
  "input": "Apples, oranges, bananas, strawberries, pineapples",
  "output": "Class 1: Apples, Oranges Class 2: Bananas, Strawberries Class 3: Pineapples",
  "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\nInstruction:\nCreate a classification task by clustering the given list of items.\n\nInput:\nApples, oranges, bananas, strawberries, pineapples\n\nResponse:\nClass 1: Apples, Oranges Class 2: Bananas, Strawberries Class 3: Pineapples"
}

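Data Fields

instruction: describes the task the model should perform; each instruction is unique.
input: optional context or input for the task; about 40% of the examples have an input.
output: the answer to the instruction, as generated by text-davinci-003.
text: the instruction, input, and output formatted with the prompt template the authors used to fine-tune their models.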
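As a rough illustration of how the text field relates to the other three, the sketch below assembles a prompt from instruction, input, and output. The section labels are copied as they appear in the example on this page. The upstream template may decorate them differently (for instance with a `### ` prefix), so the exact strings are assumptions. The no-input preamble is likewise assumed from the upstream Stanford Alpaca template and is not shown on this page. `format_record` is a hypothetical helper name.

```python
from typing import Dict

# Template for records that have a non-empty input (wording from the example above).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "Instruction:\n{instruction}\n\n"
    "Input:\n{input}\n\n"
    "Response:\n{output}"
)

# Assumption: records without input use a shorter preamble and no input section.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "Instruction:\n{instruction}\n\n"
    "Response:\n{output}"
)


def format_record(record: Dict[str, str]) -> str:
    """Build the formatted text for one record, choosing a template by input."""
    has_input = bool(record.get("input", "").strip())
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    return template.format(
        instruction=record["instruction"],
        input=record.get("input", ""),
        output=record["output"],
    )
```

Choosing the template by whether input is empty keeps the prompt free of a dangling, empty input section for the majority of records that have no input.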

Data Splits

Train: 52,002 examples.
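A minimal sketch of loading the train split and checking that records carry the four documented fields is given below. It assumes the Hugging Face `datasets` library is installed and that the repository id from the download link resolves on the Hub. Neither has been verified here. `load_train_split` and `missing_fields` are hypothetical helper names.

```python
from typing import Dict, Set

# Repository id taken from the download link on this page.
REPO_ID = "netmouse/PromisedChat_Instruction"
# Field names as documented in the data fields section.
EXPECTED_FIELDS = {"instruction", "input", "output", "text"}


def load_train_split(repo_id: str = REPO_ID):
    """Download the train split; needs the `datasets` package and network access."""
    # Imported lazily so this module can be imported without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train")


def missing_fields(record: Dict[str, str]) -> Set[str]:
    """Return the documented fields that are absent from a record."""
    return EXPECTED_FIELDS - set(record)
```

A typical use would be to call `load_train_split()`, compare `len(...)` of the result with the split size stated above, and run `missing_fields` over a few records as a sanity check.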

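Dataset Creation

Source Data

Initial data collection and normalization: no details provided.
Source language producers: no details provided.

Annotations

Annotation process: no details provided.
Annotators: no details provided.

Personal and Sensitive Information

Information: no details provided.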

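Considerations for Using the Data

Social Impact of the Dataset

Risk: releasing the training recipe shows that certain capabilities are feasible, which could be exploited for malicious purposes.
Mitigations: a content filter and watermarking of model outputs were implemented, and use is restricted to non-commercial purposes.

Discussion of Biases

Biases: no details provided.

Other Known Limitations

Limitations: the data was generated by a language model and inevitably contains some errors or biases. Users are advised to use it with caution and to propose methods for improving it.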

Additional Information

Dataset Curators

Curators: no details provided.

Licensing Information

License: Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).

Citation Information

@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}

Contributions

Contributions: no details provided.

The content above was collected and summarized by 遇见数据集.