papaya523/guanjian_anli_test1

Name: papaya523/guanjian_anli_test1
Creator: papaya523
Published: 2024-07-11 13:42:09
License: No description available

Hugging Face. Updated 2024-07-11; indexed 2024-07-13.

Download link:

https://hf-mirror.com/datasets/papaya523/guanjian_anli_test1


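Resource overview:

Alpaca-Cleaned is a cleaned version of the original Alpaca dataset released by Stanford University. It fixes a number of problems in the original data: hallucinations, merged instructions, empty outputs, empty code examples, instructions to generate images, N/A outputs, inconsistent input fields, wrong answers, unclear instructions, and extraneous escape and control characters. The dataset contains roughly 52,000 instructions and demonstrations for instruction-tuning language models so that they follow instructions better. The instruction data was generated with OpenAI's `text-davinci-003` engine, with several modifications that lowered cost and increased data diversity. Each record has instruction, input, output, and formatted text fields, and the data comes as a single training split of 52,002 instances.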

The Alpaca-Cleaned dataset is a refined version of the original Alpaca dataset released by Stanford. It addresses several issues found in the original release, such as hallucinations, merged instructions, empty outputs, and inconsistent input fields. Designed for instruction-tuning of language models, it contains 52,000 unique instructions generated by OpenAI's `text-davinci-003` engine. Each data instance has fields for instruction, input, output, and formatted text. The dataset is in English and is licensed under CC BY-NC 4.0.

Provided by:

papaya523

Original information summary

Dataset card: Alpaca-Cleaned

Dataset description

Overview

Name: Alpaca-Cleaned
Language: English (BCP-47 en)
Task category: Text generation
Tags: instruction-finetuning
License: CC BY-4.0

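Dataset source

Original dataset: The Alpaca dataset released by Stanford University.
Cleaned version: Corrects multiple problems in the original dataset, including hallucinations, merged instructions, empty outputs, empty code examples, instructions to generate images, N/A outputs, inconsistent input fields, wrong answers, nonsensical or unclear instructions, and extraneous escape and control characters.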
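A few of the issues listed above (empty outputs, N/A outputs, stray control characters) can be detected mechanically. The following is a minimal sketch of such checks, not the procedure the dataset maintainers used, which this page does not describe. The function name `flag_record` and the flag names are hypothetical, chosen for illustration.

```python
import re

# Control characters other than tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Outputs that consist only of a placeholder such as "N/A" or "n/a.".
_NA_OUTPUT = re.compile(r"^\s*n/?a\.?\s*$", re.IGNORECASE)


def flag_record(record: dict) -> list[str]:
    """Return the names of the heuristic checks that a record fails.

    The flag names are hypothetical labels, not part of the dataset.
    """
    flags = []
    output = record.get("output", "")
    if not output.strip():
        flags.append("empty_output")
    elif _NA_OUTPUT.match(output):
        flags.append("na_output")
    fields = (record.get(key, "") for key in ("instruction", "input", "output"))
    if any(_CONTROL_CHARS.search(value) for value in fields):
        flags.append("control_chars")
    return flags
```

Issues such as hallucinations or wrong answers cannot be caught with pattern checks like these and would need manual or model-assisted review.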

Dataset structure

Data instances

Example:

    {
      "instruction": "Create a classification task by clustering the given list of items.",
      "input": "Apples, oranges, bananas, strawberries, pineapples",
      "output": "Class 1: Apples, Oranges Class 2: Bananas, Strawberries Class 3: Pineapples",
      "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges Class 2: Bananas, Strawberries Class 3: Pineapples"
    }
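The `text` field in the example can be rebuilt from the other three fields. Below is a minimal sketch of that assembly for records that have an input. It assumes the section headers carry a `### ` prefix, as in the Stanford Alpaca prompt template; adjust the template if your copy of the data differs. The helper name `build_text` is hypothetical, and records with an empty input use a different preamble that is not shown on this page, so they are rejected here.

```python
# Template for records that include an input. The "### " header prefix is an
# assumption based on the Stanford Alpaca prompt format.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def build_text(record: dict) -> str:
    """Assemble the formatted `text` string for a record with a non-empty input."""
    if not record["input"].strip():
        raise ValueError("this sketch only covers records that have an input")
    return PROMPT_WITH_INPUT.format(
        instruction=record["instruction"],
        input=record["input"],
        output=record["output"],
    )
```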

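Data fields

instruction: Describes the task the model should perform. Each instruction is unique.
input: Optional context or input for the task. About 40% of the examples include an input.
output: The answer to the instruction, generated by text-davinci-003.
text: The instruction, input, and output formatted into a single string, used for fine-tuning.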
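The field descriptions above translate into a simple record check. The sketch below assumes every field is stored as a string and that a missing input is represented by an empty string; neither detail is stated on this page. The names `validate_record` and `share_with_input` are hypothetical helpers.

```python
REQUIRED_FIELDS = ("instruction", "input", "output", "text")


def validate_record(record: dict) -> None:
    """Raise if a record lacks a field or holds a non-string value."""
    for field in REQUIRED_FIELDS:
        if field not in record:
            raise KeyError(f"missing field: {field}")
        if not isinstance(record[field], str):
            raise TypeError(f"field {field!r} must be a string")
    if not record["instruction"].strip():
        raise ValueError("instruction must not be empty")


def share_with_input(records) -> float:
    """Return the fraction of records whose `input` field is non-empty."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r["input"].strip()) / len(records)
```

Running `share_with_input` over the full training split is one way to check the "about 40%" figure quoted for the `input` field.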

Data splits

Training set: 52,002 examples
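A minimal sketch of loading the training split with the Hugging Face `datasets` library follows. It assumes the repository is still publicly available under the ID shown on this page and that `datasets` is installed; the wrapper name `load_train_split` is hypothetical.

```python
REPO_ID = "papaya523/guanjian_anli_test1"  # repository ID as listed on this page


def load_train_split(repo_id: str = REPO_ID):
    """Download (or read from cache) the `train` split of the dataset."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train")
```

Calling `len(load_train_split())` lets you compare the actual row count with the 52,002 figure reported here. To go through the hf-mirror.com mirror instead of the main hub, setting the `HF_ENDPOINT` environment variable to `https://hf-mirror.com` before importing the library should work, though that depends on the mirror remaining compatible.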

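Dataset creation

Original dataset generation: Generated with OpenAI's text-davinci-003 engine, using a modified version of the Self-Instruct framework.
Generation method: More aggressive batch decoding, producing 20 instructions at a time, which significantly reduced the cost of data generation.
Data diversity: The generated 52K examples are more diverse than the data released by Self-Instruct.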
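To give a rough sense of scale for the batching described above, the sketch below computes the minimum number of generation requests needed to reach the reported dataset size at 20 instructions per request. It is only a lower bound: it assumes every generated instruction was kept, which this page does not claim, and the function name `min_requests` is hypothetical.

```python
import math

TOTAL_EXAMPLES = 52_002  # size of the training split reported on this page
BATCH_SIZE = 20          # instructions generated per request, per this page


def min_requests(total: int = TOTAL_EXAMPLES, batch_size: int = BATCH_SIZE) -> int:
    """Lower bound on the number of requests, assuming no output is discarded."""
    return math.ceil(total / batch_size)


print(min_requests())  # 2601
```

Generating one instruction per request would need at least 52,002 requests under the same assumption, which is where the cost saving comes from.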

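Considerations for use

Social impact: The dataset was released to support academic research on instruction-following language models, but it also carries risks, such as misuse of models trained on it.
Known limitations: The data was generated by a language model and inevitably contains some errors and biases. Users are advised to use it with caution and to propose ways to improve it.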

License information

License: Creative Commons NonCommercial (CC BY-NC 4.0)

Citation information

    @misc{alpaca,
      author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
      title = {Stanford Alpaca: An Instruction-following LLaMA model},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
    }
