mrm8488/unnatural-instructions-core

Hugging Face | Updated 2022-12-21 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/mrm8488/unnatural-instructions-core

Resource description:

---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: instances
    list:
    - name: instruction_with_input
      dtype: string
    - name: input
      dtype: string
    - name: constraints
      dtype: string
    - name: output
      dtype: string
  splits:
  - name: train
    num_bytes: 54668900
    num_examples: 66010
  download_size: 28584196
  dataset_size: 54668900
---

# Dataset Card for Unnatural Instructions (Core data)

This info comes from the **Unnatural Instructions GitHub [repo](https://github.com/orhonovich/unnatural-instructions/)**.

Unnatural Instructions is a dataset of instructions automatically generated by a Large Language model.
See full details in the paper: "[Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor](https://arxiv.org/abs/2212.09689)"

## 🗃️ Content

The Unnatural Instructions core dataset of 68,478 instruction-input-output triplets.

## 📄 Format

### Core data

Each example contains:
- `input`: An input for the task described by the `instruction`
- `instruction_with_input`: The instruction concatenated with the `input`
- `constraints`: The task's output space constraints
- `output`: The output of executing `instruction` with the given `input`

## 📘 Citation

If you make use of Unnatural Instructions, please cite the following paper:

```
@misc{honovich2022unnatural,
  title = {Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor},
  author = {Honovich, Or and Scialom, Thomas and Levy, Omer and Schick, Timo},
  url = {https://arxiv.org/abs/2212.09689},
  publisher = {arXiv},
  year={2022}
}
```

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

Provider:

mrm8488

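Summary of original information

Dataset card: Unnatural Instructions (core data)

Dataset information

Features:
- instruction: string
- instances: list, with the following sub-fields:
  - instruction_with_input: string
  - input: string
  - constraints: string
  - output: string
Splits:
- train:
  - Size in bytes: 54668900
  - Number of examples: 66010
Download size: 28584196 bytes
Dataset size: 54668900 bytes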
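
The feature list above can be written down as Python types. The following is a minimal sketch, assuming each row arrives as a plain dictionary with exactly the fields listed; the class names, the `is_valid_record` helper, and the example record are hypothetical and invented for illustration, not taken from the dataset.

```python
from typing import List, TypedDict


class Instance(TypedDict):
    instruction_with_input: str
    input: str
    constraints: str
    output: str


class Record(TypedDict):
    instruction: str
    instances: List[Instance]


INSTANCE_KEYS = ("instruction_with_input", "input", "constraints", "output")


def is_valid_record(obj: object) -> bool:
    """Return True if obj has the shape given in the dataset info."""
    if not isinstance(obj, dict):
        return False
    if not isinstance(obj.get("instruction"), str):
        return False
    instances = obj.get("instances")
    if not isinstance(instances, list):
        return False
    for inst in instances:
        if not isinstance(inst, dict):
            return False
        # Every sub-field listed in the schema must be present and a string.
        if not all(isinstance(inst.get(key), str) for key in INSTANCE_KEYS):
            return False
    return True


# Hypothetical record, made up for illustration only.
example_record: Record = {
    "instruction": "Decide whether the sentence is a question.",
    "instances": [
        {
            "instruction_with_input": (
                "Decide whether the sentence is a question.\n"
                "Sentence: Where is the station?"
            ),
            "input": "Sentence: Where is the station?",
            "constraints": "The output should be 'Yes' or 'No'.",
            "output": "Yes",
        }
    ],
}
```

A check like this is mainly useful as a guard before training code starts indexing into `instances`.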

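Content

The Unnatural Instructions core dataset contains 68,478 instruction-input-output triplets.

Format

Core data

Each example contains:

input: an input for the task described by the instruction
instruction_with_input: the instruction concatenated with the input
constraints: the task's output space constraints
output: the output of executing the instruction with the given input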
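
Because the examples are nested under each record's `instances` list, the number of triplets need not equal the number of records. A minimal sketch of counting triplets, assuming rows are dictionaries shaped as described above; the `count_triplets` helper and the toy rows are hypothetical, and whether this count reproduces the 68,478 figure on the real data has not been verified here.

```python
from typing import Any, Dict, Iterable


def count_triplets(records: Iterable[Dict[str, Any]]) -> int:
    """Count instruction-input-output triplets by summing instances per record."""
    return sum(len(record["instances"]) for record in records)


# Toy rows invented for illustration; the real field values will differ.
toy_rows = [
    {"instruction": "A", "instances": [{"output": "1"}, {"output": "2"}]},
    {"instruction": "B", "instances": [{"output": "3"}]},
]
triplet_count = count_triplets(toy_rows)
```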

Citation

If you use the Unnatural Instructions dataset, please cite the following paper:

@misc{honovich2022unnatural,
  title = {Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor},
  author = {Honovich, Or and Scialom, Thomas and Levy, Omer and Schick, Timo},
  url = {https://arxiv.org/abs/2212.09689},
  publisher = {arXiv},
  year={2022}
}

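Collected summary

Dataset introduction

Construction

In natural language processing, automatically generating high-quality instruction data is key to improving model generalization. The Unnatural Instructions core dataset consists of instruction-input-output triplets generated automatically by a large language model, a process that avoids the heavy labor of traditional manual annotation. The method first uses a small number of human-written seed instructions to guide the model toward diverse task descriptions, then scales up the data through bootstrapped iteration, yielding an instruction-following dataset of 68,478 examples. This automated construction substantially reduces the cost of data collection while retaining a wide range of task types and semantic complexity.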

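Characteristics

The dataset's main strengths for instruction-following work are its structured organization and its task diversity. Each record pairs a top-level instruction with a list of instances, and each instance carries four fields: the input, the instruction with the input, the output-space constraints, and the corresponding output. Together these give a model the full context for carrying out a task. The data covers several task types, including open-ended generation and constrained output, and the instruction language mixes natural phrasing with artificially constructed wording. This design keeps the fluency of natural language while adding a controllable element through the constraints, which makes the dataset a useful testbed for studying instruction understanding and conditional generation.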

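Usage

Researchers can use the dataset for instruction tuning, model alignment, and evaluating task generalization. Start by loading the core training set with the Hugging Face datasets library; note that the data ships as a single train split. A typical workflow uses instruction_with_input as the model input and output as the target sequence for supervised training, or uses the constraints field to study constrained generation. Cross-validation is recommended for evaluating performance on unseen instructions, and comparing the original instruction with instruction_with_input can help reveal how input context affects instruction understanding.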
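
The workflow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference pipeline: `load_core` uses `load_dataset` from the Hugging Face `datasets` library with the repository ID from the download link and needs network access, while `to_pairs`, `split_by_instruction`, and the toy records are hypothetical helpers and data invented for illustration. The split shown is a single hold-out by instruction, which is simpler than full cross-validation.

```python
import random
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def load_core():
    """Load the single train split (requires the `datasets` package and network)."""
    from datasets import load_dataset  # lazy import so the helpers below work offline
    return load_dataset("mrm8488/unnatural-instructions-core", split="train")


def to_pairs(records: Iterable[Row]) -> List[Tuple[str, str]]:
    """Flatten records into (model input, target) pairs for supervised training."""
    pairs = []
    for record in records:
        for inst in record["instances"]:
            pairs.append((inst["instruction_with_input"], inst["output"]))
    return pairs


def split_by_instruction(
    records: Iterable[Row], held_out_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Hold out whole instructions so evaluation sees only unseen instructions."""
    records = list(records)
    instructions = sorted({r["instruction"] for r in records})
    random.Random(seed).shuffle(instructions)
    n_held = int(len(instructions) * held_out_fraction)
    if instructions and n_held == 0:
        n_held = 1  # always hold out at least one instruction
    held = set(instructions[:n_held])
    train = [r for r in records if r["instruction"] not in held]
    held_out = [r for r in records if r["instruction"] in held]
    return train, held_out


# Toy records invented for illustration only.
toy_records = [
    {
        "instruction": f"Task {i}",
        "instances": [
            {
                "instruction_with_input": f"Task {i}\nInput: x{i}",
                "input": f"x{i}",
                "constraints": "None",
                "output": f"y{i}",
            }
        ],
    }
    for i in range(4)
]
train_records, held_out_records = split_by_instruction(toy_records, 0.25, seed=0)
train_pairs = to_pairs(train_records)
```

On the real data, `to_pairs(load_core())` would give the training pairs. Splitting on the instruction string rather than on individual instances keeps every instance of a held-out instruction out of training.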

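Background and challenges

Background

In natural language processing, instruction tuning is a key technique for improving how well large language models adapt to downstream tasks. In 2022, Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick introduced the Unnatural Instructions dataset, an important advance in automated instruction generation. The dataset uses a large language model to generate 68,478 instruction-input-output triplets, with the aim of enabling effective fine-tuning at very low human cost. It has advanced research on few-shot and zero-shot learning and has influenced the development of instruction-following models.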

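Current challenges

The dataset targets a core problem in instruction following, the scarcity of high-quality annotated data, and building it involves several difficulties. Automatically generated instructions must balance diversity against complexity so that they cover a broad range of language understanding and generation scenarios. The generated data must also stay logically consistent and feasible as tasks, without introducing noise or contradictions. In addition, defining output-space constraints that steer the model toward the expected responses is a key difficulty in the construction process. Together, these challenges test the reliability and generality of automated data generation methods.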

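Common scenarios

Typical use cases

In natural language processing, instruction tuning has become a key technique for improving the generalization of large language models. By automatically generating a large number of instruction-input-output triplets, the Unnatural Instructions dataset gives models a rich supervision signal. Its typical use is training models to understand and carry out diverse, unnatural instructions, which strengthens task adaptability in zero-shot and few-shot settings. By approximating the complexity and variety of human instructions, the dataset helps models learn the mapping from an abstract task description to its concrete execution, and it provides a data foundation for developing instruction-following models.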

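Derived and related work

Researchers have built a number of follow-up works on the Unnatural Instructions dataset, extending the range of its applications. For example, later studies have explored strategies for improving the quality of generated data, such as using reinforcement learning or adversarial training to improve the diversity and realism of instructions. The dataset has also prompted closer analysis of how instructions generalize, contributing to work on cross-task transfer learning and multimodal instruction understanding, and it supports the building of more general and more capable language models.

Recent research on the dataset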