snorkelai/snorkel-curated-instruction-tuning

Name: snorkelai/snorkel-curated-instruction-tuning
Creator: snorkelai
Published: 2024-03-11 18:26:46
License: Apache 2.0

Hugging Face | Updated 2024-03-11 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/snorkelai/snorkel-curated-instruction-tuning

Description:

---
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- en
size_categories:
- 10K<n<100K
---

***<p style="font-size: 20px">Please check out our Blog Post - [How we built a better GenAI with programmatic data development](https://snorkel.ai/how-we-built-better-genai-with-programmatic-data-development/) for more details!</p>***

## Summary

`snorkel-curated-instruction-tuning` is a curated dataset of high-quality instruction-response pairs. These pairs were programmatically filtered with weak supervision from the open-source datasets [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k), [Open Assistant](https://huggingface.co/datasets/OpenAssistant/oasst1), and [Helpful Instructions](https://huggingface.co/datasets/HuggingFaceH4/helpful_instructions). To enhance the dataset, we also programmatically classified each instruction based on the InstructGPT paper.

For a more comprehensive understanding of our methodology, please visit our [blog](https://snorkel.ai/how-we-built-better-genai-with-programmatic-data-development/).

## Dataset Overview & Methodology

Instruction tuning is an important step in developing effective [large language models (LLMs)](https://snorkel.ai/large-language-models-llms/) for generative AI tasks. While proprietary datasets have been used by LLM-backed chatbots, the open-source community has created similar datasets accessible to everyone. However, the quality of responses collected by volunteers has been inconsistent, which affects the quality of open-source models. Furthermore, there is currently no standard classification of instructions across datasets (many lack classification altogether), which complicates measuring instruction diversity when compiling data from multiple sources.

Snorkel, with its expertise in converting noisy signals into high-quality supervision, addressed these issues by programmatically scoring, sampling, and filtering open-source datasets. The curated dataset and methodology are now available for public use. Please refer to our [blog](https://snorkel.ai/how-we-built-better-genai-with-programmatic-data-development/) for more details on methods and evaluation.

## File descriptions

- `snorkel_curated_11k.jsonl`: 11k high-quality instruction-response pairs selected from the open-source datasets listed above. This file was used to instruction-tune [snorkelai/RedPajama-7B-Chat-Curated](https://huggingface.co/snorkelai/RedPajama-7B-Chat-Curated/).
- `snorkel_hold_out_set.jsonl`: A hold-out set for evaluation, used to compare human preferences between models.

## Intended Uses

- Instruction-tuning LLMs

For more detailed information, please refer to our blog post, [How we built a better GenAI with programmatic data development](https://snorkel.ai/how-we-built-better-genai-with-programmatic-data-development/).

## License/Attribution

**Copyright (2023) Snorkel AI, Inc.** This dataset was developed at [Snorkel AI](https://snorkel.ai/) and its use is subject to the Apache 2.0 license.

This work was done in collaboration with Together Computer to release the [snorkelai/RedPajama-7B-Chat-Curated](https://huggingface.co/snorkelai/RedPajama-7B-Chat-Curated/) model.

Please refer to the licenses of the data subsets you use.

- [Open Assistant](https://huggingface.co/datasets/OpenAssistant/oasst1) is under the Apache 2.0 license.
- [Helpful Instructions](https://huggingface.co/datasets/HuggingFaceH4/helpful_instructions) is under the Apache 2.0 license.
- [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) is under the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

- Wikipedia (various pages) - https://www.wikipedia.org/ Copyright © Wikipedia editors and contributors.
- Databricks (https://www.databricks.com) Copyright © Databricks

## Language

English

## Version

Version: 1.0

To cite this dataset, please use:

```
@software{snorkel2023instructiontuning,
  author = {Snorkel AI},
  title = {Applying programmatic data development to Generative AI with Snorkel},
  month = June,
  year = 2023,
  url = {https://huggingface.co/datasets/snorkelai/snorkel-curated-instruction-tuning}
}
```

**Owner: Snorkel AI, Inc.**

## Community

Join us on [Snorkel AI Slack](https://snorkel.ai/slack)

Provider:

snorkelai

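Summary of original information

Dataset summary

snorkel-curated-instruction-tuning is a curated dataset of high-quality instruction-response pairs. The pairs were programmatically filtered with weak supervision from the open-source datasets Databricks Dolly-15k, Open Assistant, and Helpful Instructions.

To enhance the dataset, each instruction was also programmatically classified based on the InstructGPT paper.

Dataset overview and methodology

Instruction tuning is an important step in developing effective large language models (LLMs) for generative AI tasks. Proprietary datasets have been used by LLM-backed chatbots, and the open-source community has created similar datasets that anyone can access. However, the quality of volunteer-collected responses is inconsistent, which affects the quality of open-source models. In addition, there is no standard classification of instructions across datasets, which complicates measuring instruction diversity when compiling data from multiple sources. Snorkel addressed these issues by programmatically scoring, sampling, and filtering open-source datasets. The curated dataset and methodology are now publicly available.

File descriptions

snorkel_curated_11k.jsonl: 11k high-quality instruction-response pairs selected from the open-source datasets listed above; used to instruction-tune snorkelai/RedPajama-7B-Chat-Curated.
snorkel_hold_out_set.jsonl: A hold-out set for evaluation, used to compare human preferences between models.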
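
Both files use the JSONL format, where each line is one JSON object. As a minimal sketch of reading such a file with only the Python standard library (the helper name `iter_jsonl` and the field names `instruction` and `response` are assumptions for illustration, not confirmed by the dataset card):

```python
import io
import json
from typing import Dict, Iterator, TextIO


def iter_jsonl(handle: TextIO) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err


# Tiny in-memory stand-in for a JSONL file.
# The field names below are assumed, not taken from the real files.
sample = io.StringIO(
    '{"instruction": "Name a primary color.", "response": "Red."}\n'
    "\n"
    '{"instruction": "What is 2 + 2?", "response": "4"}\n'
)
records = list(iter_jsonl(sample))
print(len(records))
```

For a downloaded copy, the same helper could be applied to `open("snorkel_curated_11k.jsonl", encoding="utf-8")`; inspecting the keys of the first record is a sensible first step, since the actual schema should be checked rather than assumed.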

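Intended uses

Instruction-tuning LLMs

License/Attribution

Copyright (2023) Snorkel AI, Inc. This dataset was developed by Snorkel AI and its use is subject to the Apache 2.0 license.

This work was done in collaboration with Together Computer to release the snorkelai/RedPajama-7B-Chat-Curated model.

Please refer to the licenses of the data subsets you use:

Open Assistant is under the Apache 2.0 license.
Helpful Instructions is under the Apache 2.0 license.
Databricks Dolly-15k is under the CC BY-SA 3.0 license.

Certain categories of material in the dataset come from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) - https://www.wikipedia.org/ Copyright © Wikipedia editors and contributors.
Databricks (https://www.databricks.com) Copyright © Databricks

Language

English

Version

Version: 1.0

To cite this dataset, please use:

@software{snorkel2023instructiontuning, author = {Snorkel AI}, title = {Applying programmatic data development to Generative AI with Snorkel}, month = June, year = 2023, url = {https://huggingface.co/datasets/snorkelai/snorkel-curated-instruction-tuning} }

Owner: Snorkel AI, Inc.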

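Curated summary

Dataset introduction

Construction

In instruction tuning, a high-quality dataset is key to improving large language model performance. This dataset was built with a programmatic data development approach: instruction-response pairs from the open-source datasets Databricks Dolly-15k, Open Assistant, and Helpful Instructions were systematically scored, sampled, and filtered using weak supervision. Each instruction was also automatically classified following the InstructGPT paper, yielding a carefully curated set of instruction-tuning data.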
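
The scoring-and-filtering idea can be illustrated with a toy example. The sketch below is not Snorkel's pipeline: it swaps in a plain majority vote over three hypothetical heuristics (`lf_too_short`, `lf_refusal`, `lf_detailed`), whereas a weak-supervision system would typically learn how to weight such noisy signals. The field names `instruction` and `response` are also assumptions.

```python
from typing import Callable, Dict, List

KEEP, DROP, ABSTAIN = 1, 0, -1
Record = Dict[str, str]
LabelingFunction = Callable[[Record], int]


def lf_too_short(record: Record) -> int:
    """Hypothetical heuristic: vote DROP for responses under three words."""
    return DROP if len(record["response"].split()) < 3 else ABSTAIN


def lf_refusal(record: Record) -> int:
    """Hypothetical heuristic: vote DROP for responses that open with a non-answer."""
    return DROP if record["response"].lower().startswith("i don't know") else ABSTAIN


def lf_detailed(record: Record) -> int:
    """Hypothetical heuristic: vote KEEP for responses of eight or more words."""
    return KEEP if len(record["response"].split()) >= 8 else ABSTAIN


def majority_vote(record: Record, lfs: List[LabelingFunction]) -> int:
    """Combine noisy votes by simple majority; ties and all-abstain give ABSTAIN."""
    votes = [lf(record) for lf in lfs]
    keep, drop = votes.count(KEEP), votes.count(DROP)
    if keep == drop:
        return ABSTAIN
    return KEEP if keep > drop else DROP


def filter_records(records: List[Record], lfs: List[LabelingFunction]) -> List[Record]:
    """Keep only the records that the combined vote labels KEEP."""
    return [r for r in records if majority_vote(r, lfs) == KEEP]


LFS = [lf_too_short, lf_refusal, lf_detailed]
examples = [
    {"instruction": "Say something.", "response": "Ok."},
    {"instruction": "Who won?", "response": "I don't know."},
    {
        "instruction": "Explain photosynthesis.",
        "response": "Photosynthesis converts light, water, and carbon dioxide "
                    "into sugar and oxygen.",
    },
]
kept = filter_records(examples, LFS)
print([r["instruction"] for r in kept])
```

Records on which every heuristic abstains are discarded here, which is a conservative design choice for this illustration; a real pipeline might instead score them with a model or route them for review.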

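Features

The dataset's core feature is its programmatically filtered, high-quality instruction-response pairs, which address the inconsistent data quality found in open-source community datasets. It provides a consistent classification of instructions, making instruction diversity easier to measure when combining multiple data sources. It contains about 11,000 curated examples plus a separate hold-out evaluation set, so it can be used directly for model training and for human-preference comparisons between models.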
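
One common way to quantify instruction diversity, once every instruction carries a category label, is the Shannon entropy of the category distribution. The sketch below assumes the labels are available as a list of strings; the function name `category_entropy` and the example labels are illustrative only, and the source does not state which diversity measure was actually used.

```python
import math
from collections import Counter
from typing import Iterable


def category_entropy(categories: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the category distribution.

    Higher values mean instructions are spread more evenly across categories;
    0.0 means every instruction falls into a single category.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative labels only; the real category names should be read from the data.
balanced = ["generation", "generation", "open_qa", "open_qa"]
skewed = ["generation", "generation", "generation", "generation"]
print(category_entropy(balanced), category_entropy(skewed))
```

Comparing this value before and after merging sources gives a rough indication of whether the combined set covers instruction types more evenly.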

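Usage

The dataset is designed for instruction-tuning large language models. Researchers can load the provided JSONL files directly and use them to train or fine-tune generative models, with the aim of improving how well a model follows instructions and the quality of its responses. The hold-out evaluation set can be used to compare different models on human preference, providing empirical evidence to guide model improvement.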
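
Before fine-tuning, each instruction-response pair is usually rendered into a single training string. The sketch below shows one possible rendering; the template text, the function name `format_example`, and the field names `instruction` and `response` are all assumptions, and the template actually used for RedPajama-7B-Chat-Curated is not given in the source.

```python
from typing import Dict

# Hypothetical prompt template; replace it with the format your base model expects.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_example(record: Dict[str, str]) -> str:
    """Render one instruction-response record as a single training string."""
    return PROMPT_TEMPLATE.format(
        instruction=record["instruction"].strip(),
        response=record["response"].strip(),
    )


example = {"instruction": " Say hi. ", "response": "Hi!\n"}
text = format_example(example)
print(text)
```

Stripping surrounding whitespace keeps the template boundaries consistent across records, which matters when the same markers are later used to split prompts from completions.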

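Background and challenges

Background

In generative AI, instruction tuning is a key step in adapting large language models to tasks. In 2023, Snorkel AI released the snorkel-curated-instruction-tuning dataset to address the inconsistent quality of open-source instruction data. The dataset programmatically selects high-quality instruction-response pairs from open-source datasets (Databricks Dolly-15k, Open Assistant, and Helpful Instructions) and classifies the instructions based on the InstructGPT paper, with the goal of helping models generalize across diverse tasks. The dataset was also used to instruction-tune the RedPajama-7B-Chat-Curated model.

Current challenges

The dataset targets two core challenges in instruction tuning. First, response quality in open-source instruction data is inconsistent, which directly affects the reliability and accuracy of model output. Second, instructions lack a unified classification standard, which makes instruction diversity hard to quantify when integrating data from multiple sources. During construction, the main challenges were extracting high-quality samples from noisy data with weak supervision and designing a programmatic pipeline for automated scoring, sampling, and filtering, so that the final dataset is both clean and representative.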

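Common scenarios

Typical use case

In generative AI, instruction tuning is a key step in improving large language model performance. Through programmatic filtering and classification, snorkel-curated-instruction-tuning extracts high-quality instruction-response pairs from several open-source datasets and offers researchers a consistently labeled training resource. Its most typical use is instruction-tuning open-source large language models so that they better understand and follow human instructions and produce more accurate, coherent responses.

Research problems addressed

The dataset addresses two problems common in open-source instruction datasets: inconsistent quality and the absence of a classification standard. Weak supervision and programmatic filtering improve the overall quality and consistency of the instruction data, and the instruction labels provide a basis for measuring instruction diversity. More broadly, it lowers the barrier to obtaining high-quality training data for the open community and supports reproducible, comparable model evaluation.

Derived work

Work derived from this dataset includes the snorkelai/RedPajama-7B-Chat-Curated model, which was instruction-tuned on it. Its programmatic data construction methodology also serves as a template for later work that uses weak supervision to improve training data, in line with the data-centric AI approach to generative AI.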

The content above was collected and summarized by 遇见数据集.
