MattiaL/tapir-cleaned-top90

Name: MattiaL/tapir-cleaned-top90
Creator: MattiaL
Published: 2023-05-17 15:07:30
License: cc-by-nc-4.0

Hugging Face · Updated 2023-05-17 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/MattiaL/tapir-cleaned-top90


Resource description:

---
license: cc-by-nc-4.0
language:
- en
tags:
- instruction-finetuning
pretty_name: Tapir-Cleaned
task_categories:
- text-generation
size_categories:
- 10K<n<100K
---

# Dataset Card for Tapir-Cleaned

This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning.

## Tapir Dataset Summary

Tapir is a subset of the larger DAISLab dataset, which comprises 242,480 recipes extracted from the IFTTT platform. After a thorough cleaning process that removed redundant and inconsistent recipes, the refined dataset was condensed to 32,403 high-quality recipes. This curated set of instruction data is intended for instruction-tuning language models so that they follow instructions more accurately.

The latest version of Tapir includes a correlation score that helps to identify the most appropriate description-rule pairs for instruction tuning. Description-rule pairs with a score greater than 0.75 are deemed good enough and are prioritized for further analysis and tuning.

### Supported Tasks and Leaderboards

The Tapir dataset is designed for instruction-tuning pretrained language models.

### Languages

The data in Tapir are mainly in English (BCP-47 en).

# Dataset Structure

### Data Instances

```json
{
  "instruction": "From the description of a rule: identify the 'trigger', identify the 'action', write a IF 'trigger' THEN 'action' rule.",
  "input": "If it's raining outside, you'll want some nice warm colors inside!",
  "output": "IF Weather Underground Current condition changes to THEN LIFX Change color of lights",
  "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nFrom the description of a rule: identify the 'trigger', identify the 'action', write a IF 'trigger' THEN 'action' rule.\n\n### Input:\nIf it's raining outside, you'll want some nice warm colors inside!\n\n### Response:\nIF Weather Underground Current condition changes to THEN LIFX Change color of lights"
}
```

### Data Fields

The data fields are as follows:

* `instruction`: describes the task the model should perform.
* `input`: context or input for the task. Each of the 32k inputs is unique.
* `output`: the answer taken from the original Tapir Dataset, formatted as an IFTTT recipe.
* `text`: the `instruction`, `input` and `output` formatted with the [prompt template](https://github.com/tatsu-lab/stanford_alpaca#data-release) used by the authors of Alpaca for fine-tuning their models.

### Data Splits

|       | train |
|-------|------:|
| tapir | 32403 |

### Licensing Information

The dataset is available under the [Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode).

### Citation Information

```
@misc{tapir,
  author = {Mattia Limone, Gaetano Cimino, Annunziata Elefante},
  title = {TAPIR: Trigger Action Platform for Information Retrieval},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/MattiaLimone/ifttt_recommendation_system}},
}
```
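As a minimal sketch of how the `text` field relates to the other three fields, the following Python snippet rebuilds it from `instruction`, `input`, and `output`. The template string is copied from the data instance in the card; the function name `build_text` is hypothetical, and the single space between the two opening sentences is an assumption, since the example breaks across lines at that point.

```python
# Alpaca-style template for examples that carry an input, copied from the
# `text` value of the data instance in the card. The single space after
# "further context." is an assumption.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def build_text(example: dict) -> str:
    """Format one record's instruction, input and output into a single string.

    `build_text` is a hypothetical helper, not part of the dataset tooling.
    Extra keys in `example` (such as an existing `text`) are ignored.
    """
    return PROMPT_WITH_INPUT.format(**example)


example = {
    "instruction": (
        "From the description of a rule: identify the 'trigger', identify "
        "the 'action', write a IF 'trigger' THEN 'action' rule."
    ),
    "input": "If it's raining outside, you'll want some nice warm colors inside!",
    "output": (
        "IF Weather Underground Current condition changes to "
        "THEN LIFX Change color of lights"
    ),
}
print(build_text(example))
```

Since the dataset already ships a `text` column, a helper like this is mainly useful for checking records or for formatting new description-rule pairs in the same way.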


Provider:

MattiaL

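Summary of original information

Tapir-Cleaned dataset overview

Dataset description

Tapir-Cleaned is a revised version of the DAISLab dataset that has been thoroughly cleaned, scored, and adjusted for instruction-tuning. It derives from 242,480 recipes taken from the IFTTT platform; cleaning reduced these to 32,403 high-quality recipes.

Dataset purpose

The Tapir-Cleaned dataset is mainly used for instruction-tuning pretrained language models, to improve how well they follow instructions.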

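Dataset features

Data cleaning: redundant and inconsistent recipes were removed from the raw data.
Quality scoring: the dataset includes a correlation score; description-rule pairs scoring above 0.75 are considered high quality and suitable for further analysis and tuning.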
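The score-based selection described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the source does not name the column that holds the correlation score, so the key `score` and the toy rows below are hypothetical; only the 0.75 threshold comes from the dataset card.

```python
SCORE_THRESHOLD = 0.75  # threshold stated in the dataset card


def select_high_quality(rows, score_key="score", threshold=SCORE_THRESHOLD):
    """Keep rows whose correlation score is strictly greater than the threshold.

    `score_key` is a hypothetical column name; check the actual dataset
    schema before relying on it.
    """
    return [row for row in rows if row[score_key] > threshold]


# Toy rows with made-up scores, for illustration only.
rows = [
    {"input": "description A", "score": 0.91},
    {"input": "description B", "score": 0.75},
    {"input": "description C", "score": 0.40},
]
print(select_high_quality(rows))
```

The comparison is strict because the card says "greater than 0.75", so a row scoring exactly 0.75 is not selected.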

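Dataset structure

Data instances

Each data instance contains the following fields:

instruction: describes the task the model should perform.
input: the context or input for the task; each input is unique.
output: the answer taken from the original Tapir dataset, formatted as an IFTTT recipe.
text: the instruction, input, and output formatted with the prompt template from the Alpaca authors' data release.

Data splits

Training set: 32,403 instances.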

Language

The dataset is mainly in English (BCP-47 en).

License

The dataset is provided under the Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0).

Citation information

@misc{tapir,
  author = {Mattia Limone, Gaetano Cimino, Annunziata Elefante},
  title = {TAPIR: Trigger Action Platform for Information Retrieval},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/MattiaLimone/ifttt_recommendation_system}},
}

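Collected summary

Dataset introduction

Construction

In the field of automated rule mining, the Tapir-Cleaned dataset is a systematic distillation of the raw recipes from the IFTTT platform. Construction started from the more than 240,000 initial rules in the DAISLab dataset; a cleaning process removed redundant and inconsistent entries, leaving 32,403 high-quality recipes. Each description-rule pair carries a correlation score, and pairs scoring above 0.75 are deemed good enough and prioritized for instruction tuning.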

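Usage

Researchers can apply the dataset directly to training the instruction-following ability of language models. There are two typical approaches: fine-tune end to end on the text field as the complete input, or build a supervised training task from the separate instruction, input, and output fields. The dataset suits a range of text-generation architectures and targets a model's ability to parse conditional instructions into rules. Users should note the non-commercial license restriction. It is also advisable to prioritize training samples by correlation score, which may help convergence.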
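For the supervised variant mentioned above, one possible approach is to split each record's text field into a prompt and a target completion at the response marker. The sketch below assumes the marker `### Response:\n` seen in the dataset card's example; the function name `split_prompt_completion` is hypothetical.

```python
RESPONSE_MARKER = "### Response:\n"  # marker as it appears in the card's example


def split_prompt_completion(text: str) -> tuple[str, str]:
    """Split an Alpaca-formatted string into (prompt, completion).

    The prompt keeps the response marker so that a model is asked to
    continue right after it. Raises ValueError if the marker is missing.
    The split happens at the first occurrence of the marker.
    """
    prompt, marker, completion = text.partition(RESPONSE_MARKER)
    if not marker:
        raise ValueError("response marker not found in text")
    return prompt + marker, completion


sample = (
    "### Instruction:\nWrite a IF 'trigger' THEN 'action' rule.\n\n"
    "### Input:\nIf it's raining outside, you'll want some nice warm colors inside!\n\n"
    "### Response:\nIF Weather Underground Current condition changes to "
    "THEN LIFX Change color of lights"
)
prompt, completion = split_prompt_completion(sample)
print(repr(completion))
```

Keeping prompt and completion apart makes it possible to compute the training loss on the completion only, which is a common choice for instruction tuning but not something the source prescribes.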

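Background and challenges

Background

In natural language processing, instruction tuning has become a key way to improve how well large language models follow human instructions. The MattiaL/tapir-cleaned-top90 dataset was built in 2023 by Mattia Limone, Gaetano Cimino, and Annunziata Elefante, and it originates from the collection of IFTTT platform rules in the DAISLab dataset. The original 242,480 rules were cleaned and filtered down to 32,403 high-quality description-rule pairs, intended as an instruction-following training resource for language models and as support for automated task planning and information retrieval applications.

Current challenges

The dataset addresses the problem of automated rule generation and understanding: getting a language model to parse a natural-language description and map it to a structured "trigger-action" rule. Construction faced several difficulties. The raw IFTTT data contained many redundant and inconsistent rule entries, so a careful cleaning process was needed to ensure data quality. In addition, assessing the semantic association between a description and its rule required a correlation score, with pairs scoring above 0.75 treated as high-confidence; this places demands on the consistency and robustness of the scoring.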

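Common scenarios

Classic use case

In natural language processing, instruction tuning has become a key technique for improving how well large language models follow human instructions. As a cleaned and scored collection of IFTTT rules, Tapir-Cleaned is classically used as instruction-tuning data for language models. By mapping natural-language descriptions to structured "IF-THEN" rules, it trains a model to interpret user intent and generate a logically consistent automation rule.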
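To make the "IF-THEN" structure concrete, here is a small Python sketch that splits a rule string into its trigger and action parts. It assumes rules follow the `IF <trigger> THEN <action>` shape of the example in the dataset card and splits at the first standalone `THEN`; the name `parse_rule` is hypothetical.

```python
import re

# Non-greedy trigger group, so the split happens at the first " THEN ".
RULE_PATTERN = re.compile(r"^IF\s+(?P<trigger>.+?)\s+THEN\s+(?P<action>.+)$")


def parse_rule(rule: str) -> tuple[str, str]:
    """Return (trigger, action) for a rule of the form IF ... THEN ...

    Raises ValueError when the string does not match that shape.
    """
    match = RULE_PATTERN.match(rule.strip())
    if match is None:
        raise ValueError(f"not an IF/THEN rule: {rule!r}")
    return match.group("trigger"), match.group("action")


trigger, action = parse_rule(
    "IF Weather Underground Current condition changes to "
    "THEN LIFX Change color of lights"
)
print(trigger)  # Weather Underground Current condition changes to
print(action)   # LIFX Change color of lights
```

The parser does not separate the service name (for example "LIFX") from the action name, because the rule string gives no delimiter between them.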

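Academic problems addressed

The dataset targets the scarcity of high-quality, large-scale training data in instruction-tuning research. Instruction data is often noisy and inconsistent; Tapir-Cleaned applies a cleaning process and a correlation score to select samples in which description and rule match closely. It gives the research community data for exploring how far models can go in understanding instructions, reasoning about conditions, and generating structured output.

Practical applications

In practice, the Tapir-Cleaned dataset supports the development of intelligent automation systems. Models trained on it can be applied to smart homes, workflow automation, and personalized service recommendation. For example, a system can parse a natural-language request such as "turn on warm-colored lights when it rains" and convert it into an executable device-control rule, lowering the technical barrier for users. This makes human-computer interaction more direct and provides a data foundation for conversational AI assistants and automation platforms.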

Recent research on the dataset