AdamCodd/no_robots-alpaca

Name: AdamCodd/no_robots-alpaca
Creator: AdamCodd
Published: 2024-06-17 18:37:04
License: 暂无描述

Hugging Face2024-06-17 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/AdamCodd/no_robots-alpaca

下载链接

链接失效反馈

官方服务：

资源简介：

No Robots是一个高质量的数据集，包含10,000条由熟练的人类注释者创建的指令和演示。该数据集可用于监督微调（SFT），以使语言模型更好地遵循指令。No Robots模仿了OpenAI的InstructGPT论文中描述的指令数据集，主要由单轮指令组成，涵盖了生成、开放问答、头脑风暴、聊天、重写、总结、编码、分类、封闭问答和提取等多个类别。数据集的结构包括训练和测试分割，每个实例包含提示、提示ID、消息数组和类别字段。

No Robots is a high-quality dataset containing 10,000 instructions and demonstrations created by skilled human annotators. This dataset can be used for Supervised Fine-tuning (SFT) to enable language models to better follow instructions. No Robots mimics the instruction dataset described in OpenAI's InstructGPT paper, and it mainly comprises single-turn instructions covering multiple categories, including generation, open-ended question answering, brainstorming, chatting, rewriting, summarization, coding, classification, closed-ended question answering and extraction. The dataset structure includes training and test splits, with each instance containing a prompt, prompt ID, message array and category field.

提供机构：

AdamCodd

原始信息汇总

No Robots: Alpaca edition

数据集概述

数据集名称: No Robots Alpaca
许可协议: CC BY-NC 4.0
任务类别:
- 文本生成
- 对话生成
语言: 英语
数据集大小: 10K<n<100K

数据集描述

该数据集是No Robots数据集的清理和重新格式化版本，适应了Alpaca指令集。特别地，它对“聊天”类别进行了分解，以适应Alpaca对多轮交互的限制。数据集的ID使用SHA256算法生成。此外，只有“分类”、“总结”、“重写”、“提取”和“聊天”类别包含<b>输入</b>字段。

原数据集描述

数据集名称: No Robots
数据集摘要:
- 包含10,000条由专业标注人员创建的指令和演示。
- 适用于监督微调（SFT），以提高语言模型遵循指令的能力。
- 主要包含单轮指令，涵盖以下类别：

类别	数量
生成	4560
开放问答	1240
头脑风暴	1120
聊天	850
重写	660
总结	420
编码	350
分类	350
封闭问答	260
提取	190

支持的任务和排行榜

MT-Bench: 多轮基准测试，涵盖80个对话和10个领域。
AlpacaEval: 单轮基准测试，评估聊天和指令模型对text-davinci-003的性能。

数据结构

数据实例: json { prompt: Bunny is a chatbot that stutters, and acts timid and unsure of its answers., prompt_id: 2dc7ea89a2b6a2ed97d4eda07903162a801824261d3d3ae4dd2513db66fd79c8, messages: [ {content: Bunny is a chatbot that stutters, and acts timid and unsure of its answers., role: system}, {content: When was the Libary of Alexandria burned down?, role: user}, {content: "Umm, I-I think that was in 48 BC, b-but Im not sure, Im sorry.", role: assistant}, {content: Who is the founder of Coca-Cola?, role: user}, {content: "D-dont quote me on this, but I- it might be John Pemberton.", role: assistant}, {content: "When did Loyle Carners debut album come out, and what was its name?", role: user}, {content: "I-It could have b-been on the 20th January of 2017, and it might be called Yesterdays Gone, b-but Im probably wrong.", role: assistant} ], category: Chat }
数据字段:
- prompt: 描述模型应执行的任务。
- prompt_id: 提示的唯一ID。
- messages: 消息数组，每个消息包含角色（系统、用户、助手）和内容。
- category: 示例所属的类别（例如“聊天”或“编码”）。

数据分割

	train_sft	test_sft
no_robots	9500	500

许可信息

该数据集在Creative Commons NonCommercial (CC BY-NC 4.0)许可下可用。

引用信息

@misc{no_robots, author = {Nazneen Rajani and Lewis Tunstall and Edward Beeching and Nathan Lambert and Alexander M. Rush and Thomas Wolf}, title = {No Robots}, year = {2023}, publisher = {Hugging Face}, journal = {Hugging Face repository}, howpublished = {url{https://huggingface.co/datasets/HuggingFaceH4/no_robots}} }

搜集汇总

数据集介绍

背景与挑战

背景概述

该数据集是一个高质量的人工标注指令集，用于监督微调语言模型，包含10,000条指令，覆盖多种任务类别，如生成、问答和聊天等。它经过清理和重新格式化以与Alpaca指令集兼容，特别处理了聊天类别以适应单轮交互。数据集结构清晰，包含训练和测试分割，适用于提升模型遵循指令的能力。

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集