
suifengmangbu/sample

Hugging Face | Updated 2023-12-12 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/suifengmangbu/sample
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: category
    dtype: string
  splits:
  - name: train_sft
    num_bytes: 16496867
    num_examples: 9500
  - name: test_sft
    num_bytes: 887460
    num_examples: 500
  download_size: 11045465
  dataset_size: 17384327
task_categories:
- conversational
- text-generation
language:
- en
pretty_name: No Robots
license: cc-by-nc-4.0
---

# Dataset Card for No Robots 🙅‍♂️🤖

_Look Ma, an instruction dataset that wasn't generated by GPTs!_

## Dataset Description

- **Repository:** https://github.com/huggingface/alignment-handbook
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** Lewis Tunstall

### Dataset Summary

No Robots is a high-quality dataset of 10,000 instructions and demonstrations created by skilled human annotators. This data can be used for supervised fine-tuning (SFT) to make language models follow instructions better. No Robots was modelled after the instruction dataset described in OpenAI's [InstructGPT paper](https://huggingface.co/papers/2203.02155), and consists mostly of single-turn instructions across the following categories:

| Category | Count |
|:-----------|--------:|
| Generation | 4560 |
| Open QA | 1240 |
| Brainstorm | 1120 |
| Chat | 850 |
| Rewrite | 660 |
| Summarize | 420 |
| Coding | 350 |
| Classify | 350 |
| Closed QA | 260 |
| Extract | 190 |

### Supported Tasks and Leaderboards

The No Robots dataset is designed for instruction fine-tuning of pretrained language models, and we recommend benchmarking against the following:

* [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench): a multi-turn benchmark spanning 80 dialogues and 10 domains.
* [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval): a single-turn benchmark which evaluates the performance of chat and instruct models against `text-davinci-003`.

Note that MT-Bench and AlpacaEval rely on LLMs like GPT-4 to judge the quality of the model responses, and thus the rankings exhibit various biases, including a preference for models distilled from GPTs. As a result, you may find that models trained with No Robots score lower than models trained on synthetic datasets. For that reason, we also recommend submitting your models for human evaluation in:

* [Chatbot Arena](https://chat.lmsys.org): a live, human evaluation of chat models in head-to-head comparisons.

### Languages

The data in No Robots are in English (BCP-47 en).

## Dataset Structure

### Data Instances

An example from the `train_sft` or `test_sft` split looks as follows:

```
{'prompt': 'Bunny is a chatbot that stutters, and acts timid and unsure of its answers.',
 'prompt_id': '2dc7ea89a2b6a2ed97d4eda07903162a801824261d3d3ae4dd2513db66fd79c8',
 'messages': [{'content': 'Bunny is a chatbot that stutters, and acts timid and unsure of its answers.',
   'role': 'system'},
  {'content': 'When was the Libary of Alexandria burned down?',
   'role': 'user'},
  {'content': "Umm, I-I think that was in 48 BC, b-but I'm not sure, I'm sorry.",
   'role': 'assistant'},
  {'content': 'Who is the founder of Coca-Cola?', 'role': 'user'},
  {'content': "D-don't quote me on this, but I- it might be John Pemberton.",
   'role': 'assistant'},
  {'content': "When did Loyle Carner's debut album come out, and what was its name?",
   'role': 'user'},
  {'content': "I-It could have b-been on the 20th January of 2017, and it might be called Yesterday's Gone, b-but I'm probably wrong.",
   'role': 'assistant'}],
 'category': 'Chat'}
```

### Data Fields

The data fields are as follows:

* `prompt`: Describes the task the model should perform.
* `prompt_id`: A unique ID for the prompt.
* `messages`: An array of messages, where each message indicates the role (system, user, assistant) and the content.
* `category`: Which category the example belongs to (e.g. `Chat` or `Coding`).

### Data Splits

| | train_sft | test_sft |
|---------------|------:| ---: |
| no_robots | 9500 | 500 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is available under the [Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license.

### Citation Information

```
@misc{no_robots,
  author = {Nazneen Rajani and Lewis Tunstall and Edward Beeching and Nathan Lambert and Alexander M. Rush and Thomas Wolf},
  title = {No Robots},
  year = {2023},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/datasets/HuggingFaceH4/no_robots}}
}
```
Provider:
suifengmangbu
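Summary of original information

Dataset Card for No Robots 🙅‍♂️🤖

Dataset Description

Dataset Summary

No Robots is a high-quality dataset of 10,000 instructions and demonstrations created by skilled human annotators. It can be used for supervised fine-tuning (SFT) to make language models follow instructions better. The dataset consists mostly of single-turn instructions across the following categories: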

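| Category | Count |
|:-----------|--------:|
| Generation | 4560 |
| Open QA | 1240 |
| Brainstorm | 1120 |
| Chat | 850 |
| Rewrite | 660 |
| Summarize | 420 |
| Coding | 350 |
| Classify | 350 |
| Closed QA | 260 |
| Extract | 190 |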
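
As a quick sanity check on the table above, the category counts can be tallied in a few lines of Python. This is only a sketch: the numbers are copied by hand from the table, and the variable names are illustrative rather than part of the dataset.

```python
# Category counts copied from the table above.
category_counts = {
    "Generation": 4560,
    "Open QA": 1240,
    "Brainstorm": 1120,
    "Chat": 850,
    "Rewrite": 660,
    "Summarize": 420,
    "Coding": 350,
    "Classify": 350,
    "Closed QA": 260,
    "Extract": 190,
}

# The total should match the 10,000 examples stated in the summary.
total = sum(category_counts.values())

# Share of each category as a percentage of all examples.
shares = {name: 100 * count / total for name, count in category_counts.items()}

print(total)  # 10000
for name, share in shares.items():
    print(f"{name:<11}{share:5.1f}%")
```

The counts add up to exactly 10,000, and Generation alone accounts for a little under half of the examples.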

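Supported Tasks and Leaderboards

The No Robots dataset is designed for instruction fine-tuning of pretrained language models. Benchmarking against the following is recommended:

  • MT-Bench: a multi-turn benchmark spanning 80 dialogues and 10 domains.
  • AlpacaEval: a single-turn benchmark that evaluates the performance of chat and instruct models against text-davinci-003.

Languages

The data in No Robots are in English (BCP-47 en).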

Dataset Structure

Data Instances

An example from the `train_sft` or `test_sft` split looks as follows:

```json
{
  "prompt": "Bunny is a chatbot that stutters, and acts timid and unsure of its answers.",
  "prompt_id": "2dc7ea89a2b6a2ed97d4eda07903162a801824261d3d3ae4dd2513db66fd79c8",
  "messages": [
    {"content": "Bunny is a chatbot that stutters, and acts timid and unsure of its answers.", "role": "system"},
    {"content": "When was the Libary of Alexandria burned down?", "role": "user"},
    {"content": "Umm, I-I think that was in 48 BC, b-but I'm not sure, I'm sorry.", "role": "assistant"},
    {"content": "Who is the founder of Coca-Cola?", "role": "user"},
    {"content": "D-don't quote me on this, but I- it might be John Pemberton.", "role": "assistant"},
    {"content": "When did Loyle Carner's debut album come out, and what was its name?", "role": "user"},
    {"content": "I-It could have b-been on the 20th January of 2017, and it might be called Yesterday's Gone, b-but I'm probably wrong.", "role": "assistant"}
  ],
  "category": "Chat"
}
```
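
To make the layout of `messages` concrete, here is a minimal sketch of how such a conversation might be split into an optional system prompt and a list of (user, assistant) turns. The helper `split_conversation` is hypothetical and not part of the dataset or any library; it assumes user and assistant messages alternate, as they do in the example above.

```python
from typing import Dict, List, Optional, Tuple

Message = Dict[str, str]


def split_conversation(
    messages: List[Message],
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Return (system_prompt, [(user_text, assistant_text), ...])."""
    system_prompt: Optional[str] = None
    turns: List[Tuple[str, str]] = []
    pending_user: Optional[str] = None
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            system_prompt = content
        elif role == "user":
            # Remember the question until the matching answer arrives.
            pending_user = content
        elif role == "assistant" and pending_user is not None:
            turns.append((pending_user, content))
            pending_user = None
    return system_prompt, turns


# A shortened version of the example record's messages.
example_messages = [
    {"content": "Bunny is a chatbot that stutters, and acts timid and unsure of its answers.", "role": "system"},
    {"content": "Who is the founder of Coca-Cola?", "role": "user"},
    {"content": "D-don't quote me on this, but I- it might be John Pemberton.", "role": "assistant"},
]

system_prompt, turns = split_conversation(example_messages)
print(len(turns))  # 1
```

Applied to the full example record, the same function would return three turns, one per user question.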

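Data Fields

The data fields are as follows:

  • `prompt`: Describes the task the model should perform.
  • `prompt_id`: A unique ID for the prompt.
  • `messages`: An array of messages, where each message contains a role (system, user, assistant) and the content.
  • `category`: The category the example belongs to (e.g. `Chat` or `Coding`).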
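
The field list above can be turned into a lightweight structural check. The sketch below is an illustration only: `validate_record`, `EXPECTED_FIELDS`, and `ALLOWED_ROLES` are hypothetical names, and the rules encode nothing beyond the four fields and three roles described above.

```python
from typing import Any, Dict, List

# Field names and Python types implied by the field list above.
EXPECTED_FIELDS = {"prompt": str, "prompt_id": str, "messages": list, "category": str}
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems: List[str] = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    messages = record.get("messages")
    if isinstance(messages, list):
        for index, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"messages[{index}] should be a dict")
                continue
            if message.get("role") not in ALLOWED_ROLES:
                problems.append(
                    f"messages[{index}] has unexpected role: {message.get('role')!r}"
                )
            if not isinstance(message.get("content"), str):
                problems.append(f"messages[{index}] content should be str")
    return problems


record = {
    "prompt": "Bunny is a chatbot that stutters, and acts timid and unsure of its answers.",
    "prompt_id": "2dc7ea89a2b6a2ed97d4eda07903162a801824261d3d3ae4dd2513db66fd79c8",
    "messages": [
        {"content": "Who is the founder of Coca-Cola?", "role": "user"},
        {"content": "D-don't quote me on this, but I- it might be John Pemberton.", "role": "assistant"},
    ],
    "category": "Chat",
}
print(validate_record(record))  # []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.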

Data Splits

| | train_sft | test_sft |
|---------------|------:| ---: |
| no_robots | 9500 | 500 |
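
A minimal sketch of loading one of these splits follows. It assumes the Hugging Face `datasets` library is installed and that the data is fetched from the `HuggingFaceH4/no_robots` repository named in the citation; whether the `suifengmangbu/sample` copy listed on this page behaves identically is an assumption. The helper `load_split` is hypothetical, and the network call is not executed when the snippet is run as-is.

```python
# Split sizes as listed in the table above.
SPLIT_SIZES = {"train_sft": 9500, "test_sft": 500}


def load_split(split: str, repo_id: str = "HuggingFaceH4/no_robots"):
    """Load one split with the Hugging Face `datasets` library (needs network access)."""
    if split not in SPLIT_SIZES:
        raise ValueError(f"unknown split: {split!r}")
    # Imported lazily so the rest of the snippet runs without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


# Fraction of all examples held out for testing.
test_fraction = SPLIT_SIZES["test_sft"] / sum(SPLIT_SIZES.values())
print(f"{test_fraction:.0%}")  # 5%
```

Calling `load_split("train_sft")` should return a dataset whose length matches the table, provided the hosted files agree with the card.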

Dataset Creation

Licensing Information

The dataset is available under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license.

Citation Information

```plaintext
@misc{no_robots,
  author = {Nazneen Rajani and Lewis Tunstall and Edward Beeching and Nathan Lambert and Alexander M. Rush and Thomas Wolf},
  title = {No Robots},
  year = {2023},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/datasets/HuggingFaceH4/no_robots}}
}
```

The content above was collected and summarized by 遇见数据集.