
nataliaElv/dolly_tutorial

Hugging Face, updated 2023-06-26, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/nataliaElv/dolly_tutorial
Resource description:
---
size_categories: 10K<n<100K
tags:
- rlfh
- argilla
- human-feedback
---

# Dataset Card for dolly_tutorial

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file conforming to the Argilla dataset format named `argilla.cfg`. This configuration file will be used to configure the dataset when using the `FeedbackDataset.from_huggingface` method in Argilla.
* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `FeedbackDataset.from_huggingface` and can be loaded independently using the `datasets` library via `load_dataset`.
* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.

### Load with Argilla

To load with Argilla, you'll just need to install Argilla as `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("nataliaElv/dolly_tutorial")
```

### Load with `datasets`

To load this dataset with `datasets`, you'll just need to install `datasets` as `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("nataliaElv/dolly_tutorial")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/guides/llms/conceptual_guides/data_model.html) so it can be used for different NLP tasks, depending on the configuration. The dataset structure is described in the [Dataset Structure section](#dataset-structure).

There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, and **guidelines**.

The **fields** are the dataset records themselves; for the moment only text fields are supported. These are the ones that will be used to provide responses to the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| category | Task category | TextField | True | False |
| instruction | Instruction | TextField | True | False |
| context | Input | TextField | True | False |
| response | Response | TextField | True | False |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| final-instruction | Final Instruction: | TextQuestion | True | Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here. | N/A |
| final-context | Final Input: | TextQuestion | True | Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. Leave this question empty in the case of these task categories: open / general Q&A, brainstorming, creative writing. | N/A |
| final-response | Final Response: | TextQuestion | True | Write the final version of the response, making sure that it matches the task category and makes sense for the instruction (and input) provided. If the original response is ok, copy and paste it here. Make sure that the grammar and orthography are correct. | N/A |

Finally, the **guidelines** are just a plain string that can be used to provide instructions to the annotators. Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": "0",
    "fields": {
        "category": "closed_qa",
        "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\u0027s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
        "instruction": "When did Virgin Australia start operating?",
        "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route."
    },
    "metadata": null,
    "responses": [
        {
            "status": "submitted",
            "user_id": "dc9c373f-c589-4845-b7e8-890520ca7d43",
            "values": {
                "final-context": {
                    "value": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\u0027s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."
                },
                "final-instruction": {
                    "value": "When did Virgin Australia start operating?"
                },
                "final-response": {
                    "value": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue."
                }
            }
        }
    ]
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "category": "closed_qa",
    "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\u0027s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
    "external_id": "0",
    "final-context": {
        "status": ["submitted"],
        "user_id": ["dc9c373f-c589-4845-b7e8-890520ca7d43"],
        "value": ["Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\u0027s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."]
    },
    "final-instruction": {
        "status": ["submitted"],
        "user_id": ["dc9c373f-c589-4845-b7e8-890520ca7d43"],
        "value": ["When did Virgin Australia start operating?"]
    },
    "final-response": {
        "status": ["submitted"],
        "user_id": ["dc9c373f-c589-4845-b7e8-890520ca7d43"],
        "value": ["Virgin Australia commenced services on 31 August 2000 as Virgin Blue."]
    },
    "instruction": "When did Virgin Australia start operating?",
    "metadata": null,
    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route."
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves; for the moment only text fields are supported. These are the ones that will be used to provide responses to the questions.

    * **category** is of type `TextField`.
    * **instruction** is of type `TextField`.
    * (optional) **context** is of type `TextField`.
    * **response** is of type `TextField`.

* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.

    * **final-instruction** is of type `TextQuestion`, and description "Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here.".
    * (optional) **final-context** is of type `TextQuestion`, and description "Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. Leave this question empty in the case of these task categories: open / general Q&A, brainstorming, creative writing.".
    * **final-response** is of type `TextQuestion`, and description "Write the final version of the response, making sure that it matches the task category and makes sense for the instruction (and input) provided. If the original response is ok, copy and paste it here. Make sure that the grammar and orthography are correct.".

Additionally, there is one more optional field:

* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation guidelines

# Introduction

In this dataset, you will find a collection of records that show a task category, an instruction, an input and a response. The aim of the project is to correct the instructions, inputs and responses to make sure they are of the highest quality and that they match the task category that they belong to. All three texts should be clear and include real information.

# Task categories

Instructions are classified according to 7 possible task categories. Please, read and understand these categories because they will change the way the instruction and input are formulated. The definitions are based on those made by [Databricks](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm).

## Open / General Q&A

Here you will find an open question, for instance, “Why do people like comedy movies?” or “What is the capital of France?”. In some cases, there’s not a correct answer, and in others, it requires drawing on knowledge of the world at large. This type of task shouldn’t have an input.

## Closed Q&A

These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, “What is the ratio between protons and neutrons in the nucleus?”. In this case, the task should have an instruction (the question), an input (the reference text) and a response.

## Information extraction

In this task, the instruction will ask to extract entities or other factual information from a passage. In this case, the task should have an instruction (the question), an input (the reference text) and a response.

## Summarization

Instructions of this kind of task will ask to summarize a passage. The passage should be in the input. The response should be a summarized version of the passage.

## Brainstorming

Brainstorming instructions should ask for open-ended ideation and an associated list of possible options. For instance, “What are some fun activities I can do with my friends this weekend?”.

## Classification

Instructions of this type should ask to make judgments about class membership (e.g. are the items in a list animals, minerals or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review. The item(s) to be classified should appear in the input field.

## Creative writing

Instructions of this class include things like writing a poem or a love letter.

# Questionnaire

To curate the dataset, you will need to provide an answer to the questions below. Please, follow the pointers below to answer each question accordingly.

If you are not sure about a record and you prefer not to provide a response, click Discard.

## 1. Final instruction:

- The final version of the instruction field. You may copy it using the copy icon in the instruction field.
- Leave it as it is if it's ok or apply any necessary corrections.
- Remember to change the instruction if it doesn't represent well the task category of the record.
- Instructions can contain grammar and orthography errors as long as they are clear.

## 2. Final input:

- The final version of the input field. You may copy it using the copy icon in the input field.
- Leave it as it is if it's ok or apply any necessary corrections.
- Remember to add an input to the tasks that need one: closed Q&A, information extraction, summarization and classification.
- This question should be blank whenever the task doesn’t need one: open / general Q&A, brainstorming, creative writing.
- Inputs can contain grammar and orthography errors as long as they are clear.

## 3. Final response:

- The final version of the response field.
- You may copy it using the copy icon in the response field.
- Leave it as it is if it's ok or apply any necessary corrections.
- Check that the response makes sense given all the fields above and that it is as complete and concise as possible.
- Responses should have their grammar and orthography checked and correct at all times.

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
Provided by:
nataliaElv
Summary of original information

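Dataset Card for dolly_tutorial

Dataset Description

Dataset Summary

This dataset contains:

  • A configuration file named `argilla.cfg` that conforms to the Argilla dataset format. It is used to configure the dataset when calling Argilla's `FeedbackDataset.from_huggingface` method.
  • Dataset records compatible with HuggingFace `datasets`. They are loaded automatically by `FeedbackDataset.from_huggingface` and can also be loaded independently with the `datasets` library.
  • The annotation guidelines used to build and curate the dataset, if they have been defined in Argilla.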

Loading

Load with Argilla

Install Argilla, then load the dataset with the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("nataliaElv/dolly_tutorial")
```

Load with `datasets`

Install the `datasets` library, then load the dataset with the following code:

```python
from datasets import load_dataset

ds = load_dataset("nataliaElv/dolly_tutorial")
```

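Supported Tasks and Leaderboards

This dataset can contain multiple fields, questions and responses, so it can be used for different NLP tasks depending on the configuration. The dataset structure is described in the Dataset Structure section.

There are no leaderboards associated with this dataset.

Languages

[More Information Needed]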

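Dataset Structure

Data in Argilla

The dataset is created in Argilla with fields, questions, and guidelines.

The fields are the dataset records themselves; for the moment only text fields are supported. These are the ones that will be used to provide responses to the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| category | Task category | TextField | True | False |
| instruction | Instruction | TextField | True | False |
| context | Input | TextField | True | False |
| response | Response | TextField | True | False |

The questions are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| final-instruction | Final Instruction: | TextQuestion | True | Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here. | N/A |
| final-context | Final Input: | TextQuestion | True | Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. Leave this question empty for these task categories: open / general Q&A, brainstorming, creative writing. | N/A |
| final-response | Final Response: | TextQuestion | True | Write the final version of the response, making sure that it matches the task category and makes sense for the instruction (and input) provided. If the original response is ok, copy and paste it here. Make sure that the grammar and orthography are correct. | N/A |

The guidelines are a plain string that can be used to provide instructions to the annotators. See the Annotation guidelines section.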

Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": "0",
    "fields": {
        "category": "closed_qa",
        "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
        "instruction": "When did Virgin Australia start operating?",
        "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route."
    },
    "metadata": null,
    "responses": [
        {
            "status": "submitted",
            "user_id": "dc9c373f-c589-4845-b7e8-890520ca7d43",
            "values": {
                "final-context": {
                    "value": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."
                },
                "final-instruction": {
                    "value": "When did Virgin Australia start operating?"
                },
                "final-response": {
                    "value": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue."
                }
            }
        }
    ]
}
```

The same record in HuggingFace `datasets` looks as follows:

```json
{
    "category": "closed_qa",
    "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
    "external_id": "0",
    "final-context": {
        "status": ["submitted"],
        "user_id": ["dc9c373f-c589-4845-b7e8-890520ca7d43"],
        "value": ["Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."]
    },
    "final-instruction": {
        "status": ["submitted"],
        "user_id": ["dc9c373f-c589-4845-b7e8-890520ca7d43"],
        "value": ["When did Virgin Australia start operating?"]
    },
    "final-response": {
        "status": ["submitted"],
        "user_id": ["dc9c373f-c589-4845-b7e8-890520ca7d43"],
        "value": ["Virgin Australia commenced services on 31 August 2000 as Virgin Blue."]
    },
    "instruction": "When did Virgin Australia start operating?",
    "metadata": null,
    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route."
}
```
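The relationship between the two layouts can be sketched in a few lines of Python. This is a minimal illustration inferred only from the two example records; `flatten_record` is a hypothetical helper, not part of Argilla or `datasets`, and the sample record is abbreviated for brevity.

```python
from collections import defaultdict


def flatten_record(record):
    """Flatten an Argilla-style record into the flat `datasets`-style layout.

    Hypothetical helper: the mapping is inferred from the two example
    records, not taken from the Argilla source code.
    """
    # The text fields (category, context, instruction, response) move to the top level.
    flat = dict(record["fields"])
    flat["external_id"] = record.get("external_id")
    flat["metadata"] = record.get("metadata")

    # Each question becomes one entry holding parallel lists, one item per annotator.
    per_question = defaultdict(lambda: {"status": [], "user_id": [], "value": []})
    for response in record.get("responses") or []:
        for question, answer in response["values"].items():
            per_question[question]["status"].append(response["status"])
            per_question[question]["user_id"].append(response["user_id"])
            per_question[question]["value"].append(answer["value"])
    flat.update(per_question)
    return flat


# Abbreviated version of the example record (the context is shortened).
record = {
    "external_id": "0",
    "fields": {
        "category": "closed_qa",
        "context": "Virgin Australia ... commenced services on 31 August 2000 ...",
        "instruction": "When did Virgin Australia start operating?",
        "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",
    },
    "metadata": None,
    "responses": [
        {
            "status": "submitted",
            "user_id": "dc9c373f-c589-4845-b7e8-890520ca7d43",
            "values": {
                "final-instruction": {"value": "When did Virgin Australia start operating?"},
                "final-response": {"value": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue."},
            },
        }
    ],
}

flat = flatten_record(record)
print(flat["final-response"]["value"])
```

Storing parallel lists per question lets a single row carry answers from several annotators without changing the column schema.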

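Data Fields

Among the dataset fields, we differentiate between the following:

  • Fields: These are the dataset records themselves; for the moment only text fields are supported. These are the ones that will be used to provide responses to the questions.

    • category is of type `TextField`.
    • instruction is of type `TextField`.
    • context is of type `TextField` (optional).
    • response is of type `TextField`.
  • Questions: These are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.

    • final-instruction is of type `TextQuestion`, with the description "Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here."
    • final-context is of type `TextQuestion` (optional), with the description "Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. Leave this question empty in the case of these task categories: open / general Q&A, brainstorming, creative writing."
    • final-response is of type `TextQuestion`, with the description "Write the final version of the response, making sure that it matches the task category and makes sense for the instruction (and input) provided. If the original response is ok, copy and paste it here. Make sure that the grammar and orthography are correct."

There is also one optional field:

  • external_id: An optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.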
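As a rough sketch of how these fields and questions might be consumed downstream, the snippet below picks the curated text for each field from a flat `datasets`-style row. `curated_example` is a hypothetical helper, and falling back to the original field when no submitted answer exists is an assumption, not a rule stated by the dataset card.

```python
def curated_example(row):
    """Return the curated (category, instruction, context, response) for one row.

    Assumption: the first answer with status "submitted" wins; if there is
    none, the original field value is kept.
    """

    def pick(question, fallback_field):
        answers = row.get(question) or {}
        statuses = answers.get("status", [])
        values = answers.get("value", [])
        for status, value in zip(statuses, values):
            if status == "submitted":
                return value
        return row[fallback_field]

    return {
        "category": row["category"],
        "instruction": pick("final-instruction", "instruction"),
        "context": pick("final-context", "context"),
        "response": pick("final-response", "response"),
    }


# Abbreviated row in the flat layout; the annotator shortened the response.
row = {
    "category": "closed_qa",
    "instruction": "When did Virgin Australia start operating?",
    "context": "Virgin Australia ... commenced services on 31 August 2000 ...",
    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",
    "final-instruction": {"status": [], "user_id": [], "value": []},
    "final-context": {"status": [], "user_id": [], "value": []},
    "final-response": {
        "status": ["submitted"],
        "user_id": ["dc9c373f-c589-4845-b7e8-890520ca7d43"],
        "value": ["Virgin Australia commenced services on 31 August 2000 as Virgin Blue."],
    },
}

example = curated_example(row)
print(example["response"])
```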

Data Splits

The dataset contains a single split, which is `train`.

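Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation guidelines

Introduction

In this dataset, you will find a collection of records that show a task category, an instruction, an input and a response. The aim of the project is to correct the instructions, inputs and responses to make sure they are of the highest quality and that they match the task category they belong to. All three texts should be clear and include real information.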

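Task categories

Instructions are classified according to 7 possible task categories. Please read and understand these categories, because they change the way the instruction and input are formulated. The definitions are based on those made by Databricks.

Open / General Q&A

Here you will find an open question, for instance, "Why do people like comedy movies?" or "What is the capital of France?". In some cases there is no single correct answer, and in others answering requires drawing on knowledge of the world at large. This type of task should not have an input.

Closed Q&A

These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, "What is the ratio between protons and neutrons in the nucleus?". In this case, the task should have an instruction (the question), an input (the reference text) and a response.

Information extraction

In this task, the instruction asks to extract entities or other factual information from a passage. In this case, the task should have an instruction (the question), an input (the reference text) and a response.

Summarization

Instructions of this kind ask to summarize a passage. The passage should be in the input. The response should be a summarized version of the passage.

Brainstorming

Brainstorming instructions should ask for open-ended ideation and an associated list of possible options. For instance, "What are some fun activities I can do with my friends this weekend?".

Classification

Instructions of this type should ask to make judgments about class membership (e.g. are the items in a list animals, minerals or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review. The item(s) to be classified should appear in the input field.

Creative writing

Instructions of this class include things like writing a poem or a love letter.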
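The input rule for each category can be captured in a small lookup. This is only a sketch: of the identifiers below, just `closed_qa` appears in the example record; the other category keys are assumed spellings and should be checked against the actual `category` values in the dataset.

```python
# Categories whose records should carry an input (reference text or items to classify).
# Only "closed_qa" is confirmed by the example record; the other keys are assumptions.
CATEGORIES_WITH_INPUT = {
    "closed_qa",
    "information_extraction",
    "summarization",
    "classification",
}

# Categories whose input should be left blank.
CATEGORIES_WITHOUT_INPUT = {
    "open_qa",
    "brainstorming",
    "creative_writing",
}


def needs_input(category):
    """Return True if records of this task category should have an input."""
    if category in CATEGORIES_WITH_INPUT:
        return True
    if category in CATEGORIES_WITHOUT_INPUT:
        return False
    raise ValueError(f"unknown task category: {category!r}")


print(needs_input("closed_qa"))
print(needs_input("brainstorming"))
```

Raising on unknown categories, rather than guessing a default, makes a misspelled or unexpected category visible early.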

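Questionnaire

To curate the dataset, you will need to provide an answer to the questions below. Please follow the pointers below to answer each question accordingly.

If you are not sure about a record and you prefer not to provide a response, click Discard.

1. Final instruction:

  • The final version of the instruction field. You may copy it using the copy icon in the instruction field.
  • Leave it as it is if it's ok, or apply any necessary corrections.
  • Remember to change the instruction if it doesn't represent the task category of the record well.
  • Instructions can contain grammar and orthography errors as long as they are clear.

2. Final input:

  • The final version of the input field. You may copy it using the copy icon in the input field.
  • Leave it as it is if it's ok, or apply any necessary corrections.
  • Remember to add an input to the tasks that need one: closed Q&A, information extraction, summarization and classification.
  • Leave this question blank for the task categories that don't need one: open / general Q&A, brainstorming, creative writing.
  • Inputs can contain grammar and orthography errors as long as they are clear.

3. Final response:

  • The final version of the response field.
  • You may copy it using the copy icon in the response field.
  • Leave it as it is if it's ok, or apply any necessary corrections.
  • Check that the response makes sense given all the fields above and that it is as complete and concise as possible.
  • Responses should have their grammar and orthography checked and correct at all times.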
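The structural parts of this questionnaire (which answers must be filled in and which must stay blank) can be checked mechanically. The sketch below assumes the answers arrive as a plain dict keyed by question name; `check_submission` is a hypothetical helper, and whether a category needs an input is passed in by the caller. Grammar, clarity and factual checks still need a human reviewer.

```python
def check_submission(values, needs_input):
    """Return a list of problems found in one annotator submission.

    `values` maps question names ("final-instruction", "final-context",
    "final-response") to the submitted text. `needs_input` tells whether the
    record's task category requires an input.
    """
    problems = []

    # The instruction and the response are always expected.
    if not (values.get("final-instruction") or "").strip():
        problems.append("final-instruction is empty")
    if not (values.get("final-response") or "").strip():
        problems.append("final-response is empty")

    # The input must be present or blank depending on the task category.
    has_input = bool((values.get("final-context") or "").strip())
    if needs_input and not has_input:
        problems.append("final-context is required for this task category")
    if not needs_input and has_input:
        problems.append("final-context should be left blank for this task category")

    return problems


# A closed Q&A submission without reference text is flagged.
closed_qa_problems = check_submission(
    {
        "final-instruction": "When did Virgin Australia start operating?",
        "final-context": "",
        "final-response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue.",
    },
    needs_input=True,
)
print(closed_qa_problems)
```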

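Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]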
