
argilla/prompt-collective

Hugging Face · Updated 2024-02-21 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/argilla/prompt-collective
Resource description:
---
size_categories: 1K<n<10K
tags:
- rlfh
- argilla
- human-feedback
---

# Dataset Card for prompt-collective

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file conforming to the Argilla dataset format named `argilla.yaml`. This configuration file will be used to configure the dataset when using the `FeedbackDataset.from_huggingface` method in Argilla.
* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `FeedbackDataset.from_huggingface` and can be loaded independently using the `datasets` library via `load_dataset`.
* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.

### Load with Argilla

To load with Argilla, you'll just need to install Argilla as `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("argilla/prompt-collective")
```

### Load with `datasets`

To load this dataset with `datasets`, you'll just need to install `datasets` as `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("argilla/prompt-collective")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#feedback-dataset) so it can be used for different NLP tasks, depending on the configuration. The dataset structure is described in the [Dataset Structure section](#dataset-structure).

There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.

The **fields** are the dataset records themselves, for the moment just text fields are supported. These are the ones that will be used to provide responses to the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| prompt | Prompt | text | True | True |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| quality | Rate the quality of the prompt | label_selection | True | N/A | ['0', '1', '2', '3', '4'] |

The **suggestions** are human- or machine-generated recommendations for each question that assist the annotator during the annotation process. They are always linked to an existing question, and their columns are named by appending "-suggestion" and "-suggestion-metadata" to the question name, containing the value(s) of the suggestion and its metadata, respectively. Accordingly, the possible values are the same as in the table above.

The **metadata** is a dictionary that can be used to provide additional information about the dataset record. This can be useful to provide additional context to the annotators, or to describe the record itself, for example with a link to its original source, the author, or the date. The metadata is always optional, and can be linked to the `metadata_properties` defined in the dataset configuration file in `argilla.yaml`.

| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |

The **guidelines** are optional as well, and are just a plain string that can be used to provide instructions to the annotators. Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": null,
    "fields": {
        "prompt": "Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process."
    },
    "metadata": {
        "evolved_from": null,
        "kind": "synthetic",
        "source": "ultrachat"
    },
    "responses": [
        {
            "status": "submitted",
            "user_id": "d23b12c2-b601-490e-b5b3-2040eb393a00",
            "values": {
                "quality": {
                    "value": "4"
                }
            }
        },
        {
            "status": "submitted",
            "user_id": "e2bdd868-f28e-46fc-9254-a6ec1e291889",
            "values": {
                "quality": {
                    "value": "4"
                }
            }
        }
    ],
    "suggestions": [],
    "vectors": {}
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "external_id": null,
    "metadata": "{\"source\": \"ultrachat\", \"kind\": \"synthetic\", \"evolved_from\": null}",
    "prompt": "Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process.",
    "quality": [
        {
            "status": "submitted",
            "user_id": "d23b12c2-b601-490e-b5b3-2040eb393a00",
            "value": "4"
        },
        {
            "status": "submitted",
            "user_id": "e2bdd868-f28e-46fc-9254-a6ec1e291889",
            "value": "4"
        }
    ],
    "quality-suggestion": null,
    "quality-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    }
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves, for the moment just text fields are supported. These are the ones that will be used to provide responses to the questions.
    * **prompt** is of type `text`.
* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion`, and `RankingQuestion`.
    * **quality** is of type `label_selection` with the following allowed values ['0', '1', '2', '3', '4'].
* **Suggestions:** As of Argilla 1.13.0, suggestions are included to assist annotators during the annotation process. Suggestions are linked to the existing questions, are always optional, and contain not just the suggestion itself, but also the metadata linked to it, if applicable.
    * (optional) **quality-suggestion** is of type `label_selection` with the following allowed values ['0', '1', '2', '3', '4'].

Additionally, there are two more optional fields:

* **metadata:** This is an optional field that can be used to provide additional information about the dataset record. This can be useful to provide additional context to the annotators, or to describe the record itself, for example with a link to its original source, the author, or the date. The metadata can be linked to the `metadata_properties` defined in the dataset configuration file in `argilla.yaml`.
* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation guidelines

# Task

We are collaboratively creating a database of prompts in English. The annotation guidelines below should help you get started but you can also ask questions in the [Discord Channel](https://discord.com/channels/879548962464493619/1205128865735770142).

Our aim is to identify effective prompts and understand the interaction between AI-generated and human-generated prompts.

The focus is on rating prompts that are clear, interesting and complex for fine-tuning open source LLMs.

What makes a prompt good?

That's a difficult question but here are some aspects:

- The intent of the user is clear.
- The question, instruction or task for the assistant is challenging or interesting because it involves solving a complex problem, reasoning, being creative, etc.

In other words, first of all the intent (what the user asks) should be clear. Then we can look into how interesting and complex the task is. The more interesting the prompt is, the higher the rating should be.

## Guidelines

You need to assign a rating to each prompt, thinking about the complexity for an assistant and whether the intent is clear. A very good prompt is one that is challenging but also very clear in the intent of the user.

You can use keyboard shortcuts (the numbers) to quickly rate the examples.

If you find some pattern, you can also use the search box and filters as well as the bulk labelling mode. Please use this with care and only when you find a clear pattern (e.g., prompts that are completely incorrect and share a common issue).

If you are unsure about your answer, you can click on the tag and then "Save as a draft" to save it for later. If you feel unequipped to rate a specific prompt, you can use the "Discard" button.

## Ratings

### 1. Very Bad:

The prompt doesn't communicate its purpose, is nonsensical or is in a language other than English.

The prompt assumes the usage of tools or capabilities that don't apply to this model, like generating an image or scraping a website.

*Examples:*
>"Do the thing."
>"Hello!"
>"asdajflajfada"
>"Quiero que redactes una entrada de blog."
>"Extract data from a website."
>"Tell me how you feel when someone insults you."

### 2. Bad:

Suggests a goal but lacks clarity and coherence.

*Examples:*
>"Find me stuff about that thing, you know?"
>"Write something."
>"Tell me about this thing."
>"Can you help with this?"
>"I need to know more."

### 3. Ok:

The intent is understandable, but it's missing information to complete the task.

*Examples:*
>"I need information on something important."
>"Write a blogpost."

### 4. Good:

Presents a clear goal and necessary information, effectively directing the AI, but the prompt could be more specific.

*Examples:*
>"Provide a summary of renewable energy sources."
>"Tell me about Sean Connery."
>"Explain global warming."

### 5. Very Good:

Comprehensive and explicit, leaving no room for ambiguity. Perfectly guides the AI and includes details.

*Examples:*
>"Compare the efficiency and environmental impact of solar and wind energy, including recent advancements and case studies from 2023."
>"Make a list of 5 plant-based recipes that I can try that don't have red peppers as an ingredient."

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
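Provider:
argilla
Summary of original information

Dataset Card for prompt-collective

Dataset Description

Dataset Summary

This dataset contains:

  • A dataset configuration file named argilla.yaml that conforms to the Argilla dataset format. It is used to configure the dataset when calling Argilla's FeedbackDataset.from_huggingface method.
  • Dataset records in a format compatible with HuggingFace datasets. These records are loaded automatically by FeedbackDataset.from_huggingface and can also be loaded independently with load_dataset from the datasets library.
  • The annotation guidelines used to build and curate the dataset, if they have been defined in Argilla.

Loading the dataset

Load with Argilla

Install Argilla:

    pip install argilla --upgrade

Load the dataset:

    import argilla as rg

    ds = rg.FeedbackDataset.from_huggingface("argilla/prompt-collective")

Load with datasets

Install datasets:

    pip install datasets --upgrade

Load the dataset:

    from datasets import load_dataset

    ds = load_dataset("argilla/prompt-collective")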

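Supported Tasks and Leaderboards

This dataset can contain multiple fields, questions and responses, so it can be used for different NLP tasks depending on the configuration. The dataset structure is described in the Dataset Structure section.

There are no leaderboards associated with this dataset.

Languages

[More Information Needed]

Dataset Structure

Data in Argilla

The dataset is created in Argilla with: fields, questions, suggestions, metadata, vectors, and guidelines.

Fields

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| prompt | Prompt | text | True | True |

Questions

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| quality | Rate the quality of the prompt | label_selection | True | N/A | ['0', '1', '2', '3', '4'] |

Suggestions

Suggestions are human- or machine-generated recommendations that assist annotators during the annotation process. They are always linked to an existing question, and their columns are named by appending "-suggestion" and "-suggestion-metadata" to the question name, containing the suggested value(s) and the associated metadata, respectively.

Metadata

The metadata is a dictionary that provides additional information about a dataset record. It can give annotators extra context or describe the record itself. The metadata is always optional and can be linked to the metadata_properties defined in argilla.yaml.

Guidelines

The guidelines are optional and are just a plain string used to give instructions to the annotators. See the Annotation guidelines section.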

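Data Instances

Example in Argilla

    {
        "external_id": null,
        "fields": {
            "prompt": "Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process."
        },
        "metadata": {
            "evolved_from": null,
            "kind": "synthetic",
            "source": "ultrachat"
        },
        "responses": [
            {
                "status": "submitted",
                "user_id": "d23b12c2-b601-490e-b5b3-2040eb393a00",
                "values": { "quality": { "value": "4" } }
            },
            {
                "status": "submitted",
                "user_id": "e2bdd868-f28e-46fc-9254-a6ec1e291889",
                "values": { "quality": { "value": "4" } }
            }
        ],
        "suggestions": [],
        "vectors": {}
    }

Example in HuggingFace datasets

    {
        "external_id": null,
        "metadata": "{\"source\": \"ultrachat\", \"kind\": \"synthetic\", \"evolved_from\": null}",
        "prompt": "Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process.",
        "quality": [
            {
                "status": "submitted",
                "user_id": "d23b12c2-b601-490e-b5b3-2040eb393a00",
                "value": "4"
            },
            {
                "status": "submitted",
                "user_id": "e2bdd868-f28e-46fc-9254-a6ec1e291889",
                "value": "4"
            }
        ],
        "quality-suggestion": null,
        "quality-suggestion-metadata": { "agent": null, "score": null, "type": null }
    }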
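
In the HuggingFace `datasets` form of the record, `metadata` is a JSON-encoded string rather than a nested object, so it usually has to be decoded before use. A minimal sketch, assuming a record shaped like the example above; the helper name `decode_metadata` is hypothetical and is not part of Argilla or `datasets`:

```python
import json


def decode_metadata(record: dict) -> dict:
    """Return a copy of a flattened record with `metadata` parsed into a dict."""
    decoded = dict(record)
    raw = record.get("metadata")
    # `metadata` is optional, so tolerate a missing or empty value.
    decoded["metadata"] = json.loads(raw) if raw else {}
    return decoded


# Abbreviated copy of the example record (prompt shortened for brevity).
example = {
    "external_id": None,
    "metadata": '{"source": "ultrachat", "kind": "synthetic", "evolved_from": null}',
    "prompt": "Provide step-by-step instructions ...",
}

record = decode_metadata(example)
print(record["metadata"]["source"])  # ultrachat
```

The function copies the record instead of mutating it, which keeps it safe to use inside a `map`-style transformation.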

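Data Fields

The dataset fields are divided into the following:

  • Fields: These are the dataset records themselves; for the moment only text fields are supported. They are what annotators respond to when answering the questions.

    • prompt: of type text.
  • Questions: These are the questions asked to the annotators. They can be of different types, such as RatingQuestion, TextQuestion, LabelQuestion, MultiLabelQuestion, and RankingQuestion.

    • quality: of type label_selection, with allowed values ['0', '1', '2', '3', '4'].
  • Suggestions: As of Argilla 1.13.0, suggestions are included to assist annotators during the annotation process. Suggestions are linked to the existing questions, are always optional, and contain not just the suggestion itself but also its associated metadata, if applicable.

    • (optional) quality-suggestion: of type label_selection, with allowed values ['0', '1', '2', '3', '4'].

Additionally, there are two optional fields:

  • metadata: An optional field that provides additional information about the dataset record. It can give annotators extra context or describe the record itself, for example with a link to its original source, the author, or the date. The metadata can be linked to the metadata_properties defined in argilla.yaml.
  • external_id: An optional field that provides an external ID for the dataset record. This can be useful if you want to link the record to an external resource, such as a database or a file.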
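
Because each record can carry responses from several annotators, downstream use often needs a single `quality` value per prompt. The sketch below averages the labels of one flattened record. It assumes that only responses with status `submitted` should count and that the label strings are numeric; the helper name `mean_quality` is hypothetical.

```python
from typing import Optional


def mean_quality(record: dict) -> Optional[float]:
    """Average the submitted `quality` labels of one flattened record."""
    values = [
        int(response["value"])
        for response in record.get("quality") or []
        # Assumption: drafts and discarded responses should not count.
        if response.get("status") == "submitted" and response.get("value") is not None
    ]
    return sum(values) / len(values) if values else None


# The two responses from the example record in this card.
example = {
    "quality": [
        {"status": "submitted", "user_id": "d23b12c2-b601-490e-b5b3-2040eb393a00", "value": "4"},
        {"status": "submitted", "user_id": "e2bdd868-f28e-46fc-9254-a6ec1e291889", "value": "4"},
    ]
}

print(mean_quality(example))  # 4.0
```

A mean is only one possible aggregation; a majority vote or a minimum-agreement filter may suit label-style ratings better.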

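Data Splits

The dataset contains a single split, train.

Dataset Creation

Annotation guidelines

We are collaboratively creating a database of prompts in English. The annotation guidelines below should help you get started, and you can also ask questions in the Discord channel.

Our aim is to identify effective prompts and understand the interaction between AI-generated and human-generated prompts.

The focus is on rating prompts that are clear, interesting and complex, for fine-tuning open source large language models (LLMs).

What makes a prompt good?

That's a difficult question, but here are some aspects:

  • The intent of the user is clear.
  • The question, instruction or task for the assistant is challenging or interesting because it involves solving a complex problem, reasoning, being creative, etc.

In other words, first of all the intent (what the user asks) should be clear. Then we can look into how interesting and complex the task is. The more interesting the prompt is, the higher the rating should be.

Guidelines

You need to assign a rating to each prompt, thinking about the complexity for an assistant and whether the intent is clear. A very good prompt is one that is challenging but also very clear in the intent of the user.

You can use keyboard shortcuts (the numbers) to quickly rate the examples.

If you find some pattern, you can also use the search box and filters as well as the bulk labelling mode. Please use this with care and only when you find a clear pattern (e.g., prompts that are completely incorrect and share a common issue).

If you are unsure about your answer, you can click on the tag and then "Save as a draft" to save it for later. If you feel unequipped to rate a specific prompt, you can use the "Discard" button.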

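Ratings

1. Very Bad:

The prompt doesn't communicate its purpose, is nonsensical or is in a language other than English.

The prompt assumes the usage of tools or capabilities that don't apply to this model, like generating an image or scraping a website.

Examples:

"Do the thing." "Hello!" "asdajflajfada" "Quiero que redactes una entrada de blog." "Extract data from a website." "Tell me how you feel when someone insults you."

2. Bad:

Suggests a goal but lacks clarity and coherence.

Examples:

"Find me stuff about that thing, you know?" "Write something." "Tell me about this thing." "Can you help with this?" "I need to know more."

3. Ok:

The intent is understandable, but it's missing information to complete the task.

Examples:

"I need information on something important." "Write a blogpost."

4. Good:

Presents a clear goal and necessary information, effectively directing the AI, but the prompt could be more specific.

Examples:

"Provide a summary of renewable energy sources." "Tell me about Sean Connery." "Explain global warming."

5. Very Good:

Comprehensive and explicit, leaving no room for ambiguity. Perfectly guides the AI and includes details.

Examples:

"Compare the efficiency and environmental impact of solar and wind energy, including recent advancements and case studies from 2023." "Make a list of 5 plant-based recipes that I can try that don't have red peppers as an ingredient."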

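The question table lists the stored `quality` labels as '0' to '4', while the guidelines number the ratings 1 to 5. The card does not state how the two scales relate. The sketch below assumes the labels map onto the ratings in order ('0' is "1. Very Bad", '4' is "5. Very Good"); this mapping is an assumption to verify against the Argilla configuration, and `describe_label` is a hypothetical helper.

```python
# Assumption (not confirmed by the card): labels "0".."4" map in order
# onto the guideline ratings 1..5.
RATING_NAMES = ["Very Bad", "Bad", "Ok", "Good", "Very Good"]


def describe_label(label: str) -> str:
    """Translate a stored quality label into the guideline rating it is assumed to mean."""
    index = int(label)
    if not 0 <= index < len(RATING_NAMES):
        raise ValueError(f"unexpected quality label: {label!r}")
    return f"{index + 1}. {RATING_NAMES[index]}"


print(describe_label("4"))  # 5. Very Good
```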