
10k-prompt-collective-argilla-format

ModelScope community; updated 2025-12-05; indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/data-is-better-together/10k-prompt-collective-argilla-format
Resource description:
# Dataset Card for 10k-prompt-collective-argilla

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file conforming to the Argilla dataset format named `argilla.yaml`. This configuration file will be used to configure the dataset when using the `FeedbackDataset.from_huggingface` method in Argilla.
* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `FeedbackDataset.from_huggingface` and can be loaded independently using the `datasets` library via `load_dataset`.
* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.

### Load with Argilla

To load with Argilla, you'll just need to install Argilla as `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("DIBT/10k-prompt-collective-argilla")
```

### Load with `datasets`

To load this dataset with `datasets`, you'll just need to install `datasets` as `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("DIBT/10k-prompt-collective-argilla")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#feedback-dataset) so it can be used for different NLP tasks, depending on the configuration.
The dataset structure is described in the [Dataset Structure section](#dataset-structure). There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.

The **fields** are the dataset records themselves; for the moment, only text fields are supported. These are the ones that will be used to provide responses to the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| prompt | Prompt | text | True | True |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| quality | Rate the quality of the prompt | label_selection | True | N/A | ['0', '1', '2', '3', '4'] |

The **suggestions** are human- or machine-generated recommendations for each question that assist the annotator during the annotation process. They are always linked to an existing question and are named by appending "-suggestion" and "-suggestion-metadata" to the question name; these columns contain the value(s) of the suggestion and its metadata, respectively. The possible values are therefore the same as in the table above, with only the column names changed by those suffixes.

The **metadata** is a dictionary that can be used to provide additional information about the dataset record. This can be useful to provide additional context to the annotators, or to provide additional information about the dataset record itself.
For example, you can use this to provide a link to the original source of the dataset record, or to provide additional information about the dataset record itself, such as the author, the date, or the source. The metadata is always optional, and can be potentially linked to the `metadata_properties` defined in the dataset configuration file in `argilla.yaml`.

| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |

The **guidelines** are optional as well, and are just a plain string that can be used to provide instructions to the annotators. Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": null,
    "fields": {
        "prompt": "Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process."
    },
    "metadata": {
        "evolved_from": null,
        "kind": "synthetic",
        "source": "ultrachat"
    },
    "responses": [
        {
            "status": "submitted",
            "user_id": "d23b12c2-b601-490e-b5b3-2040eb393a00",
            "values": {
                "quality": {
                    "value": "4"
                }
            }
        },
        {
            "status": "submitted",
            "user_id": "e2bdd868-f28e-46fc-9254-a6ec1e291889",
            "values": {
                "quality": {
                    "value": "4"
                }
            }
        }
    ],
    "suggestions": [],
    "vectors": {}
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "external_id": null,
    "metadata": "{\"source\": \"ultrachat\", \"kind\": \"synthetic\", \"evolved_from\": null}",
    "prompt": "Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process.",
    "quality": [
        {
            "status": "submitted",
            "user_id": "d23b12c2-b601-490e-b5b3-2040eb393a00",
            "value": "4"
        },
        {
            "status": "submitted",
            "user_id": "e2bdd868-f28e-46fc-9254-a6ec1e291889",
            "value": "4"
        }
    ],
    "quality-suggestion": null,
    "quality-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    }
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves; for the moment, only text fields are supported. These are the ones that will be used to provide responses to the questions.
    * **prompt** is of type `text`.
* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion`, and `RankingQuestion`.
    * **quality** is of type `label_selection` with the following allowed values ['0', '1', '2', '3', '4'].
* **Suggestions:** As of Argilla 1.13.0, suggestions are included to assist the annotators during the annotation process. Suggestions are linked to the existing questions, are always optional, and contain not just the suggestion itself but also the metadata linked to it, if applicable.
    * (optional) **quality-suggestion** is of type `label_selection` with the following allowed values ['0', '1', '2', '3', '4'].

Additionally, there are two more optional fields:

* **metadata:** This is an optional field that can be used to provide additional information about the dataset record. This can be useful to provide additional context to the annotators, or to provide additional information about the dataset record itself.
  For example, you can use this to provide a link to the original source of the dataset record, or to provide additional information about the dataset record itself, such as the author, the date, or the source. The metadata is always optional, and can be potentially linked to the `metadata_properties` defined in the dataset configuration file in `argilla.yaml`.
* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation guidelines

# Task

We are collaboratively creating a database of prompts in English. The annotation guidelines below should help you get started, but you can also ask questions in the [Discord Channel](https://discord.com/channels/879548962464493619/1205128865735770142).

Our aim is to identify effective prompts and understand the interaction between AI-generated and human-generated prompts.

The focus is on rating prompts that are clear, interesting and complex for fine-tuning open source LLMs.

What makes a prompt good?

That's a difficult question, but here are some aspects:

- The intent of the user is clear.
- The question, instruction or task for the assistant is challenging or interesting because it involves solving a complex problem, reasoning, being creative, etc.

In other words, first of all the intent (what the user asks) should be clear. Then we can look into how interesting and complex the task is. The more interesting the prompt is, the higher the rating should be.
## Guidelines

You need to assign a rating to each prompt, considering how complex the task is for an assistant and whether the intent is clear. A very good prompt is one that is challenging but also very clear in the intent of the user.

You can use keyboard shortcuts (the numbers) to quickly rate the examples.

If you find some pattern, you can also use the search box and filters as well as the bulk labelling mode. Please use this with care and only when you find a clear pattern (e.g., prompts that are completely incorrect and share a common issue).

If you are unsure about your answer, you can click on the tag and then "Save as a draft" to save it for later. If you feel unequipped to rate a specific prompt, you can use the "Discard" button.

## Ratings

### 1. Very Bad:

The prompt doesn't communicate its purpose, is nonsensical or is in a language other than English.

The prompt assumes the usage of tools or capabilities that don't apply to this model, like generating an image or scraping a website.

*Examples:*

>"Do the thing."
>"Hello!"
>"asdajflajfada"
>"Quiero que redactes una entrada de blog."
>"Extract data from a website."
>"Tell me how you feel when someone insults you."

### 2. Bad:

Suggests a goal but lacks clarity and coherence.

*Examples:*

>"Find me stuff about that thing, you know?"
>"Write something."
>"Tell me about this thing."
>"Can you help with this?"
>"I need to know more."

### 3. Ok:

The intent is understandable, but it's missing information to complete the task.

*Examples:*

>"I need information on something important."
>"Write a blogpost."

### 4. Good:

Presents a clear goal and necessary information, effectively directing the AI, but the prompt could be more specific.

*Examples:*

>"Provide a summary of renewable energy sources."
>"Tell me about Sean Connery."
>"Explain global warming."

### 5. Very Good:

Comprehensive and explicit, leaving no room for ambiguity. Perfectly guides the AI and includes details.

*Examples:*

>"Compare the efficiency and environmental impact of solar and wind energy, including recent advancements and case studies from 2023."
>"Make a list of 5 plant-based recipes that I can try that don't have red peppers as an ingredient."

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
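
As a usage illustration for the `datasets` record layout shown under Data Instances, the following is a minimal sketch of how one record might be flattened for analysis. It assumes each row arrives as a plain dict with `metadata` stored as a JSON-encoded string and `quality` as a list of per-annotator dicts, exactly as in the example record; the actual row layout returned by `load_dataset` may differ. The helper name `summarize_record` and the `unanimous` flag are hypothetical and not part of Argilla or `datasets`, and the sample prompt is shortened for brevity.

```python
import json
from statistics import mean


def summarize_record(record: dict) -> dict:
    """Flatten one `datasets`-format record into a small summary dict.

    Hypothetical helper: assumes the layout shown in the card's example.
    """
    # `metadata` is stored as a JSON-encoded string, so decode it first.
    raw_metadata = record.get("metadata")
    metadata = json.loads(raw_metadata) if raw_metadata else {}

    # Keep only submitted responses; labels are strings such as "4".
    ratings = [
        int(response["value"])
        for response in (record.get("quality") or [])
        if response.get("status") == "submitted"
    ]

    return {
        "prompt": record["prompt"],
        "source": metadata.get("source"),
        "kind": metadata.get("kind"),
        "num_ratings": len(ratings),
        "mean_quality": mean(ratings) if ratings else None,
        # True when every annotator chose the same label.
        "unanimous": (len(set(ratings)) == 1) if ratings else None,
    }


# Sample mirroring the card's example record (prompt shortened here).
sample = {
    "external_id": None,
    "metadata": '{"source": "ultrachat", "kind": "synthetic", "evolved_from": null}',
    "prompt": "Provide step-by-step instructions on how to make a homemade all-purpose cleaner.",
    "quality": [
        {"status": "submitted", "user_id": "d23b12c2-b601-490e-b5b3-2040eb393a00", "value": "4"},
        {"status": "submitted", "user_id": "e2bdd868-f28e-46fc-9254-a6ec1e291889", "value": "4"},
    ],
    "quality-suggestion": None,
    "quality-suggestion-metadata": {"agent": None, "score": None, "type": None},
}

summary = summarize_record(sample)
print(summary)
```

Averaging the labels treats them as ordinal scores, which is a simplification; a majority vote or a minimum-agreement filter may suit some uses better.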

Provider:
maas
Created:
2025-07-10
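Background and challenges
Background overview
This dataset is built in the Argilla format. It contains an `argilla.yaml` configuration file, records compatible with HuggingFace `datasets`, and optional annotation guidelines. The core of the data is the `prompt` text field, accompanied by one annotation question that rates prompt quality on a 0-4 scale, which makes the dataset usable for a range of NLP tasks. The annotation guidelines describe in detail how to rate prompts according to the clarity of the user's intent and the complexity of the task.
The summary above was collected and generated by 遇见数据集.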