
DIBT/10k_prompts_ranked

Source: Hugging Face. Updated 2024-03-07; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/DIBT/10k_prompts_ranked
Resource description:
---
language:
- en
license: other
size_categories:
- 1K<n<10K
task_categories:
- text-classification
- text-generation
- reinforcement-learning
pretty_name: 10k_prompts_ranked
dataset_info:
  features:
  - name: prompt
    dtype: string
    id: field
  - name: quality
    list:
    - name: user_id
      dtype: string
      id: question
    - name: value
      dtype: string
      id: suggestion
    - name: status
      dtype: string
      id: question
  - name: metadata
    dtype: string
    id: metadata
  - name: avg_rating
    dtype: float64
  - name: num_responses
    dtype: int64
  - name: agreement_ratio
    dtype: float64
  - name: raw_responses
    sequence: int64
  - name: kind
    dtype: string
  - name: cluster_description
    dtype: string
  - name: topic
    dtype: string
  splits:
  - name: train
    num_bytes: 8705892
    num_examples: 10331
  download_size: 3579688
  dataset_size: 8705892
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- preference
- prompts
- argilla
- synthetic
---

# Dataset Card for 10k_prompts_ranked

`10k_prompts_ranked` is a dataset of prompts with quality rankings created by 314 members of the open-source ML community using Argilla, an open-source tool to label data. The prompts in this dataset include both synthetic and human-generated prompts sourced from a variety of heavily used datasets that include prompts.

The dataset contains 10,331 examples and can be used for training and evaluating language models on prompt ranking tasks. The dataset is the output of a novel crowdsourcing effort and can thus also be used to study the behavior of annotators contributing rankings as part of a community effort to rank prompts.

<center>
<div>
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/mj1JOorVwP-LT9POfyJiN.png" width="50%">
</div>
<em>Data is Better Together</em>
</center>

**Want to contribute to the V2 release of this dataset?** You can start rating prompts in a few seconds [here](https://huggingface.co/spaces/DIBT/prompt-collective)

## Dataset Details

This dataset is the first release out of the `Data-is-Better-Together` collective, a project created by [Argilla](https://huggingface.co/argilla) and Hugging Face to explore how Argilla and [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) could be used to collectively create impactful datasets within the community.

The dataset was created by collecting prompts from various existing sources and ranking them using an instance of [Argilla](https://argilla.io/) hosted on a Hugging Face Space with Hugging Face authentication enabled. This allowed anyone with an existing Hugging Face account to very quickly begin contributing to the dataset.

<center>
<a href="https://huggingface.co/spaces/DIBT/prompt-collective">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/SCykTMYyc29kYgv7Frg_-.png" alt="Sign in page for Argilla on Spaces" width="75%"/>
</a>
</center>

### Dataset Description

- **Curated by:** Co-created by Argilla, Hugging Face, and the Prompt Collective community.
- **Language(s) (NLP):** English
- **License:** [More Information Needed]

#### Data Visualization

Click the [Nomic Atlas](https://atlas.nomic.ai/map/475c26d7-b142-4795-9887-02b6eeb18dc0/0d312be6-a3bb-4586-b6b7-53dcd0cbefa5) map below to visualize the distribution of the prompts in the dataset and explore the topics identified in the prompts by Nomic Atlas.

<center>
<a href="https://atlas.nomic.ai/data/hivemind/dibt-10k-prompt-collective/map">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/SGP-N-zjyJwfRJDKpIJe0.png" alt="Nomic-Atlas 10K_prompts_ranked Map" width="75%"/>
</a>
</center>

## Uses

There are many potential uses for this dataset. Key uses include:

- Training and evaluating language models on prompt ranking tasks.
- Filtering the dataset down to high-quality prompts only. These can serve as seed data for generating synthetic prompts and generations.

Beyond this direct use, the dataset is also the output of a novel crowdsourcing effort and can be used to study the behaviour of annotators contributing to datasets as part of a community effort to rank prompts. This includes exploring:

- The distribution of prompt rankings based on the source of the prompt.
- The distribution of prompt rankings based on the prompt's type, length, or other features.
- The agreement of annotators on prompt rankings and the factors that influence agreement, e.g. prompt source, prompt type, prompt length, etc.

### Direct Use

To load the data using the `datasets` library, you can use the following code:

```python
from datasets import load_dataset

# The repository ID includes the DIBT organization namespace.
ds = load_dataset("DIBT/10k_prompts_ranked")
```

### Out-of-Scope Use

This dataset only contains rankings for prompts, not prompt/response pairs, so it is not suitable for direct use in supervised fine-tuning of language models.

## Dataset Structure

A single instance of the dataset looks as follows:

```python
{
    'prompt': 'Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process.',
    'quality': [
        {'user_id': 'd23b12c2-b601-490e-b5b3-2040eb393a00', 'value': '4', 'status': 'submitted'},
        {'user_id': 'e2bdd868-f28e-46fc-9254-a6ec1e291889', 'value': '4', 'status': 'submitted'},
    ],
    'metadata': {'evolved_from': None, 'kind': 'synthetic', 'source': 'ultrachat'},
    'avg_rating': 5.0,
    'num_responses': 2,
    'agreement_ratio': 1.0,
    'raw_responses': [5, 5],
    'kind': 'synthetic',
}
```

The dataset contains the following fields:

- prompt: The prompt to be ranked.
- quality: A list of user rankings for the prompt. Each ranking includes the user_id, the value of the ranking, and the status of the ranking (we only include rankings that have been submitted).
- metadata: Additional information about the prompt including the source of the prompt, whether it was synthetic or human-generated, and whether it was evolved from another prompt.
- avg_rating: The average rating of the prompt.
- num_responses: The number of responses for the prompt.
- agreement_ratio: The agreement ratio for the prompt.
- raw_responses: The raw responses for the prompt by annotators. This can be used to calculate the agreement ratio differently.
- kind: The kind of prompt (synthetic or human-generated).

## Dataset Creation

Version one of the dataset was created in about 3 weeks. The first week involved some prep work and the creation of the Argilla instance. The actual generation of 10,000 prompt rankings was done in two weeks.

### Curation Rationale

The dataset was created to explore how Argilla and Hugging Face Spaces could be used to collectively create impactful datasets within the community. The dataset was also created to provide a high-quality dataset for prompt ranking tasks and to study the behavior of annotators contributing rankings as part of a community effort to rank prompts.

### Source Data

As discussed above, the prompts in this dataset are derived from a variety of heavily used datasets that include prompts. The following table shows the sources of the prompts in the dataset and the number of examples from each source. A `#` in a dataset name indicates the subset of that dataset that was used.

| Dataset                                   | # Examples |
| ----------------------------------------- | ---------- |
| ewof/sharegpt-instruct-unfiltered-deduped | 4,479      |
| evol_instruct                             | 1,381      |
| ultrachat                                 | 1,307      |
| OpenAssistant/oasst2                      | 734        |
| argilla/DistiCoder-dpo-binarized          | 705        |
| flan_v2_cot                               | 360        |
| argilla/distilabel-reasoning-prompts      | 328        |
| argilla/distilabel-evol-prompt-collective | 282        |
| LDJnr/Capybara#Dove                       | 253        |
| ProlificAI/social-reasoning-rlhf          | 145        |
| LDJnr/Capybara#GOAT                       | 123        |
| LDJnr/Capybara#TaskSource                 | 117        |
| LDJnr/Capybara#TheoremQA                  | 88         |
| LDJnr/Capybara#Verified-Camel             | 19         |
| fka/awesome-chatgpt-prompts               | 8          |
| LDJnr/Capybara#Tigerbot                   | 2          |

#### Synthetic vs Human-Generated Prompts

The breakdown of the prompts in the dataset by kind is as follows:

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/mIWyxv1y5-3A54hGv-Re-.png" alt="Breakdown of prompts by kind" width="75%"/>
</center>

The "unknown" kind appears because the source of some prompts in the dataset was not known.

#### Who are the source data producers?

The source datasets used to generate the prompts in this dataset were created by academics, industry researchers, and open-source contributors.

### Annotations

This dataset contains human-generated annotations of prompt quality. Prompts are ranked on a scale of 1-5, with 1 being the lowest quality and 5 being the highest quality. The dataset contains 10,331 examples.

| Number of rankings | Frequency |
| -----------------: | --------: |
|                  1 |     6,730 |
|                  2 |     2,600 |
|                  3 |       748 |
|                  4 |       192 |
|                  5 |        52 |
|                  6 |         5 |
|                  7 |         3 |
|                  8 |         1 |

#### Distribution of ratings across dataset type

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/ttqT8izhSMI-SZ9OS3Rig.png" alt="Distribution of ratings across dataset type" width="75%"/>
</center>

#### Annotation process

The dataset was created by collecting prompts from various sources and then ranking them using an instance of Argilla hosted on a Hugging Face Space with Hugging Face authentication enabled. This allowed anyone with an existing Hugging Face account to rank the prompts.

#### Who are the annotators?

The annotators are 314 Hugging Face community members. We do not have demographic information about the annotators.

#### Personal and Sensitive Information

We are not aware of any personal or sensitive information in the dataset.

## Citation

**BibTeX:**

[More Information Needed]

## Glossary

- **Argilla**: An open source annotation tool focused on methods for efficiently building high-quality datasets for LLMs and other NLP models.
- **Hugging Face Spaces**: A platform for hosting machine learning applications and demos.
- **Synthetic data**: Data that is generated using some computational method (primarily a Large Language Model).
Provided by:
DIBT
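Summary of original information

Dataset Card for 10k_prompts_ranked

Dataset overview

10k_prompts_ranked is a dataset of prompts with quality rankings, created by 314 members of the open-source machine learning community using Argilla. It contains both synthetic and human-generated prompts drawn from a variety of widely used datasets.

Dataset details

The dataset contains 10,331 examples and can be used to train and evaluate language models on prompt ranking tasks. It is the output of a novel crowdsourcing effort and can also be used to study how annotators behave when ranking prompts as part of a community effort.

Dataset description

  • Language (NLP): English
  • License: [More Information Needed]

Data visualization

Click the Nomic Atlas map to visualize the distribution of prompts in the dataset and explore the prompt topics identified by Nomic Atlas.

Uses

The dataset has many potential uses. Key uses include:

  • Training and evaluating language models on prompt ranking tasks.
  • Filtering the dataset down to high-quality prompts only, which can serve as seed data for generating synthetic prompts and generations.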
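
The filtering use case can be sketched in a few lines of Python. This is a minimal illustration, not part of the original card: the function name `filter_high_quality`, the thresholds, and the toy records are all assumptions; only the field names `prompt`, `avg_rating`, and `num_responses` come from the dataset's documented structure.

```python
from typing import Any


def filter_high_quality(
    records: list[dict[str, Any]],
    min_avg_rating: float = 4.0,  # assumed threshold, not prescribed by the card
    min_responses: int = 2,       # assumed threshold, not prescribed by the card
) -> list[str]:
    """Return prompts whose average rating and response count meet the thresholds."""
    return [
        r["prompt"]
        for r in records
        if r["avg_rating"] >= min_avg_rating and r["num_responses"] >= min_responses
    ]


# Hypothetical toy records that mimic the documented fields.
sample = [
    {"prompt": "Explain recursion with an example.", "avg_rating": 4.5, "num_responses": 2},
    {"prompt": "hi", "avg_rating": 1.0, "num_responses": 3},
    {"prompt": "Write a haiku about autumn.", "avg_rating": 5.0, "num_responses": 1},
]

seed_prompts = filter_high_quality(sample)
print(seed_prompts)
```

Requiring at least two responses trades coverage for reliability: most prompts in the dataset have a single ranking, so a stricter `min_responses` keeps far fewer rows.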

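The dataset can also be used to study how annotators behave when ranking prompts as part of a community effort, including:

  • The distribution of prompt rankings by prompt source.
  • The distribution of prompt rankings by prompt type, length, or other features.
  • Annotator agreement on prompt rankings and the factors that influence it, such as prompt source, type, and length.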
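
As a rough sketch of the first analysis, the snippet below averages `avg_rating` per prompt source. It assumes the `metadata` field may arrive either as a dict (as in the card's example instance) or as a JSON string (as the declared `string` dtype suggests); the helper names and the toy records are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean


def parse_metadata(metadata):
    """Accept a dict or a JSON-encoded string and return a dict."""
    if isinstance(metadata, str):
        return json.loads(metadata)
    return metadata or {}


def mean_rating_by_source(records):
    """Group records by metadata['source'] and average their avg_rating."""
    ratings = defaultdict(list)
    for record in records:
        source = parse_metadata(record["metadata"]).get("source", "unknown")
        ratings[source].append(record["avg_rating"])
    return {source: mean(values) for source, values in ratings.items()}


# Hypothetical toy records; the source names are taken from the card's source table.
sample = [
    {"metadata": {"source": "ultrachat"}, "avg_rating": 5.0},
    {"metadata": '{"source": "ultrachat"}', "avg_rating": 3.0},
    {"metadata": {"source": "OpenAssistant/oasst2"}, "avg_rating": 4.0},
]

by_source = mean_rating_by_source(sample)
print(by_source)
```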

Direct use

To load the data with the datasets library, use the following code (the repository ID includes the DIBT namespace):

    from datasets import load_dataset
    ds = load_dataset("DIBT/10k_prompts_ranked")

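Out-of-scope use

The dataset contains only prompt rankings, not prompt/response pairs, so it is not suitable for direct use in supervised fine-tuning of language models.

Dataset structure

A single instance of the dataset looks as follows:

    {'prompt': 'Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process.', 'quality': [{'user_id': 'd23b12c2-b601-490e-b5b3-2040eb393a00', 'value': '4', 'status': 'submitted'}, {'user_id': 'e2bdd868-f28e-46fc-9254-a6ec1e291889', 'value': '4', 'status': 'submitted'}], 'metadata': {'evolved_from': None, 'kind': 'synthetic', 'source': 'ultrachat'}, 'avg_rating': 5.0, 'num_responses': 2, 'agreement_ratio': 1.0, 'raw_responses': [5, 5], 'kind': 'synthetic'}

The dataset contains the following fields:

  • prompt: The prompt to be ranked.
  • quality: A list of user rankings for the prompt. Each ranking includes the user ID, the ranking value, and the ranking status (only submitted rankings are included).
  • metadata: Additional information about the prompt, including its source, whether it is synthetic or human-generated, and whether it was evolved from another prompt.
  • avg_rating: The average rating of the prompt.
  • num_responses: The number of responses for the prompt.
  • agreement_ratio: The agreement ratio for the prompt.
  • raw_responses: The raw annotator responses for the prompt, which can be used to calculate the agreement ratio differently.
  • kind: The kind of prompt (synthetic or human-generated).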
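
The card does not state how `agreement_ratio` is computed, so the following is only one possible alternative derived from `raw_responses`: the share of annotators who gave the most common rating. The function name `majority_share` is an assumption made for this sketch and is not the dataset's own definition.

```python
from collections import Counter
from statistics import mean


def majority_share(raw_responses):
    """Fraction of ratings equal to the most common rating.

    This is an illustrative agreement measure, not the formula
    behind the dataset's agreement_ratio column.
    """
    if not raw_responses:
        raise ValueError("raw_responses must not be empty")
    _, top_count = Counter(raw_responses).most_common(1)[0]
    return top_count / len(raw_responses)


# raw_responses from the example instance above.
example_responses = [5, 5]
example_share = majority_share(example_responses)

# The mean of raw_responses matches the instance's avg_rating of 5.0.
example_mean = mean(example_responses)

# A hypothetical prompt on which annotators disagree.
mixed_share = majority_share([5, 4, 5, 3])
```

For the example instance this measure agrees with the stored `agreement_ratio` of 1.0, but that single match does not show the two definitions coincide in general.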

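Dataset creation

Version one of the dataset was created in about 3 weeks. The first week involved some prep work and the creation of the Argilla instance; the actual generation of 10,000 prompt rankings took two weeks.

Curation rationale

The dataset was created to explore how Argilla and Hugging Face Spaces could be used to collectively create impactful datasets within the community. It was also created to provide a high-quality dataset for prompt ranking tasks and to study how annotators behave when ranking prompts as part of a community effort.

Source data

The prompts in the dataset come from a variety of widely used datasets. The sources and the number of examples from each are as follows:

Dataset / Number of examples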
ewof/sharegpt-instruct-unfiltered-deduped 4,479
evol_instruct 1,381
ultrachat 1,307
OpenAssistant/oasst2 734
argilla/DistiCoder-dpo-binarized 705
flan_v2_cot 360
argilla/distilabel-reasoning-prompts 328
argilla/distilabel-evol-prompt-collective 282
LDJnr/Capybara#Dove 253
ProlificAI/social-reasoning-rlhf 145
LDJnr/Capybara#GOAT 123
LDJnr/Capybara#TaskSource 117
LDJnr/Capybara#TheoremQA 88
LDJnr/Capybara#Verified-Camel 19
fka/awesome-chatgpt-prompts 8
LDJnr/Capybara#Tigerbot 2
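
As a quick arithmetic check on the table above, the per-source counts can be summed and converted to shares. The counts are copied from the table; the variable names are just for this sketch.

```python
# Per-source example counts, copied from the source table.
source_counts = {
    "ewof/sharegpt-instruct-unfiltered-deduped": 4479,
    "evol_instruct": 1381,
    "ultrachat": 1307,
    "OpenAssistant/oasst2": 734,
    "argilla/DistiCoder-dpo-binarized": 705,
    "flan_v2_cot": 360,
    "argilla/distilabel-reasoning-prompts": 328,
    "argilla/distilabel-evol-prompt-collective": 282,
    "LDJnr/Capybara#Dove": 253,
    "ProlificAI/social-reasoning-rlhf": 145,
    "LDJnr/Capybara#GOAT": 123,
    "LDJnr/Capybara#TaskSource": 117,
    "LDJnr/Capybara#TheoremQA": 88,
    "LDJnr/Capybara#Verified-Camel": 19,
    "fka/awesome-chatgpt-prompts": 8,
    "LDJnr/Capybara#Tigerbot": 2,
}

total = sum(source_counts.values())
shares = {name: count / total for name, count in source_counts.items()}

print(f"total examples: {total}")
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:45s} {share:6.1%}")
```

The counts sum to 10,331, matching the dataset size stated in the card, and the ShareGPT-derived source alone accounts for roughly 43% of the prompts.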

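Synthetic vs human-generated prompts

The prompts in the dataset fall into the following kinds:

  • Synthetic prompts
  • Human-generated prompts
  • Unknown kind

Annotators

The annotators are 314 Hugging Face community members. We do not have demographic information about the annotators.

Personal and sensitive information

We are not aware of any personal or sensitive information in the dataset.

Citation

BibTeX: [More Information Needed]

Glossary

  • Argilla: An open-source annotation tool focused on efficiently building high-quality datasets for LLMs and other NLP models.
  • Hugging Face Spaces: A platform for hosting machine learning applications and demos.
  • Synthetic data: Data generated using some computational method (primarily a large language model).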
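Collected summary
Background overview
This is a quality-ranking dataset of about 10,000 prompts, annotated by 314 open-source community members using Argilla, with each prompt rated for quality on a scale of 1-5. It combines synthetic and human-generated prompts from multiple sources, is mainly intended for training and evaluating language models on prompt ranking, and also supports research on community annotation behavior.
The above was collected and summarized by 遇见数据集.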