10k_prompts_ranked
Source: ModelScope community · Updated 2025-11-27 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/data-is-better-together/10k_prompts_ranked
Resource description:
# Dataset Card for 10k_prompts_ranked
`10k_prompts_ranked` is a dataset of prompts with quality rankings created by 314 members of the open-source ML community using Argilla, an open-source tool to label data. The prompts in this dataset include both synthetic and human-generated prompts sourced from a variety of heavily used datasets that include prompts.
The dataset contains 10,331 examples and can be used for training and evaluating language models on prompt ranking tasks. The dataset is the output of a novel crowdsourcing effort and can thus also be used to study the behavior of annotators contributing rankings as part of a community effort to rank prompts.
<center>
<div>
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/mj1JOorVwP-LT9POfyJiN.png" width="50%">
</div>
<em>Data is Better Together</em>
</center>
**Want to contribute to the V2 release of this dataset?** You can start rating prompts in a few seconds [here](https://huggingface.co/spaces/DIBT/prompt-collective).
## Dataset Details
This dataset is the first release from the `Data-is-Better-Together` collective, a project created by [Argilla](https://huggingface.co/argilla) and Hugging Face to explore how Argilla and [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) could be used to collectively create impactful datasets within the community.
The dataset was created by collecting prompts from various existing sources and ranking them using an instance of [Argilla](https://argilla.io/) hosted on a Hugging Face Space with Hugging Face authentication enabled. This allowed anyone with an existing Hugging Face account to very quickly begin contributing to the dataset.
<center>
<a href="https://huggingface.co/spaces/DIBT/prompt-collective">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/SCykTMYyc29kYgv7Frg_-.png" alt="Sign in page for Argilla on Spaces" width="75%"/></a>
</center>
### Dataset Description
- **Curated by:** Co-created by Argilla, Hugging Face, and the Prompt Collective community.
- **Language(s) (NLP):** English
- **License:** [More Information Needed]
#### Data Visualization
Click the [Nomic Atlas](https://atlas.nomic.ai/map/475c26d7-b142-4795-9887-02b6eeb18dc0/0d312be6-a3bb-4586-b6b7-53dcd0cbefa5) map below to visualize the distribution of the prompts in the dataset and explore the topics identified in the prompts by Nomic Atlas.
<center>
<a href="https://atlas.nomic.ai/data/hivemind/dibt-10k-prompt-collective/map">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/SGP-N-zjyJwfRJDKpIJe0.png" alt="Nomic-Atlas 10K_prompts_ranked Map" width="75%"/>
</a>
</center>
## Uses
There are many potential uses for this dataset. Key uses include:
- Training and evaluating language models on prompt ranking tasks.
- As a dataset that can be filtered to include only high-quality prompts. These can serve as seed data for generating synthetic prompts and generations.
Beyond this direct use, the dataset is also the output of a novel crowdsourcing effort and can be used to study the behaviour of annotators contributing to datasets as part of a community effort to rank prompts. This includes exploring:
- The distribution of prompt rankings based on the source of the prompt.
- The distribution of prompt rankings based on the prompt's type, length, or other features.
- The agreement of annotators on prompt rankings and the factors that influence agreement, e.g. prompt source, prompt type, and prompt length.
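As a rough illustration of the first two questions, the sketch below groups rows by a metadata key and averages their `avg_rating`. It works on plain dictionaries shaped like the dataset rows; the helper name `mean_rating_by` and the sample values (including the `human` kind label) are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by(records, key):
    """Group rows by a metadata key (e.g. 'source' or 'kind') and average avg_rating."""
    groups = defaultdict(list)
    for record in records:
        # Rows without the key are collected under "unknown".
        group = (record.get("metadata") or {}).get(key) or "unknown"
        groups[group].append(record["avg_rating"])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical rows: only the field names follow the dataset card.
sample = [
    {"avg_rating": 5.0, "metadata": {"source": "ultrachat", "kind": "synthetic"}},
    {"avg_rating": 3.0, "metadata": {"source": "ultrachat", "kind": "synthetic"}},
    {"avg_rating": 4.0, "metadata": {"source": "OpenAssistant/oasst2", "kind": "human"}},
]

print(mean_rating_by(sample, "source"))
print(mean_rating_by(sample, "kind"))
```

The same function can be applied to the real rows once the dataset is loaded, since each row is a dictionary with these fields.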
### Direct Use
To load the data using the `datasets` library, you can use the following code:
```python
from datasets import load_dataset
# The Hub repository ID includes the organisation namespace
ds = load_dataset("data-is-better-together/10k_prompts_ranked")
```
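Once loaded, the rows can be narrowed down to high-quality prompts. The following is a minimal sketch of a filter predicate; the name `is_high_quality` and the thresholds (average rating of at least 4, at least two responses) are illustrative assumptions, not values recommended by the dataset authors.

```python
def is_high_quality(example, min_rating=4.0, min_responses=2):
    """Return True when a row meets both thresholds (assumed values, tune as needed)."""
    return (
        example["avg_rating"] >= min_rating
        and example["num_responses"] >= min_responses
    )


# Values taken from the example instance shown in this card.
row = {"avg_rating": 5.0, "num_responses": 2}
print(is_high_quality(row))

# With the `datasets` library, the same predicate can be passed to `Dataset.filter`,
# e.g. ds["train"].filter(is_high_quality), assuming the split is named "train".
```

Requiring more than one response trades dataset size for ratings that are backed by several annotators.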
### Out-of-Scope Use
This dataset contains only rankings for prompts, not prompt/response pairs, so it is not suitable for direct use in supervised fine-tuning of language models.
## Dataset Structure
A single instance of the dataset looks as follows:
```python
{'prompt': 'Provide step-by-step instructions on how to make a safe and effective homemade all-purpose cleaner from common household ingredients. The guide should include measurements, tips for storing the cleaner, and additional variations or scents that can be added. Additionally, the guide should be written in clear and concise language, with helpful visuals or photographs to aid in the process.',
 'quality': [{'user_id': 'd23b12c2-b601-490e-b5b3-2040eb393a00',
              'value': '4',
              'status': 'submitted'},
             {'user_id': 'e2bdd868-f28e-46fc-9254-a6ec1e291889',
              'value': '4',
              'status': 'submitted'}],
 'metadata': {'evolved_from': None,
              'kind': 'synthetic',
              'source': 'ultrachat'},
 'avg_rating': 5.0,
 'num_responses': 2,
 'agreement_ratio': 1.0,
 'raw_responses': [5, 5],
 'kind': 'synthetic'}
```
The dataset contains the following fields:
- `prompt`: The prompt to be ranked.
- `quality`: A list of user rankings for the prompt. Each ranking includes the `user_id`, the `value` of the ranking, and the `status` of the ranking (we only include rankings that have been submitted).
- `metadata`: Additional information about the prompt including the source of the prompt, whether it was synthetic or human-generated, and whether it was evolved from another prompt.
- `avg_rating`: The average rating of the prompt.
- `num_responses`: The number of responses for the prompt.
- `agreement_ratio`: The agreement ratio for the prompt.
- `raw_responses`: The raw responses for the prompt by annotators. This can be used to calculate the agreement ratio differently.
- `kind`: The kind of prompt (synthetic or human-generated).
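As one example of recomputing agreement from the raw responses, the sketch below measures the share of annotators who chose the most common rating. This is an alternative measure chosen for illustration; it is not necessarily how the dataset's own agreement ratio field was computed, and the function names are hypothetical.

```python
from collections import Counter


def modal_agreement(raw_responses):
    """Share of annotators who gave the most common rating (None if there are no ratings)."""
    if not raw_responses:
        return None
    top_count = Counter(raw_responses).most_common(1)[0][1]
    return top_count / len(raw_responses)


def rating_spread(raw_responses):
    """Difference between the highest and lowest rating, a crude disagreement signal."""
    if not raw_responses:
        return None
    return max(raw_responses) - min(raw_responses)


print(modal_agreement([5, 5]))        # the example instance above: both annotators agree
print(modal_agreement([5, 4, 5, 3]))  # two of four annotators chose the modal rating
print(rating_spread([5, 4, 5, 3]))
```

With a single response the modal agreement is trivially 1.0, so this measure is only informative for prompts with two or more ratings.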
## Dataset Creation
Version one of the dataset was created in about 3 weeks. The first week involved some prep work and the creation of the Argilla instance. The actual generation of 10,000 prompt rankings was done in two weeks.
### Curation Rationale
The dataset was created to explore how Argilla and Hugging Face Spaces could be used to collectively create impactful datasets within the community. It was also created to provide a high-quality dataset for prompt ranking tasks and to study the behavior of annotators contributing rankings as part of a community effort to rank prompts.
### Source Data
As discussed above, the prompts in this dataset are derived from a variety of heavily used datasets that include prompts. The following table shows the sources of the prompts and the number of examples from each source. A `#` in a dataset name marks the subset of that dataset that was used.
| Dataset | # Examples |
| ----------------------------------------- | ---------- |
| ewof/sharegpt-instruct-unfiltered-deduped | 4,479 |
| evol_instruct | 1,381 |
| ultrachat | 1,307 |
| OpenAssistant/oasst2 | 734 |
| argilla/DistiCoder-dpo-binarized | 705 |
| flan_v2_cot | 360 |
| argilla/distilabel-reasoning-prompts | 328 |
| argilla/distilabel-evol-prompt-collective | 282 |
| LDJnr/Capybara#Dove | 253 |
| ProlificAI/social-reasoning-rlhf | 145 |
| LDJnr/Capybara#GOAT | 123 |
| LDJnr/Capybara#TaskSource | 117 |
| LDJnr/Capybara#TheoremQA | 88 |
| LDJnr/Capybara#Verified-Camel | 19 |
| fka/awesome-chatgpt-prompts | 8 |
| LDJnr/Capybara#Tigerbot | 2 |
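The counts in the table can be checked and aggregated with a few lines of Python. The sketch below copies the numbers from the table, confirms that they add up to the 10,331 examples, and merges the `#` subsets into their parent dataset; the variable names are illustrative.

```python
from collections import defaultdict

# Counts copied from the source table.
source_counts = {
    "ewof/sharegpt-instruct-unfiltered-deduped": 4479,
    "evol_instruct": 1381,
    "ultrachat": 1307,
    "OpenAssistant/oasst2": 734,
    "argilla/DistiCoder-dpo-binarized": 705,
    "flan_v2_cot": 360,
    "argilla/distilabel-reasoning-prompts": 328,
    "argilla/distilabel-evol-prompt-collective": 282,
    "LDJnr/Capybara#Dove": 253,
    "ProlificAI/social-reasoning-rlhf": 145,
    "LDJnr/Capybara#GOAT": 123,
    "LDJnr/Capybara#TaskSource": 117,
    "LDJnr/Capybara#TheoremQA": 88,
    "LDJnr/Capybara#Verified-Camel": 19,
    "fka/awesome-chatgpt-prompts": 8,
    "LDJnr/Capybara#Tigerbot": 2,
}

total = sum(source_counts.values())

# Merge subsets ("dataset#subset") into their parent dataset.
by_parent = defaultdict(int)
for name, count in source_counts.items():
    by_parent[name.split("#", 1)[0]] += count

print(total)                        # 10331
print(by_parent["LDJnr/Capybara"])  # 602
for name, count in sorted(by_parent.items(), key=lambda item: -item[1]):
    print(f"{name}: {count} ({count / total:.1%})")
```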
#### Synthetic vs Human-Generated Prompts
The breakdown of the prompts in the dataset by kind is as follows:
<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/mIWyxv1y5-3A54hGv-Re-.png" alt="Breakdown of prompts by kind" width="75%"/>
</center>
Prompts are labeled "unknown" when the source of the prompt was not known.
#### Who are the source data producers?
The source datasets used to generate the prompts in this dataset were created by academics, industry researchers, and open-source contributors.
### Annotations
This dataset contains human-generated annotations of prompt quality. Prompts are ranked on a scale of 1-5, with 1 being the lowest quality and 5 being the highest quality. The dataset contains 10,331 examples. The table below shows how many prompts received each number of rankings.
| Number of rankings per prompt | Number of prompts |
| ----------------------------: | ----------------: |
| 1 | 6,730 |
| 2 | 2,600 |
| 3 | 748 |
| 4 | 192 |
| 5 | 52 |
| 6 | 5 |
| 7 | 3 |
| 8 | 1 |
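A short calculation on the figures in this table shows how much of the dataset supports agreement analysis, since agreement between annotators can only be measured on prompts with at least two rankings. The variable names below are illustrative.

```python
# Number of rankings per prompt -> number of prompts, copied from the table.
rankings_per_prompt = {1: 6730, 2: 2600, 3: 748, 4: 192, 5: 52, 6: 5, 7: 3, 8: 1}

total_prompts = sum(rankings_per_prompt.values())
total_rankings = sum(n * count for n, count in rankings_per_prompt.items())
multi_ranked = total_prompts - rankings_per_prompt[1]

print(total_prompts)   # 10331
print(total_rankings)  # 15261
print(multi_ranked)    # 3601
print(f"Share of prompts with 2+ rankings: {multi_ranked / total_prompts:.1%}")
print(f"Mean rankings per prompt: {total_rankings / total_prompts:.2f}")
```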
#### Distribution of ratings across dataset type
<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/ttqT8izhSMI-SZ9OS3Rig.png" alt="Distribution of ratings across dataset type" width="75%"/>
</center>
#### Annotation process
The dataset was created by collecting prompts from various sources and then ranking them using an instance of Argilla hosted on a Hugging Face Space with Hugging Face authentication enabled. This allowed anyone with an existing Hugging Face account to rank the prompts.
#### Who are the annotators?
The annotators are 314 Hugging Face community members. We do not have demographic information about the annotators.
#### Personal and Sensitive Information
We are not aware of any personal or sensitive information in the dataset.
## Citation
**BibTeX:**
[More Information Needed]
## Glossary
- **Argilla**: An open-source annotation tool focused on methods for efficiently building high-quality datasets for LLMs and other NLP models.
- **Hugging Face Spaces**: A platform for hosting machine learning applications and demos.
- **Synthetic data**: Data that is generated using some computational method (primarily a Large Language Model).
Provider:
maas
Created:
2025-07-10



