rfm-rm-as-user-dataset
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/google/rfm-rm-as-user-dataset
# RFM Reward Model As User Dataset
This dataset was generated for the NeurIPS 2025 paper titled ["Capturing Individual Human Preferences with Reward Features"](https://arxiv.org/abs/2503.17338). It is released to support the reproducibility of the experiments described in the paper, particularly those in the "Modelling groups of real users" section.
Instead of containing preferences from human raters, this dataset uses 8 publicly available reward models (RMs) as proxies for human raters. This allows for large-scale research into preference heterogeneity and adaptive reward modeling.
## Dataset Description
The dataset is built using prompts and responses from the [UltraFeedback dataset](https://huggingface.co/datasets/allenai/ultrafeedback_binarized). The preference scores and rankings are generated by the 8 public reward models listed below.
Each split carries scores from all 8 "rater" models. The splits described here are:
1. **Train:** Contains 60,819 prompts and their corresponding responses from the UltraFeedback training set. *Note: The original dataset contained duplicated `prompt_id` values, which were removed during processing, so the number of rows is slightly lower than in the original dataset.*
2. **Test:** Contains 985 prompts and their corresponding responses from the UltraFeedback test set.
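
The deduplication mentioned in the note above could look roughly like the following. This is a minimal sketch, not the authors' actual procedure: the source does not say which of the duplicated rows was kept, so keeping the first occurrence is an assumption, and `drop_duplicate_prompts` is a hypothetical helper name.

```python
def drop_duplicate_prompts(rows, id_key="prompt_id"):
    """Keep the first row seen for each prompt ID and drop later repeats.

    Assumption: "first occurrence wins". The dataset card does not state
    which duplicate was retained.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        if row[id_key] in seen:
            continue  # a row with this prompt_id was already kept
        seen.add(row[id_key])
        unique_rows.append(row)
    return unique_rows


# Toy rows standing in for the raw UltraFeedback training split.
raw_rows = [
    {"prompt_id": "a", "prompt": "first prompt"},
    {"prompt_id": "b", "prompt": "second prompt"},
    {"prompt_id": "a", "prompt": "first prompt"},
]
deduped_rows = drop_duplicate_prompts(raw_rows)
print(len(raw_rows), len(deduped_rows))  # prints "3 2"
```

The same check (`len(rows)` versus the number of distinct `prompt_id` values) is a quick way to confirm that a downloaded training file is already deduplicated.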
### Rater Models
The following 8 publicly available reward models from Hugging Face were used to generate the preference scores in this dataset:
* [`OpenAssistant_reward-model-deberta-v3-large-v2`](https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2)
* [`weqweasdas_RM-Mistral-7B`](https://huggingface.co/weqweasdas/RM-Mistral-7B)
* [`OpenAssistant_oasst-rm-2.1-pythia-1.4b-epoch-2.5`](https://huggingface.co/OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5)
* [`Ray2333_GRM-Gemma-2B-sftreg`](https://huggingface.co/Ray2333/GRM-Gemma-2B-sftreg)
* [`Ray2333_reward-model-Mistral-7B-instruct-Unified-Feedback`](https://huggingface.co/Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback)
* [`weqweasdas_RM-Gemma-7B`](https://huggingface.co/weqweasdas/RM-Gemma-7B)
* [`internlm_internlm2-7b-reward`](https://huggingface.co/internlm/internlm2-7b-reward)
* [`openbmb_Eurus-RM-7b`](https://huggingface.co/openbmb/Eurus-RM-7b)
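
The names in the list above match the Hugging Face repository IDs with `/` replaced by `_`. The sketch below builds the expected score column names for each rater under two assumptions that the card does not confirm explicitly: that the `{model_name}` placeholder in the column schema is filled with exactly these underscore-joined names, and that the score column for the second response is named `response1_score_{model_name}`. The helper names `rater_name` and `score_columns` are hypothetical.

```python
# Hugging Face repository IDs of the 8 rater models listed above.
RATER_REPO_IDS = [
    "OpenAssistant/reward-model-deberta-v3-large-v2",
    "weqweasdas/RM-Mistral-7B",
    "OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5",
    "Ray2333/GRM-Gemma-2B-sftreg",
    "Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback",
    "weqweasdas/RM-Gemma-7B",
    "internlm/internlm2-7b-reward",
    "openbmb/Eurus-RM-7b",
]


def rater_name(repo_id):
    """Turn 'org/model' into 'org_model', as the names appear in the list."""
    return repo_id.replace("/", "_")


def score_columns(repo_id):
    """Return the assumed (response0, response1) score column names."""
    name = rater_name(repo_id)
    return f"response0_score_{name}", f"response1_score_{name}"


for repo_id in RATER_REPO_IDS:
    print(score_columns(repo_id))
```

Checking these generated names against the header row of the downloaded CSV is a cheap way to validate both assumptions before doing any analysis.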
## Data Structure and Schema
All files are provided in CSV format.
### File Naming Convention
* `merged_dedup_reward_model_as_user_train.csv`: The training split of the deduplicated UltraFeedback dataset.
* `merged_reward_model_as_user_test.csv`: The test split of the UltraFeedback dataset.
### Column Schema
The CSV files contain the following columns:
| Column | Description |
| :--- | :--- |
| `prompt_id` | The prompt ID from the original UltraFeedback dataset. |
| `prompt` | The text prompt used to generate the responses. Sourced from UltraFeedback. |
| `response0` | The first candidate response. Sourced from UltraFeedback (for train/test files). |
| `response1` | The second candidate response. Sourced from UltraFeedback (for train/test files). |
| `response0_score_{model_name}` | The numerical score assigned to the `prompt` + `response0` pair by the `model_name` RM. |
| `response1_score_{model_name}` | The numerical score assigned to the `prompt` + `response1` pair by the `model_name` RM. |
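
Since each rater contributes one score per response, a per-rater preference can be derived by comparing the two scores in a row, and disagreement between raters can then be measured directly. The sketch below is illustrative only: it assumes the second score column is named `response1_score_{model_name}`, uses an inline toy CSV with two made-up raters (`rm_a`, `rm_b`) instead of the real files, and breaks ties in favour of `response0`, none of which is specified by the dataset card.

```python
import csv
import io
from itertools import combinations

SCORE_PREFIX = "response0_score_"


def rater_names(fieldnames):
    """Recover rater names from the response0 score columns in the header."""
    return [c[len(SCORE_PREFIX):] for c in fieldnames if c.startswith(SCORE_PREFIX)]


def preferred_response(row, rater):
    """Return 1 if the rater scores response1 higher, else 0 (ties go to 0)."""
    score0 = float(row[f"response0_score_{rater}"])
    score1 = float(row[f"response1_score_{rater}"])
    return 1 if score1 > score0 else 0


def agreement_rates(rows, raters):
    """Fraction of rows on which each pair of raters prefers the same response."""
    rates = {}
    for a, b in combinations(raters, 2):
        same = sum(preferred_response(r, a) == preferred_response(r, b) for r in rows)
        rates[(a, b)] = same / len(rows)
    return rates


# Toy data with hypothetical raters; replace with open("<file>.csv", newline="")
# to read one of the real splits.
TOY_CSV = """prompt_id,prompt,response0,response1,response0_score_rm_a,response1_score_rm_a,response0_score_rm_b,response1_score_rm_b
p1,q1,x,y,0.2,0.9,1.5,-0.3
p2,q2,x,y,0.1,0.8,-1.0,2.0
"""
reader = csv.DictReader(io.StringIO(TOY_CSV))
rows = list(reader)
raters = rater_names(reader.fieldnames)
rates = agreement_rates(rows, raters)
print(rates)  # {('rm_a', 'rm_b'): 0.5}
```

Raw scores from different reward models are generally on different scales, so comparing preferences (which response wins) rather than score values is the safer basis for measuring heterogeneity across raters.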
## License
This dataset is licensed under the **CC-BY 4.0 License** (Creative Commons Attribution 4.0 International).
## Citation
If you use this dataset in your research, please cite the original paper:
```bibtex
@inproceedings{barreto2025capturing,
title={Capturing Individual Human Preferences with Reward Features},
author={Andre Barreto and Vincent Dumoulin and Yiran Mao and Mark Rowland and Nicolas Perez-Nieves and Bobak Shahriari and Yann Dauphin and Doina Precup and Hugo Larochelle},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}
```
Provider: maas
Created: 2025-10-30



