SamWang0405/FinRAG-GRPO
Source: Hugging Face | Updated 2026-03-19 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/SamWang0405/FinRAG-GRPO
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- zh
pretty_name: FinRAG-GRPO Preference Dataset
size_categories:
- 1K<n<10K
tags:
- reward-model
- rlhf
- grpo
- preference-data
- customer-service
- reasoning
- synthetic
configs:
- config_name: train_zh
  data_files:
  - split: train
    path: datasets/train_zh.jsonl
- config_name: test_zh
  data_files:
  - split: test
    path: datasets/test_zh.jsonl
- config_name: train_with_sys_zh
  data_files:
  - split: train
    path: datasets/train_with_sys_zh.jsonl
- config_name: test_with_sys_zh
  data_files:
  - split: test
    path: datasets/test_with_sys_zh.jsonl
---
# FinRAG-GRPO Preference Dataset
A Chinese-language preference dataset for training **Reasoning Reward Models (ReasRM)** via GRPO-based reinforcement learning.
> 🚧 This dataset is actively maintained and will be expanded with additional domains and languages over time.
---
## Dataset Summary
This dataset contains pairwise preference samples designed to train a reward model that **reasons before judging**: the model first generates an evaluation rationale and then outputs a preference label (`<answer>[[A]]</answer>` or `<answer>[[B]]</answer>`).
The current release focuses on **Chinese e-commerce customer service** scenarios, comparing responses across dimensions such as empathy, problem resolution, and communication tone.
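
The answer-tag format described above lends itself to a simple rule-based check. The following is a minimal sketch, not the author's training code: the function names, the letter-to-`winner` mapping, and the 1.0/0.0 reward values are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches a verdict tag such as <answer>[[A]]</answer>.
ANSWER_PATTERN = re.compile(r"<answer>\s*\[\[([AB])\]\]\s*</answer>")

# Assumed mapping from the verdict letter to the `winner` field values.
LETTER_TO_WINNER = {"A": "model_a", "B": "model_b"}


def extract_verdict(completion: str) -> Optional[str]:
    """Return 'model_a' or 'model_b' from the last answer tag, or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return None
    return LETTER_TO_WINNER[matches[-1]]


def verdict_reward(completion: str, winner: str) -> float:
    """Hypothetical reward: 1.0 for a correct verdict, 0.0 for a wrong or missing one."""
    return 1.0 if extract_verdict(completion) == winner else 0.0
```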
---
## Dataset Structure
### Files
| File | Split | Samples | Description |
|------|-------|------|-------------|
| `train_zh.jsonl` | Train | 3,000 | Training set, no system prompt |
| `test_zh.jsonl` | Test | 400 | Test set, no system prompt |
| `train_with_sys_zh.jsonl` | Train | 3,000 | Training set with system prompt injected |
| `test_with_sys_zh.jsonl` | Test | 400 | Test set with system prompt injected |
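
Each file in the table corresponds to a config declared in the card metadata, so a split can be selected by config name. The sketch below assumes the `datasets` library is installed and the repository is reachable; `config_name` and `load_split` are hypothetical helper names, not part of the dataset.

```python
def config_name(split: str, with_system_prompt: bool = False) -> str:
    """Build the config name used in the card metadata, e.g. 'train_with_sys_zh'."""
    if split not in ("train", "test"):
        raise ValueError(f"unknown split: {split!r}")
    return f"{split}_with_sys_zh" if with_system_prompt else f"{split}_zh"


def load_split(split: str, with_system_prompt: bool = False):
    """Load one split from the Hub (requires network access)."""
    # Imported lazily so config_name() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "SamWang0405/FinRAG-GRPO",
        config_name(split, with_system_prompt),
        split=split,
    )
```

For example, `load_split("train", with_system_prompt=True)` would request the `train_with_sys_zh` config.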
### Data Format
Each record contains:
```json
{
  "context_messages": [
    {"role": "system", "content": "...evaluation rubric instructions..."},
    {"role": "user", "content": "[客户问题]...[客服A]...[客服B]..."}
  ],
  "winner": "model_a | model_b"
}
```
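- `context_messages`: A list of chat messages in the standard role/content format, compatible with the Hugging Face `apply_chat_template` method. The `user` message holds the customer question (`[客户问题]`) and the two candidate replies (`[客服A]`, `[客服B]`); the `system` message appears in the `*_with_sys_zh` files.
- `winner`: Ground-truth preference label (`model_a` or `model_b`)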
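
A record in this format can be sanity-checked before training. This is a minimal sketch using only the fields shown above; the helper names are hypothetical, and the set of allowed roles is an assumption based on the example record.

```python
import json

# Assumption: only these roles occur, as in the example record.
VALID_ROLES = {"system", "user"}
VALID_WINNERS = {"model_a", "model_b"}


def validate_record(record: dict) -> None:
    """Raise ValueError if the record does not match the documented format."""
    messages = record.get("context_messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("context_messages must be a non-empty list")
    for message in messages:
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("message content must be a string")
    if record.get("winner") not in VALID_WINNERS:
        raise ValueError(f"unexpected winner: {record.get('winner')!r}")


def parse_jsonl_line(line: str) -> dict:
    """Parse one JSONL line and validate it."""
    record = json.loads(line)
    validate_record(record)
    return record
```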
---
## Construction
### Scenarios
The data covers 15 e-commerce customer service categories, including logistics delays, quality complaints, returns and refunds, payment issues, account problems, and order cancellations.
### Bias Mitigation
- **Position bias**: A/B responses are randomly swapped (50% probability) with labels updated accordingly
- **Length bias**: Length strategies are randomized so the better response is not always longer
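
The position-bias step can be illustrated as follows. This is a sketch of the general technique, not the author's construction script; `swap_positions` and its parameters are hypothetical names.

```python
import random
from typing import Tuple


def swap_positions(
    response_a: str,
    response_b: str,
    winner: str,
    rng: random.Random,
    swap_probability: float = 0.5,
) -> Tuple[str, str, str]:
    """Swap the two responses with the given probability and flip the label to match."""
    if rng.random() < swap_probability:
        flipped = "model_b" if winner == "model_a" else "model_a"
        return response_b, response_a, flipped
    return response_a, response_b, winner


# Example: a seeded generator keeps the shuffling reproducible.
rng = random.Random(42)
a, b, label = swap_positions("better reply", "worse reply", "model_a", rng)
```

Whichever branch is taken, the label still points at the better reply, which is what keeps the swap label-preserving.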
### Label Distribution
- `model_a`: ~48.5%
- `model_b`: ~51.5%
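
The label shares above can be recomputed from the loaded records. A minimal sketch, assuming each record is a dict with a `winner` field; `label_distribution` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable


def label_distribution(records: Iterable[dict]) -> Dict[str, float]:
    """Return the fraction of records per `winner` label."""
    counts = Counter(record["winner"] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}
```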
---
## Intended Use
This dataset is designed for:
- GRPO / PPO reinforcement learning fine-tuning of LLMs as reward models
- Preference modeling and pairwise ranking tasks
- Research on reasoning-augmented reward models (ReasRM)
### Training Framework
Compatible with a [veRL](https://github.com/volcengine/verl) training pipeline that uses vLLM for rollouts.
---
## Limitations
- Current release is **single-domain** (customer service only); cross-domain generalization is not guaranteed
- Labels are generated by a single LLM teacher model, which may introduce systematic biases
- The current version contains no hard negatives (cases where both responses are similarly good or similarly bad)
---
## Roadmap
- [ ] Add financial domain preference data (RAG Q&A evaluation)
- [ ] Add English version
- [ ] Add hard negative samples
- [ ] Add multi-turn conversation samples
---
## Citation
If you use this dataset, please cite:
```bibtex
@misc{wang2026finrag-grpo,
  author    = {Chaoyu Wang},
  title     = {FinRAG-GRPO Preference Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/SamWang0405/FinRAG-GRPO}
}
```
---
## License
MIT License
Provider: SamWang0405



