SamWang0405/FinRAG-GRPO

Name: SamWang0405/FinRAG-GRPO
Creator: SamWang0405
Published: 2026-03-19 23:16:55
License: MIT

Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/SamWang0405/FinRAG-GRPO


Description:

---
license: mit
task_categories:
- text-classification
language:
- zh
pretty_name: FinRAG-GRPO Preference Dataset
size_categories:
- 1K<n<10K
tags:
- reward-model
- rlhf
- grpo
- preference-data
- customer-service
- reasoning
- synthetic
configs:
- config_name: train_zh
  data_files:
  - split: train
    path: datasets/train_zh.jsonl
- config_name: test_zh
  data_files:
  - split: test
    path: datasets/test_zh.jsonl
- config_name: train_with_sys_zh
  data_files:
  - split: train
    path: datasets/train_with_sys_zh.jsonl
- config_name: test_with_sys_zh
  data_files:
  - split: test
    path: datasets/test_with_sys_zh.jsonl
---

# FinRAG-GRPO Preference Dataset

A Chinese-language preference dataset for training **Reasoning Reward Models (ReasRM)** via GRPO-based reinforcement learning.

> 🚧 This dataset is actively maintained and will be expanded with additional domains and languages over time.

---

## Dataset Summary

This dataset contains pairwise preference samples designed to train a reward model that **reasons before judging**: the model generates an evaluation rationale before outputting a preference label (`<answer>[[A]]</answer>` or `<answer>[[B]]</answer>`).

The current release focuses on **Chinese e-commerce customer service** scenarios, comparing responses across dimensions such as empathy, problem resolution, and communication tone.
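To make the output format concrete, here is a minimal sketch of how a generated completion might be parsed for its verdict. The function name `extract_preference` is hypothetical, and taking the last tag when several appear is an assumption (a rationale could quote the tag before the final verdict); the dataset card does not specify either.

```python
import re
from typing import Optional

# Matches the verdict tag format named in the dataset card.
_ANSWER_RE = re.compile(r"<answer>\[\[([AB])\]\]</answer>")


def extract_preference(completion: str) -> Optional[str]:
    """Return "A" or "B" from the verdict tag, or None if no tag is found.

    Hypothetical helper: using the LAST tag is an assumption, not
    something the dataset card prescribes.
    """
    matches = _ANSWER_RE.findall(completion)
    return matches[-1] if matches else None
```

A completion with no well-formed tag yields `None`, which a training loop could treat as a format failure.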
---

## Dataset Structure

### Files

| File | Split | Samples | Description |
|------|-------|---------|-------------|
| `train_zh.jsonl` | Train | 3,000 | Training set, no system prompt |
| `test_zh.jsonl` | Test | 400 | Test set, no system prompt |
| `train_with_sys_zh.jsonl` | Train | 3,000 | Training set with system prompt injected |
| `test_with_sys_zh.jsonl` | Test | 400 | Test set with system prompt injected |

### Data Format

Each record contains:

```json
{
  "context_messages": [
    {"role": "system", "content": "...evaluation rubric instructions..."},
    {"role": "user", "content": "[客户问题]...[客服A]...[客服B]..."}
  ],
  "winner": "model_a | model_b"
}
```

- `context_messages`: Follows the standard chat template format, compatible with HuggingFace `apply_chat_template`
- `winner`: Ground truth preference label (`model_a` or `model_b`)

---

## Construction

### Scenarios

15 e-commerce customer service categories including: logistics delays, quality complaints, returns & refunds, payment issues, account problems, order cancellations, and more.

### Bias Mitigation

- **Position bias**: A/B responses are randomly swapped (50% probability) with labels updated accordingly
- **Length bias**: Length strategies are randomized so the better response is not always longer

### Label Distribution

- `model_a`: ~48.5%
- `model_b`: ~51.5%

---

## Intended Use

This dataset is designed for:

- GRPO / PPO reinforcement learning fine-tuning of LLMs as reward models
- Preference modeling and pairwise ranking tasks
- Research on reasoning-augmented reward models (ReasRM)

### Training Framework

Compatible with [veRL](https://github.com/volcengine/verl) + vLLM rollout pipeline.
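The position-bias mitigation described under Bias Mitigation can be sketched as follows. This is an illustration only, not the author's construction code: the function `swap_positions` and the intermediate keys `question`, `response_a`, and `response_b` are hypothetical, standing in for whatever pre-formatting representation was used before records were rendered into `context_messages`. Only the `winner` values (`model_a`, `model_b`) and the 50% swap probability come from the dataset card.

```python
import random


def swap_positions(sample: dict, rng: random.Random, p: float = 0.5) -> dict:
    """With probability p, exchange responses A and B and flip the label.

    Hypothetical sketch: the keys "question", "response_a" and
    "response_b" are assumed names, not fields of the released files.
    """
    if rng.random() >= p:
        # No swap: return an unchanged copy.
        return dict(sample)
    flipped = {"model_a": "model_b", "model_b": "model_a"}
    return {
        "question": sample["question"],
        "response_a": sample["response_b"],
        "response_b": sample["response_a"],
        "winner": flipped[sample["winner"]],
    }
```

Passing a seeded `random.Random` instance keeps the swap reproducible, and flipping the label together with the responses is what keeps the ground truth consistent after the swap.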
---

## Limitations

- Current release is **single-domain** (customer service only); cross-domain generalization is not guaranteed
- Labels are generated by a single LLM teacher model, which may introduce systematic biases
- No hard negatives (cases where both responses are similarly good/bad) in current version

---

## Roadmap

- [ ] Add financial domain preference data (RAG Q&A evaluation)
- [ ] Add English version
- [ ] Add hard negative samples
- [ ] Add multi-turn conversation samples

---

## Citation

If you use this dataset, please cite:

```bibtex
@misc{wang2026finrag-grpo,
  author = {Chaoyu Wang},
  title = {FinRAG-GRPO Preference Dataset},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/SamWang0405/FinRAG-GRPO}
}
```

---

## License

MIT License

