
SamWang0405/FinRAG-GRPO

Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/SamWang0405/FinRAG-GRPO
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- zh
pretty_name: FinRAG-GRPO Preference Dataset
size_categories:
- 1K<n<10K
tags:
- reward-model
- rlhf
- grpo
- preference-data
- customer-service
- reasoning
- synthetic
configs:
- config_name: train_zh
  data_files:
  - split: train
    path: datasets/train_zh.jsonl
- config_name: test_zh
  data_files:
  - split: test
    path: datasets/test_zh.jsonl
- config_name: train_with_sys_zh
  data_files:
  - split: train
    path: datasets/train_with_sys_zh.jsonl
- config_name: test_with_sys_zh
  data_files:
  - split: test
    path: datasets/test_with_sys_zh.jsonl
---

# FinRAG-GRPO Preference Dataset

A Chinese-language preference dataset for training **Reasoning Reward Models (ReasRM)** via GRPO-based reinforcement learning.

> 🚧 This dataset is actively maintained and will be expanded with additional domains and languages over time.

---

## Dataset Summary

This dataset contains pairwise preference samples designed to train a reward model that **reasons before judging**: the model generates an evaluation rationale before outputting a preference label (`<answer>[[A]]</answer>` or `<answer>[[B]]</answer>`).

The current release focuses on **Chinese e-commerce customer service** scenarios, comparing responses across dimensions such as empathy, problem resolution, and communication tone.

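
The label format described above can be extracted from a model's output with a small parser. The following is a minimal sketch, not part of the dataset or any official tooling: the function name `parse_preference` is hypothetical, and two behaviors are assumptions made here for illustration, namely tolerating whitespace inside the tag and taking the last `<answer>` tag when the rationale happens to contain more than one.

```python
import re
from typing import Optional

# Pattern for the verdict tag named in the dataset summary.
# Allowing whitespace around [[A]] / [[B]] is an assumption, not a documented rule.
_ANSWER_RE = re.compile(r"<answer>\s*\[\[([AB])\]\]\s*</answer>")


def parse_preference(output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last <answer> tag, or None if no tag is found.

    Hypothetical helper: taking the last match assumes the final verdict
    follows the rationale, as the summary describes.
    """
    matches = _ANSWER_RE.findall(output)
    return matches[-1] if matches else None
```

Returning `None` for malformed outputs lets the caller decide how to penalize a missing verdict rather than silently mapping it to one side.
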

---

## Dataset Structure

### Files

| File | Split | Size | Description |
|------|-------|------|-------------|
| `train_zh.jsonl` | Train | 3,000 | Training set, no system prompt |
| `test_zh.jsonl` | Test | 400 | Test set, no system prompt |
| `train_with_sys_zh.jsonl` | Train | 3,000 | Training set with system prompt injected |
| `test_with_sys_zh.jsonl` | Test | 400 | Test set with system prompt injected |

### Data Format

Each record contains:

```json
{
  "context_messages": [
    {"role": "system", "content": "...evaluation rubric instructions..."},
    {"role": "user", "content": "[客户问题]...[客服A]...[客服B]..."}
  ],
  "winner": "model_a | model_b"
}
```

- `context_messages`: Follows standard chat template format, compatible with HuggingFace `apply_chat_template`
- `winner`: Ground truth preference label (`model_a` or `model_b`)

---

## Construction

### Scenarios

15 e-commerce customer service categories including: logistics delays, quality complaints, returns & refunds, payment issues, account problems, order cancellations, and more.

### Bias Mitigation

- **Position bias**: A/B responses are randomly swapped (50% probability) with labels updated accordingly
- **Length bias**: Length strategies are randomized so the better response is not always longer

### Label Distribution

- `model_a`: ~48.5%
- `model_b`: ~51.5%

---

## Intended Use

This dataset is designed for:

- GRPO / PPO reinforcement learning fine-tuning of LLMs as reward models
- Preference modeling and pairwise ranking tasks
- Research on reasoning-augmented reward models (ReasRM)

### Training Framework

Compatible with [veRL](https://github.com/volcengine/verl) + vLLM rollout pipeline.

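
The position-bias mitigation listed under Construction (swap A/B with 50% probability and update the label) can be sketched as follows. This is an illustration under stated assumptions, not the author's construction script: the function name `maybe_swap`, its parameters, and the idea of swapping before the user message is assembled are all hypothetical. Only the `model_a` / `model_b` label values and the 50% probability come from the card.

```python
import random
from typing import Tuple


def maybe_swap(
    response_a: str,
    response_b: str,
    winner: str,
    rng: random.Random,
    p_swap: float = 0.5,  # 50% as stated in the card
) -> Tuple[str, str, str]:
    """Swap the two responses with probability p_swap and flip the label to match.

    Hypothetical helper: the real pipeline's inputs and structure are not
    documented in the card.
    """
    if winner not in ("model_a", "model_b"):
        raise ValueError(f"unexpected winner label: {winner!r}")
    if rng.random() < p_swap:
        # The response that was in slot A is now in slot B, so the label flips.
        flipped = "model_b" if winner == "model_a" else "model_a"
        return response_b, response_a, flipped
    return response_a, response_b, winner
```

Passing an explicit `random.Random` instance keeps the swap reproducible across runs. With an unbiased swap the two labels should each land near 50%, which is in line with the reported ~48.5% / ~51.5% split.
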

---

## Limitations

- Current release is **single-domain** (customer service only); cross-domain generalization is not guaranteed
- Labels are generated by a single LLM teacher model, which may introduce systematic biases
- No hard negatives (cases where both responses are similarly good/bad) in current version

---

## Roadmap

- [ ] Add financial domain preference data (RAG Q&A evaluation)
- [ ] Add English version
- [ ] Add hard negative samples
- [ ] Add multi-turn conversation samples

---

## Citation

If you use this dataset, please cite:

```bibtex
@misc{wang2026finrag-grpo,
  author = {Chaoyu Wang},
  title = {FinRAG-GRPO Preference Dataset},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/SamWang0405/FinRAG-GRPO}
}
```

---

## License

MIT License

Provided by:
SamWang0405