ZYao720/WebArbiter-Data

Name: ZYao720/WebArbiter-Data
Creator: ZYao720
Published: 2026-04-09 18:19:41
License: MIT

Source: Hugging Face. Updated: 2026-04-09. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/ZYao720/WebArbiter-Data

Description:

---
language:
- en
license: mit
size_categories:
- 10K<n<100K
task_categories:
- text-generation
tags:
- web-agent
- process-reward-model
- preference
- sft
- rlhf
- grpo
- reward-model
- web-navigation
- reasoning-distillation
pretty_name: WebArbiter Training Data
dataset_info:
- config_name: sft
  features:
  - name: conversation
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: train
    num_examples: 9642
- config_name: rl
  features:
  - name: context_messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: winner
    dtype: string
  splits:
  - name: train
    num_examples: 18921
configs:
- config_name: sft
  data_files:
  - split: train
    path: sft/*
- config_name: rl
  data_files:
  - split: train
    path: rl/*
---

<div align="center">

# WebArbiter Training Data

**Two-stage training data for the WebArbiter process reward model**

**Published at ICLR 2026**

[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

</div>

## Overview

This repository contains the training data for **WebArbiter**, a principle-guided reasoning Process Reward Model (PRM) for web agents. We build on the [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (Chae et al., 2025), which comprises ~30k step-level preference pairs drawn from the Mind2Web environment.

WebArbiter is trained via a two-stage pipeline:

1. **Stage 1: Reasoning Distillation (SFT)**: 9,642 teacher-generated structured justifications (distilled from o3) train the model to produce principle-guided reasoning before emitting a preference verdict.
2. **Stage 2: RL with Verifiable Rewards (RLVR)**: 18,921 preference pairs are used with Group Relative Policy Optimization (GRPO) to correct teacher biases by directly aligning verdicts with ground-truth correctness via binary verifiable rewards R ∈ {−1, +1}.

```
WebPRM Collection (~30k preference pairs from Mind2Web)
├── D_SFT → config: sft (9,642 examples)  → Stage 1: Reasoning Distillation
└── D_RL  → config: rl  (18,921 examples) → Stage 2: RLVR with GRPO
```

## Dataset Configs

### `sft`: Reasoning Distillation (D_SFT)

| Property | Value |
|----------|-------|
| Examples | 9,642 |
| Format | Multi-turn conversation (`system` / `user` / `assistant`) |
| Source | Structured justifications distilled from o3 |

Each conversation contains:

- **System**: Task instructions for pairwise evaluation of web agent actions
- **User**: Web state context including intent, AXTree (accessibility tree), trajectory history, start/current URLs, and two candidate responses (each with THOUGHT and ACTION)
- **Assistant**: Structured output with `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` tags

**Example structure:**

```json
{
  "conversation": [
    {
      "role": "system",
      "content": "You are a skilled expert at evaluating assistant responses..."
    },
    {
      "role": "user",
      "content": "#### Intent ####\nFind the Direction & Parking information...\n#### AXTREE ####\n...\n#### Trajectory ####\n...\n#### Assistant Responses ####\n[The Begin of Response 1]\nTHOUGHT: ...\nACTION: ...\n[The End of Response 1]\n..."
    },
    {
      "role": "assistant",
      "content": "<State>\nThe user is on the Chicago \"Food & Drink\" page...\n</State>\n<Criteria>\n1. Goal alignment (weight 0.5)...\n</Criteria>\n<Analysis>\nResponse 1 directly navigates...\n</Analysis>\n<Answer>Response 1</Answer>"
    }
  ]
}
```

### `rl`: Reinforcement Learning (D_RL)

| Property | Value |
|----------|-------|
| Examples | 18,921 |
| Format | Parquet with conversation context and ground-truth winner |
| Reward | Binary verifiable reward: `+1` if verdict matches ground truth, `−1` otherwise |
| Algorithm | Group Relative Policy Optimization (GRPO) |

Each example contains:

- **`context_messages`**: A list of messages (`system` + `user`) providing the same pairwise evaluation prompt as the SFT config (intent, AXTree, trajectory, two candidate responses), but **without** the assistant's response; the model must generate its own justification and verdict during RL rollouts.
- **`winner`**: Ground-truth label indicating which response is better (`"response_1"` or `"response_2"`).

The winner labels are nearly balanced (9,486 `response_1` vs. 9,435 `response_2`).

## Usage

```python
from datasets import load_dataset

# Load SFT data (Stage 1: Reasoning Distillation)
sft_data = load_dataset("ZYao720/WebArbiter-Data", "sft", split="train")
print(len(sft_data))  # 9642
print(sft_data[0]["conversation"][0]["role"])  # "system"

# Load RL data (Stage 2: RLVR with GRPO)
rl_data = load_dataset("ZYao720/WebArbiter-Data", "rl", split="train")
print(len(rl_data))  # 18921
print(rl_data[0]["winner"])  # "response_1" or "response_2"
```

## Training Details

| | Stage 1 (SFT) | Stage 2 (RLVR) |
|---|---|---|
| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
| Method | Reasoning distillation (SFT) | GRPO with binary verifiable rewards |
| Teacher | o3 | n/a |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
| Fine-tuning | LoRA | FSDP + LoRA |

See the [paper](https://arxiv.org/abs/2601.21872) (Appendix C) for full hyperparameter details.
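The binary verifiable reward described for the `rl` config (`+1` if the verdict matches the ground-truth `winner`, `−1` otherwise) can be illustrated with a minimal sketch. This is not the authors' reward implementation: the helper names `extract_verdict` and `binary_reward` are hypothetical, the parsing assumes the verdict appears as `<Answer>Response 1</Answer>` or `<Answer>Response 2</Answer>` as in the SFT example, and taking the last `<Answer>` tag when several appear is an assumption made here for illustration.

```python
import re
from typing import Optional

# Assumed verdict format, taken from the SFT example: <Answer>Response 1</Answer>
ANSWER_RE = re.compile(r"<Answer>\s*Response\s*([12])\s*</Answer>", re.IGNORECASE)


def extract_verdict(completion: str) -> Optional[str]:
    """Map the model's final <Answer> tag to the `winner` label format.

    Returns "response_1" or "response_2", or None if no verdict is found.
    Using the last tag is an assumption of this sketch.
    """
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return f"response_{matches[-1]}"


def binary_reward(completion: str, winner: str) -> int:
    """Return +1 if the parsed verdict equals the ground-truth winner, else -1.

    A completion with no parseable verdict falls under "otherwise" and gets -1.
    """
    return 1 if extract_verdict(completion) == winner else -1


completion = (
    "<State>...</State>\n<Criteria>...</Criteria>\n"
    "<Analysis>...</Analysis>\n<Answer>Response 1</Answer>"
)
print(binary_reward(completion, "response_1"))    # 1
print(binary_reward(completion, "response_2"))    # -1
print(binary_reward("no verdict", "response_1"))  # -1
```

A function of this shape could be wired into an RL framework as a rule-based reward, with `winner` read from each `rl` example; the exact integration depends on the framework and is not specified here.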
## Related Resources

| Resource | Link |
|----------|------|
| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
| WebArbiter-8B-Qwen3 (model) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
| WebArbiter-7B (model) | [ZYao720/WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B) |
| WebArbiter-4B-Qwen3 (model) | [ZYao720/WebArbiter-4B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-4B-Qwen3) |
| WebArbiter-3B (model) | [ZYao720/WebArbiter-3B](https://huggingface.co/ZYao720/WebArbiter-3B) |
| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |

## License

Released under the [MIT License](https://opensource.org/licenses/MIT). The training data is derived from the following sources:

| Source Dataset | License |
|---------------|---------|
| [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (Chae et al., 2025) | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| [Mind2Web](https://github.com/OSU-NLP-Group/Mind2Web) (underlying environment) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) |

## Citation

```bibtex
@misc{zhang2026ZYao720principleguidedreasoningprocess,
  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
  year={2026},
  eprint={2601.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.21872},
}
```
