saital/browser-agent-phase1-sft-reasoning-action
收藏Hugging Face2026-03-19 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/saital/browser-agent-phase1-sft-reasoning-action
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
license: mit
task_categories:
- text-generation
tags:
- browser-agents
- browsergym
- miniwob
- synthetic-data
- sft
- chain-of-thought
- imitation-learning
size_categories:
- 1K<n<10K
pretty_name: Browser Agent Phase 1 SFT Reasoning+Action
---
# Browser Agent Phase 1 SFT Reasoning+Action
## What this is
Reasoning-plus-action step-level chat SFT data for browser-agent training.
Each example uses the original generation-time system prompt, then appends a short instruction to reason first and output the final action.
Assistant targets contain:
- one `<think>...</think>` block
- then one BrowserGym action
## Why this format
This is an experimental variant for comparing whether explicit reasoning supervision helps or hurts small browser-use models relative to action-only training.
## Collection details
This dataset contains step-level browser-agent trajectories exported from the browser-agent research project.
Source:
- BrowserGym / MiniWoB tasks
- teacher: local Qwen3.5-9B served with vLLM on an RTX 4090
- collection setup: repeated seed-offset production runs over a curated 30-task production subset
Prompting note:
- the export reuses the original generation-time teacher system prompt from each rollout's `resolved_config.yaml`
- the action-only variant appends a short final instruction to output only the action
- the reasoning+action variant appends a short final instruction to reason first, then output the action
Export policy:
- successful episodes only
- max action errors: 0
- max repeated loops: 0
- max sparse observations: 2
- max root-only observations: 0
- max fallback count: 0
- split by run ID
Corpus counts:
- episodes seen: 4200
- episodes kept: 3415
- train rows: 6508
- validation rows: 240
Fields:
- `messages`: chat-format training conversation
- `metadata`: task, episode, run, seed, step index, teacher model, fallback flag
## Limitations
- teacher reasoning can be noisy
- longer targets may reduce training efficiency for small models
- synthetic web-task distribution rather than open-web browsing
提供机构:
saital



