saital/browser-agent-phase1-sft-action-only
收藏Hugging Face2026-03-19 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/saital/browser-agent-phase1-sft-action-only
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
license: mit
task_categories:
- text-generation
tags:
- browser-agents
- browsergym
- miniwob
- synthetic-data
- sft
- imitation-learning
size_categories:
- 1K<n<10K
pretty_name: Browser Agent Phase 1 SFT Action-Only
---
# Browser Agent Phase 1 SFT Action-Only
## What this is
Action-only step-level chat SFT data for browser-agent training.
Each example teaches the model to predict the next BrowserGym action from:
- the original generation-time system prompt used for data collection
- task goal and URL
- short recent history
- current observation text and diagnostics
Assistant targets contain only the next action.
## Why this format
This is the primary training format for small-model SFT because it is cleaner than reasoning-heavy supervision and better aligned with next-action prediction.
## Collection details
This dataset contains step-level browser-agent trajectories exported from the browser-agent research project.
Source:
- BrowserGym / MiniWoB tasks
- teacher: local Qwen3.5-9B served with vLLM on an RTX 4090
- collection setup: repeated seed-offset production runs over a curated 30-task production subset
Prompting note:
- the export reuses the original generation-time teacher system prompt from each rollout's `resolved_config.yaml`
- the action-only variant appends a short final instruction to output only the action
- the reasoning+action variant appends a short final instruction to reason first, then output the action
Export policy:
- successful episodes only
- max action errors: 0
- max repeated loops: 0
- max sparse observations: 2
- max root-only observations: 0
- max fallback count: 0
- split by run ID
Corpus counts:
- episodes seen: 4200
- episodes kept: 3415
- train rows: 6508
- validation rows: 240
Fields:
- `messages`: chat-format training conversation
- `metadata`: task, episode, run, seed, step index, teacher model, fallback flag
## Limitations
- synthetic web-task distribution rather than open-web browsing
- filtered for clean successful trajectories, so it under-represents recovery behavior
- optimized for a narrow research setup, not broad benchmark claims
提供机构:
saital



