McGill-NLP/A3-Synth
收藏Hugging Face2026-04-10 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/McGill-NLP/A3-Synth
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
size_categories:
- 10K<n<100K
tags:
- agents
- web
- synthetic
- sft
task_categories:
- text-generation
---
<div align="center">
# A3-Synth
| [**💾 Code**](https://github.com/McGill-NLP/agent-as-annotators) | [**📄 Paper**](https://arxiv.org/abs/2604.07776) | [**🌐 Website**](https://agent-as-annotators.github.io) |
| :--: | :--: | :--: |
| [**🤗 Dataset**](https://huggingface.co/datasets/McGill-NLP/A3-Synth) | [**🤖 Models**](https://huggingface.co/collections/McGill-NLP/a3-agent-as-annotators-69d854ab5b1993b10efc3fba) | [**📦 PyPI**](https://pypi.org/project/agent-as-annotators/) |
[**Structured Distillation of Web Agent Capabilities Enables Generalization**](https://arxiv.org/abs/2604.07776)
*Xing Han Lù, Siva Reddy*
</div>
A3-Synth is a synthetic training dataset for web agents, generated using the Agent-as-Annotators (A3) framework. It contains ~16k SFT training examples produced by Gemini 3 Pro acting as the Annotator across 3,000 tasks on 6 WebArena environments.
## Dataset Structure
```
A3-Synth/
training/
train.jsonl # 16k SFT examples (conversations with screenshots)
tasks/
{site}-0.tasks.json # Task configs for 6 WebArena sites
personas/
personas.json # 250 generated personas
raw/
websynth.*.json # 2,999 full trajectory JSONs
trajectories/
cleaned/screenshots/ # Step-by-step screenshots referenced by train.jsonl
```
## Loading the Training Data
```python
import json
with open("training/train.jsonl") as f:
examples = [json.loads(line) for line in f]
# Each example is a list of messages: [system, user, assistant, user, assistant, ...]
# User messages contain text + image references (screenshot file paths)
# Assistant messages contain the agent's reasoning and actions
```
## Sites
| Site | Description |
|------|-------------|
| shopping | E-commerce (OneStopShop) |
| shopping_admin | E-commerce admin panel |
| reddit | Forum (Reddit-like) |
| gitlab | Code hosting (GitLab) |
| wikipedia | Encyclopedia |
| map | OpenStreetMap |
提供机构:
McGill-NLP



